The StratifiedKFold class template generates stratified k-fold cross-validation splits for classification problems. It partitions sample indices into k folds while preserving the class distribution in each fold, which is critical for imbalanced datasets.
Template parameters
Label
typename
default:"std::size_t"
Integer-like class label type (e.g., int, std::size_t, enum). Must be convertible to std::size_t.
Type aliases
using Indices = std::vector<std::size_t>;
using Split = std::pair<Indices, Indices>; // {train indices, val indices}
Constructor
explicit StratifiedKFold(std::size_t n_splits = 5,
bool shuffle = false,
std::size_t seed = 0);
Creates a stratified k-fold splitter.
n_splits
Number of folds k. Must be at least 2.
shuffle
Whether to shuffle samples within each class before assigning them to folds. If false, assignment is deterministic.
seed
Random seed used when shuffle = true. Ignored if shuffle = false.
Methods
split
[[nodiscard]] std::vector<Split>
split(const std::vector<Label>& labels) const;
Generates all k train/validation index splits.
labels
const std::vector<Label>&
Class label for each sample, length n_samples. Labels can be any integer-like values.
Vector of k {train_indices, val_indices} pairs. Indices refer to positions in the input labels vector.
The stratification strategy:
- Groups indices by class label
- Within each class, shuffles (if shuffle = true), then assigns indices round-robin across folds
- Each split returns the union of all other folds as the training set and the held-out fold as the validation set
This guarantees that for a class with m samples, each fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples, minimizing imbalance.
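The steps above can be sketched as a small standalone function. This is an illustrative reimplementation of the deterministic case (shuffle = false), not the library's actual code; the function name stratified_splits and the use of plain int labels are assumptions for the sketch.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Indices = std::vector<std::size_t>;
using Split = std::pair<Indices, Indices>;

// Illustrative sketch of the stratification strategy (deterministic case).
std::vector<Split> stratified_splits(const std::vector<int>& labels,
                                     std::size_t k) {
    // 1. Group sample indices by class label.
    std::map<int, Indices> by_class;
    for (std::size_t i = 0; i < labels.size(); ++i)
        by_class[labels[i]].push_back(i);

    // 2. Round-robin: the j-th index of each class goes to fold j % k.
    std::vector<Indices> folds(k);
    for (const auto& [label, idx] : by_class)
        for (std::size_t j = 0; j < idx.size(); ++j)
            folds[j % k].push_back(idx[j]);

    // 3. Each split holds out one fold as validation; the union of the
    //    remaining folds forms the training set.
    std::vector<Split> splits;
    for (std::size_t held = 0; held < k; ++held) {
        Indices train;
        Indices val = folds[held];
        for (std::size_t f = 0; f < k; ++f)
            if (f != held)
                train.insert(train.end(), folds[f].begin(), folds[f].end());
        splits.emplace_back(std::move(train), std::move(val));
    }
    return splits;
}
```

Because the assignment within each class is round-robin, every sample lands in exactly one fold, so train and val in each split partition the full index range.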
n_classes
[[nodiscard]] std::size_t n_classes() const noexcept;
Number of unique classes found in the last call to split(). Returns 0 if split() has not been called.
n_splits
[[nodiscard]] std::size_t n_splits() const noexcept;
Number of folds k, as specified at construction.
Example usage
Basic usage
#include <mlpp/model_validation/stratified_kfold.hpp>
#include <iostream>
#include <vector>
using namespace mlpp::model_validation;
// Sample labels (imbalanced dataset)
std::vector<int> labels = {0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2};
// Create 3-fold stratified splitter
StratifiedKFold<int> kfold(3, true, 42);
// Generate splits
auto splits = kfold.split(labels);
std::cout << "Number of classes: " << kfold.n_classes() << std::endl;
std::cout << "Number of folds: " << kfold.n_splits() << std::endl;
// Iterate through folds
for (size_t fold = 0; fold < splits.size(); ++fold) {
const auto& [train_idx, val_idx] = splits[fold];
std::cout << "\nFold " << fold << ":\n";
std::cout << " Train size: " << train_idx.size() << std::endl;
std::cout << " Val size: " << val_idx.size() << std::endl;
// Access training samples
for (size_t i : train_idx) {
// Use labels[i] and corresponding features
}
// Access validation samples
for (size_t i : val_idx) {
// Use labels[i] and corresponding features
}
}
Cross-validation loop
#include <mlpp/model_validation/stratified_kfold.hpp>
#include <iostream>
#include <numeric>
#include <vector>
using namespace mlpp::model_validation;
std::vector<int> labels = /* ... */;
std::vector<std::vector<double>> features = /* ... */;
// Create 5-fold stratified CV
StratifiedKFold<int> kfold(5, true, 12345);
auto splits = kfold.split(labels);
std::vector<double> fold_scores;
for (size_t fold = 0; fold < splits.size(); ++fold) {
const auto& [train_idx, val_idx] = splits[fold];
// Prepare train/val data
std::vector<std::vector<double>> X_train, X_val;
std::vector<int> y_train, y_val;
for (size_t i : train_idx) {
X_train.push_back(features[i]);
y_train.push_back(labels[i]);
}
for (size_t i : val_idx) {
X_val.push_back(features[i]);
y_val.push_back(labels[i]);
}
// Train model (MyModel is a placeholder for any classifier
// providing fit/score methods)
MyModel model;
model.fit(X_train, y_train);
// Evaluate
double score = model.score(X_val, y_val);
fold_scores.push_back(score);
std::cout << "Fold " << fold << " score: " << score << std::endl;
}
// Compute mean CV score
double mean_score = std::accumulate(fold_scores.begin(),
fold_scores.end(), 0.0) / fold_scores.size();
std::cout << "Mean CV score: " << mean_score << std::endl;
Deterministic splits
#include <mlpp/model_validation/stratified_kfold.hpp>
using namespace mlpp::model_validation;
// Create splitter without shuffling for reproducible splits
StratifiedKFold<std::size_t> kfold(4, false); // deterministic
std::vector<std::size_t> labels = {0, 0, 1, 1, 2, 2, 3, 3};
auto splits = kfold.split(labels);
// Splits will be identical across runs
Properties
Stratification guarantee: For a class with m samples and k folds, each fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples from that class. This is the minimum possible imbalance.
Round-robin assignment: Within each class, samples are assigned to folds in round-robin fashion: fold 0 receives the class's indices at positions 0, k, 2k, …; fold 1 those at positions 1, k+1, 2k+1, …; and so on.
Class preservation: The proportion of each class in every fold approximates the overall class distribution in the dataset.
No overlap: Training and validation sets for each fold are disjoint. Each sample appears in exactly one validation set across all folds.
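The stratification guarantee can be checked numerically. The sketch below reimplements the deterministic round-robin assignment independently of the library (the helper name check_stratification is an assumption) and verifies that every fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples from each class of size m.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Returns true if a deterministic round-robin assignment of the given
// labels into k folds satisfies the floor/ceil guarantee for every class.
// Illustrative check, not part of the library API.
bool check_stratification(const std::vector<int>& labels, std::size_t k) {
    // Group sample indices by class label.
    std::map<int, std::vector<std::size_t>> by_class;
    for (std::size_t i = 0; i < labels.size(); ++i)
        by_class[labels[i]].push_back(i);

    // Count how many samples of each class land in each fold.
    std::vector<std::map<int, std::size_t>> per_fold(k);
    for (const auto& [label, idx] : by_class)
        for (std::size_t j = 0; j < idx.size(); ++j)
            ++per_fold[j % k][label];

    // Every fold must hold floor(m/k) or ceil(m/k) samples of each class.
    for (const auto& [label, idx] : by_class) {
        const std::size_t m = idx.size();
        for (std::size_t f = 0; f < k; ++f) {
            const std::size_t c = per_fold[f][label];
            if (c != m / k && c != (m + k - 1) / k)
                return false;
        }
    }
    return true;
}
```

For the imbalanced example above (class sizes 4, 3, and 5 with k = 3), the per-class fold counts are {2, 1, 1}, {1, 1, 1}, and {2, 2, 1}, all within the floor/ceil bounds.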