The StratifiedKFold class template generates stratified k-fold cross-validation splits for classification problems. It partitions sample indices into k folds while preserving the class distribution in each fold, which is critical for imbalanced datasets.

Template parameters

Label
typename
default:"std::size_t"
Integer-like class label type (e.g., int, std::size_t, enum). Must be convertible to std::size_t.

Type aliases

using Indices = std::vector<std::size_t>;
using Split   = std::pair<Indices, Indices>;  // {train indices, val indices}

Constructor

explicit StratifiedKFold(std::size_t n_splits = 5,
                         bool        shuffle   = false,
                         std::size_t seed      = 0);
Creates a stratified k-fold splitter.
n_splits
std::size_t
default:"5"
Number of folds k. Must be at least 2.
shuffle
bool
default:"false"
Whether to shuffle samples within each class before assigning to folds. If false, assignment is deterministic.
seed
std::size_t
default:"0"
Random seed used when shuffle = true. Ignored if shuffle = false.

Methods

split

[[nodiscard]] std::vector<Split>
split(const std::vector<Label>& labels) const;
Generates all k train/validation index splits.
labels
const std::vector<Label>&
Class label for each sample, length n_samples. Labels can be any integer-like values.
return
std::vector<Split>
Vector of k {train_indices, val_indices} pairs. Indices refer to positions in the input labels vector.
The stratification strategy:
  1. Groups indices by class label
  2. Within each class, shuffles (if shuffle = true), then assigns indices round-robin across folds
  3. Each split returns the union of all other folds as training set and the held-out fold as validation set
This guarantees that for a class with m samples, each fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples, minimizing imbalance.
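The three steps above can be sketched as a standalone function (a minimal sketch, independent of the mlpp class; the name stratified_folds and its signature are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Sketch of the stratification strategy: group indices by label,
// optionally shuffle within each class, then deal the indices
// round-robin into k folds. Returns the k validation folds.
std::vector<std::vector<std::size_t>>
stratified_folds(const std::vector<int>& labels, std::size_t k,
                 bool shuffle, std::size_t seed) {
    // 1. Group sample indices by class label.
    std::map<int, std::vector<std::size_t>> by_class;
    for (std::size_t i = 0; i < labels.size(); ++i)
        by_class[labels[i]].push_back(i);

    std::mt19937 rng(seed);
    std::vector<std::vector<std::size_t>> folds(k);
    for (auto& [label, idx] : by_class) {
        // 2. Shuffle within the class if requested.
        if (shuffle) std::shuffle(idx.begin(), idx.end(), rng);
        // 3. Deal round-robin: class position j goes to fold j % k.
        for (std::size_t j = 0; j < idx.size(); ++j)
            folds[j % k].push_back(idx[j]);
    }
    return folds;
}
```

Each train set of a split is then the union of the other k-1 folds. Because the indices are dealt one at a time, a class with m samples contributes either ⌊m/k⌋ or ⌈m/k⌉ indices to every fold.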

n_classes

[[nodiscard]] std::size_t n_classes() const noexcept;
return
std::size_t
Number of unique classes found in the last call to split(). Returns 0 if split() has not been called.

n_splits

[[nodiscard]] std::size_t n_splits() const noexcept;
return
std::size_t
Number of folds k.

Example usage

Basic usage

#include <mlpp/model_validation/stratified_kfold.hpp>

#include <iostream>
#include <vector>

using namespace mlpp::model_validation;

// Sample labels (imbalanced dataset)
std::vector<int> labels = {0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2};

// Create 3-fold stratified splitter
StratifiedKFold<int> kfold(3, true, 42);

// Generate splits
auto splits = kfold.split(labels);

std::cout << "Number of classes: " << kfold.n_classes() << std::endl;
std::cout << "Number of folds: " << kfold.n_splits() << std::endl;

// Iterate through folds
for (size_t fold = 0; fold < splits.size(); ++fold) {
    const auto& [train_idx, val_idx] = splits[fold];
    
    std::cout << "\nFold " << fold << ":\n";
    std::cout << "  Train size: " << train_idx.size() << std::endl;
    std::cout << "  Val size: " << val_idx.size() << std::endl;
    
    // Access training samples
    for (size_t i : train_idx) {
        // Use labels[i] and corresponding features
    }
    
    // Access validation samples
    for (size_t i : val_idx) {
        // Use labels[i] and corresponding features
    }
}

Cross-validation loop

#include <mlpp/model_validation/stratified_kfold.hpp>

#include <iostream>
#include <numeric>   // std::accumulate
#include <vector>

using namespace mlpp::model_validation;

std::vector<int> labels = /* ... */;
std::vector<std::vector<double>> features = /* ... */;

// Create 5-fold stratified CV
StratifiedKFold<int> kfold(5, true, 12345);
auto splits = kfold.split(labels);

std::vector<double> fold_scores;

for (size_t fold = 0; fold < splits.size(); ++fold) {
    const auto& [train_idx, val_idx] = splits[fold];
    
    // Prepare train/val data
    std::vector<std::vector<double>> X_train, X_val;
    std::vector<int> y_train, y_val;
    
    for (size_t i : train_idx) {
        X_train.push_back(features[i]);
        y_train.push_back(labels[i]);
    }
    
    for (size_t i : val_idx) {
        X_val.push_back(features[i]);
        y_val.push_back(labels[i]);
    }
    
    // Train model
    MyModel model;
    model.fit(X_train, y_train);
    
    // Evaluate
    double score = model.score(X_val, y_val);
    fold_scores.push_back(score);
    
    std::cout << "Fold " << fold << " score: " << score << std::endl;
}

// Compute mean CV score
double mean_score = std::accumulate(fold_scores.begin(), 
                                   fold_scores.end(), 0.0) / fold_scores.size();
std::cout << "Mean CV score: " << mean_score << std::endl;

Deterministic splits

// Create a splitter without shuffling; fold assignment is fully deterministic
StratifiedKFold<size_t> kfold(4, false);  // deterministic

std::vector<size_t> labels = {0, 0, 1, 1, 2, 2, 3, 3};
auto splits = kfold.split(labels);

// Splits will be identical across runs

Properties

Stratification guarantee: For a class with m samples and k folds, each fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples from that class. This is the minimum possible imbalance.

Round-robin assignment: Within each class, samples are assigned to folds in round-robin fashion. Fold 0 gets class positions 0, k, 2k, …; fold 1 gets positions 1, k+1, 2k+1, …, and so on.

Class preservation: The proportion of each class in every fold approximates the overall class distribution of the dataset.

No overlap: Training and validation sets within each fold are disjoint, and each sample appears in exactly one validation set across all folds.
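The no-overlap and coverage properties can be verified mechanically. Below is a minimal checker (a sketch; check_splits is a hypothetical helper, reusing the Indices/Split aliases defined above) that validates any set of splits:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Indices = std::vector<std::size_t>;
using Split   = std::pair<Indices, Indices>;  // {train, val}

// Returns true if, for every split, train and val are disjoint and
// together cover indices 0..n_samples-1 exactly once, and the val
// sets across all splits partition the samples (each sample is
// held out exactly once).
bool check_splits(const std::vector<Split>& splits, std::size_t n_samples) {
    std::vector<std::size_t> held_out(n_samples, 0);
    for (const auto& [train, val] : splits) {
        std::set<std::size_t> seen(train.begin(), train.end());
        for (std::size_t i : val) {
            if (seen.count(i)) return false;  // overlap within a split
            seen.insert(i);
            ++held_out[i];
        }
        if (seen.size() != n_samples) return false;  // missing or duplicate index
    }
    for (std::size_t c : held_out)
        if (c != 1) return false;  // each sample validated exactly once
    return true;
}
```

Such a check can be run over the output of split() in a unit test to confirm the properties hold for a given label vector.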
