The StratifiedKFold class template generates stratified k-fold cross-validation splits for classification problems. It partitions sample indices into k folds while preserving the class distribution in each fold, which is critical for imbalanced datasets.
Template parameters
Label
typename
default:"std::size_t"
Integer-like class label type (e.g., int, std::size_t, enum). Must be convertible to std::size_t.
Type aliases
using Indices = std::vector<std::size_t>;
using Split = std::pair<Indices, Indices>; // {train indices, val indices}
Constructor
explicit StratifiedKFold(std::size_t n_splits = 5,
bool shuffle = false,
std::size_t seed = 0);
Creates a stratified k-fold splitter.
n_splits
Number of folds k. Must be at least 2.
shuffle
Whether to shuffle samples within each class before assigning them to folds. If false, assignment is deterministic.
seed
Random seed used when shuffle = true. Ignored if shuffle = false.
Methods
split
[[nodiscard]] std::vector<Split>
split(const std::vector<Label>& labels) const;
Generates all k train/validation index splits.
labels
const std::vector<Label>&
Class label for each sample, length n_samples. Labels can be any integer-like values.
Vector of k {train_indices, val_indices} pairs. Indices refer to positions in the input labels vector.
The stratification strategy:
- Groups indices by class label
- Within each class, shuffles (if shuffle = true), then assigns indices round-robin across folds
- Each split returns the union of all other folds as the training set and the held-out fold as the validation set
This guarantees that for a class with m samples, each fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples, minimizing imbalance.
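The steps above can be sketched as a small standalone function. This is an illustrative reimplementation of the deterministic case (shuffle = false), not the library's actual code; the function name stratified_splits and the use of plain int labels are assumptions for the sketch.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Indices = std::vector<std::size_t>;
using Split = std::pair<Indices, Indices>;

// Illustrative sketch of the stratification strategy (deterministic case).
std::vector<Split> stratified_splits(const std::vector<int>& labels,
                                     std::size_t k) {
    // 1. Group sample indices by class label.
    std::map<int, Indices> by_class;
    for (std::size_t i = 0; i < labels.size(); ++i)
        by_class[labels[i]].push_back(i);

    // 2. Round-robin: the j-th index of each class goes to fold j % k.
    std::vector<Indices> folds(k);
    for (const auto& [label, idx] : by_class)
        for (std::size_t j = 0; j < idx.size(); ++j)
            folds[j % k].push_back(idx[j]);

    // 3. Each split holds out one fold as validation; the union of the
    //    remaining folds forms the training set.
    std::vector<Split> splits;
    for (std::size_t held = 0; held < k; ++held) {
        Indices train;
        Indices val = folds[held];
        for (std::size_t f = 0; f < k; ++f)
            if (f != held)
                train.insert(train.end(), folds[f].begin(), folds[f].end());
        splits.emplace_back(std::move(train), std::move(val));
    }
    return splits;
}
```

Because the assignment within each class is round-robin, every sample lands in exactly one fold, so train and val in each split partition the full index range.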
n_classes
[[nodiscard]] std::size_t n_classes() const noexcept;
Number of unique classes found in the last call to split(). Returns 0 if split() has not been called.
n_splits
[[nodiscard]] std::size_t n_splits() const noexcept;
Number of folds k, as specified at construction.
Example usage
Basic usage
#include <mlpp/model_validation/stratified_kfold.hpp>
#include <iostream>
#include <vector>
using namespace mlpp::model_validation;
// Sample labels (imbalanced dataset)
std::vector<int> labels = {0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2};
// Create 3-fold stratified splitter
StratifiedKFold<int> kfold(3, true, 42);
// Generate splits
auto splits = kfold.split(labels);
std::cout << "Number of classes: " << kfold.n_classes() << std::endl;
std::cout << "Number of folds: " << kfold.n_splits() << std::endl;
// Iterate through folds
for (size_t fold = 0; fold < splits.size(); ++fold) {
const auto& [train_idx, val_idx] = splits[fold];
std::cout << "\nFold " << fold << ":\n";
std::cout << " Train size: " << train_idx.size() << std::endl;
std::cout << " Val size: " << val_idx.size() << std::endl;
// Access training samples
for (size_t i : train_idx) {
// Use labels[i] and corresponding features
}
// Access validation samples
for (size_t i : val_idx) {
// Use labels[i] and corresponding features
}
}
Cross-validation loop
#include <mlpp/model_validation/stratified_kfold.hpp>
#include <iostream>
#include <numeric>
#include <vector>
using namespace mlpp::model_validation;
std::vector<int> labels = /* ... */;
std::vector<std::vector<double>> features = /* ... */;
// Create 5-fold stratified CV
StratifiedKFold<int> kfold(5, true, 12345);
auto splits = kfold.split(labels);
std::vector<double> fold_scores;
for (size_t fold = 0; fold < splits.size(); ++fold) {
const auto& [train_idx, val_idx] = splits[fold];
// Prepare train/val data
std::vector<std::vector<double>> X_train, X_val;
std::vector<int> y_train, y_val;
for (size_t i : train_idx) {
X_train.push_back(features[i]);
y_train.push_back(labels[i]);
}
for (size_t i : val_idx) {
X_val.push_back(features[i]);
y_val.push_back(labels[i]);
}
// Train model (MyModel is a placeholder for any classifier
// providing fit/score methods)
MyModel model;
model.fit(X_train, y_train);
// Evaluate
double score = model.score(X_val, y_val);
fold_scores.push_back(score);
std::cout << "Fold " << fold << " score: " << score << std::endl;
}
// Compute mean CV score
double mean_score = std::accumulate(fold_scores.begin(),
fold_scores.end(), 0.0) / fold_scores.size();
std::cout << "Mean CV score: " << mean_score << std::endl;
Deterministic splits
#include <mlpp/model_validation/stratified_kfold.hpp>
using namespace mlpp::model_validation;
// Create splitter without shuffling for reproducible splits
StratifiedKFold<std::size_t> kfold(4, false); // deterministic
std::vector<std::size_t> labels = {0, 0, 1, 1, 2, 2, 3, 3};
auto splits = kfold.split(labels);
// Splits will be identical across runs
Properties
Stratification guarantee: For a class with m samples and k folds, each fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples from that class. This is the minimum possible imbalance.
Round-robin assignment: Within each class, samples are assigned to folds in round-robin fashion: fold 0 receives the class's indices at positions 0, k, 2k, …; fold 1 those at positions 1, k+1, 2k+1, …; and so on.
Class preservation: The proportion of each class in every fold approximates the overall class distribution in the dataset.
No overlap: Training and validation sets for each fold are disjoint. Each sample appears in exactly one validation set across all folds.
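The stratification guarantee can be checked numerically. The sketch below reimplements the deterministic round-robin assignment independently of the library (the helper name check_stratification is an assumption) and verifies that every fold receives either ⌊m/k⌋ or ⌈m/k⌉ samples from each class of size m.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Returns true if a deterministic round-robin assignment of the given
// labels into k folds satisfies the floor/ceil guarantee for every class.
// Illustrative check, not part of the library API.
bool check_stratification(const std::vector<int>& labels, std::size_t k) {
    // Group sample indices by class label.
    std::map<int, std::vector<std::size_t>> by_class;
    for (std::size_t i = 0; i < labels.size(); ++i)
        by_class[labels[i]].push_back(i);

    // Count how many samples of each class land in each fold.
    std::vector<std::map<int, std::size_t>> per_fold(k);
    for (const auto& [label, idx] : by_class)
        for (std::size_t j = 0; j < idx.size(); ++j)
            ++per_fold[j % k][label];

    // Every fold must hold floor(m/k) or ceil(m/k) samples of each class.
    for (const auto& [label, idx] : by_class) {
        const std::size_t m = idx.size();
        for (std::size_t f = 0; f < k; ++f) {
            const std::size_t c = per_fold[f][label];
            if (c != m / k && c != (m + k - 1) / k)
                return false;
        }
    }
    return true;
}
```

For the imbalanced example above (class sizes 4, 3, and 5 with k = 3), the per-class fold counts are {2, 1, 1}, {1, 1, 1}, and {2, 2, 1}, all within the floor/ceil bounds.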