Pipeline stages
The training pipeline consists of four sequential stages:
Pretraining
Train the language model from random initialization on large text corpora (WikiText-2, OpenWebText, TinyStories, Wikipedia). The model learns basic language understanding and generation capabilities through next-token prediction.
Supervised fine-tuning (SFT)
Fine-tune the pretrained model on instruction-following datasets (Alpaca, Dolly, OpenOrca). The model learns to follow instructions and respond to user queries in a conversational format.
Direct preference optimization (DPO)
Further align the SFT model using preference data (Anthropic HH-RLHF). The model learns to prefer chosen responses over rejected ones, improving alignment with human preferences without requiring a reward model.
Verifier training
Train a verifier model that judges answer correctness. This final stage completes the four-stage pipeline.
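The DPO objective can be sketched per preference pair in plain Python. This is a minimal sketch, not the project's implementation: the function name, the β default, and the use of summed sequence log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are summed log-probabilities of the chosen/rejected responses under
    the policy being trained and under a frozen reference model (the SFT
    checkpoint). No reward model is needed: the implicit reward is the
    log-probability ratio between policy and reference.
    """
    # Implicit rewards: how much the policy up-weights each response vs. the reference.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Logistic loss -log(sigmoid(margin)) pushes the chosen reward above the rejected one.
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))
```

When policy and reference agree, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss falls toward zero.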
Quick start
Run the full pipeline
The simplest way to run the entire pipeline is with the unified runner script.
Run individual stages
You can also run stages individually.
Config presets
The pipeline comes with four built-in configuration presets optimized for different hardware and time constraints:
local-smoke
Quick smoke test (~5 minutes) for validation:
- Model: d=256, L=4, H=4 (~10M params)
- Steps: 100 pretrain / 50 SFT / 50 DPO / 50 verifier
- Hardware: Works on CPU or any GPU
- Use case: CI/CD testing, quick validation
local
Full training for consumer GPUs (RTX 3060) running ~24 hours:
- Model: d=768, L=12, H=12 (~117M params)
- Steps: 20K pretrain / 5K SFT / 2K DPO / 3K verifier
- Hardware: RTX 3060 or better (12GB VRAM)
- Use case: Research experiments, local development
gpu-smoke
Quick GPU test (~10 minutes):
- Model: d=256, L=4, H=4 (~10M params)
- Steps: 100 pretrain / 50 SFT / 50 DPO / 50 verifier
- Hardware: Any modern GPU
- Use case: Testing distributed training, GPU cluster validation
gpu
High-quality training for datacenter GPUs (A100/H100) running ~48 hours:
- Model: d=1024, L=12, H=16 (~350M params)
- Steps: 80K pretrain / 10K SFT / 3K DPO / 3K verifier
- Datasets: Wikipedia + OpenWebText + WikiText-103 + TinyStories (100K)
- Hardware: A100/H100 with 40-80GB VRAM
- Use case: Production models, benchmark results
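For reference, the presets above can be summarized as a lookup table. This is an illustrative sketch only: the key and field names are assumptions, not the project's actual config schema.

```python
# Summary of the four built-in presets (field names are illustrative).
PRESETS = {
    "local-smoke": {"d_model": 256,  "n_layers": 4,  "n_heads": 4,
                    "pretrain_steps": 100,    "sft_steps": 50,     "dpo_steps": 50,    "verifier_steps": 50},
    "local":       {"d_model": 768,  "n_layers": 12, "n_heads": 12,
                    "pretrain_steps": 20_000, "sft_steps": 5_000,  "dpo_steps": 2_000, "verifier_steps": 3_000},
    "gpu-smoke":   {"d_model": 256,  "n_layers": 4,  "n_heads": 4,
                    "pretrain_steps": 100,    "sft_steps": 50,     "dpo_steps": 50,    "verifier_steps": 50},
    "gpu":         {"d_model": 1024, "n_layers": 12, "n_heads": 16,
                    "pretrain_steps": 80_000, "sft_steps": 10_000, "dpo_steps": 3_000, "verifier_steps": 3_000},
}

def get_preset(name):
    # Fail fast on unknown preset names rather than silently defaulting.
    if name not in PRESETS:
        raise KeyError(f"unknown preset {name!r}; choose from {sorted(PRESETS)}")
    return PRESETS[name]
```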
Configuration
Override hyperparameters
You can override specific hyperparameters via command-line arguments.
Custom config files
For more control, create a custom JSON config file:
configs/custom.json
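Such a file might look roughly like the following, here mirroring the local preset. The field names are assumptions for illustration; check the project's actual schema before use.

```json
{
  "model": { "d_model": 768, "n_layers": 12, "n_heads": 12 },
  "train": {
    "pretrain_steps": 20000,
    "sft_steps": 5000,
    "dpo_steps": 2000,
    "verifier_steps": 3000
  }
}
```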
Output structure
The pipeline writes its outputs into a structured directory tree. Each checkpoint (.pt file) contains:
- model_state: Model weights
- optimizer_state: Optimizer state for resumption
- config: Model architecture config
- step: Training step number
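The checkpoint layout can be sketched as a plain dictionary. The real files are saved with torch.save; the values below are placeholders, and the helper is a hypothetical illustration of why all four fields matter for resumption.

```python
# Sketch of the checkpoint layout described above (placeholder values).
checkpoint = {
    "model_state": {"embed.weight": "...tensor..."},        # model weights
    "optimizer_state": {"state": {}, "param_groups": []},   # optimizer state for resumption
    "config": {"d_model": 256, "n_layers": 4, "n_heads": 4},  # architecture config
    "step": 100,                                            # training step number
}

def can_resume(ckpt):
    # Resuming training needs all four fields; inference alone needs only
    # model_state and config.
    required = {"model_state", "optimizer_state", "config", "step"}
    return required <= ckpt.keys()
```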
Pipeline state
When running the full pipeline with --stage all, the runner saves a pipeline_state.json file tracking all checkpoint paths.
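A sketch of what such a state file might hold and how it round-trips. The key names and paths are illustrative assumptions, not the runner's actual format; the point is that each stage's checkpoint path is recorded so later stages (or a resumed run) can find their inputs.

```python
import json
import os
import tempfile

# Hypothetical pipeline_state.json contents: one checkpoint path per stage.
pipeline_state = {
    "pretrain_checkpoint": "runs/pretrain/checkpoint_final.pt",
    "sft_checkpoint": "runs/sft/checkpoint_final.pt",
    "dpo_checkpoint": "runs/dpo/checkpoint_final.pt",
    "verifier_checkpoint": "runs/verifier/checkpoint_final.pt",
}

# Write the state file, then reload it to show the paths round-trip intact.
out_dir = tempfile.mkdtemp()
state_path = os.path.join(out_dir, "pipeline_state.json")
with open(state_path, "w") as f:
    json.dump(pipeline_state, f, indent=2)

with open(state_path) as f:
    reloaded = json.load(f)
```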
Next steps
Pretraining
Learn about the pretraining stage and dataset options
SFT
Understand supervised fine-tuning on instruction data
DPO
Explore preference alignment with DPO
Verifier
Train a verifier for answer correctness