OpenCLIP supports int8 training through the bitsandbytes library. This enables faster training with lower memory usage while maintaining accuracy, which is particularly beneficial for large models like ViT-Huge.
Overview
Int8 training replaces standard linear layers with 8-bit quantized versions that:
- Reduce memory usage for weights and activations
- Accelerate matrix multiplications
- Maintain numerical stability through specialized quantization schemes
- Preserve accuracy with minimal degradation
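The core idea behind these layers can be illustrated in plain Python: symmetric ("absmax") int8 quantization maps each weight row to integer codes in [-127, 127] with one floating-point scale per row, so values round-trip with an error bounded by half the scale. This is an illustrative sketch, not the bitsandbytes implementation:

```python
def quantize_rowwise(row):
    # Symmetric absmax int8 quantization: one fp scale per row,
    # integer codes confined to [-127, 127].
    scale = max(abs(v) for v in row) / 127.0 or 1.0
    return [round(v / scale) for v in row], scale

def dequantize(codes, scale):
    # Recover approximate fp values from int8 codes.
    return [c * scale for c in codes]

row = [0.5, -1.27, 0.03, 0.81]
codes, scale = quantize_rowwise(row)
restored = dequantize(codes, scale)
err = max(abs(a - b) for a, b in zip(row, restored))
print(codes, err)  # worst-case round-trip error is at most scale / 2
```

The per-row scale is what keeps the scheme numerically stable: one outlier value only coarsens the quantization of its own row, not of the whole weight matrix.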
Requirements
Install the bitsandbytes library with `pip install bitsandbytes`.

Basic Usage
Enable int8 training with the `--use-bnb-linear` flag:
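For example, appended to your usual OpenCLIP training command (the entrypoint name varies across OpenCLIP versions; `...` stands for your existing data and hyperparameter flags):

```shell
# Only the last flag is specific to int8 training.
python -m open_clip_train.main ... --use-bnb-linear SwitchBackLinearGlobal
```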
Available Linear Layer Types
OpenCLIP supports two int8 linear layer implementations from bitsandbytes:

SwitchBackLinearGlobal
Standard 8-bit linear layer with switchback optimization:
- Good balance of speed and memory efficiency
- Recommended for most use cases
- Stable gradient computation
- Works well with all model sizes
SwitchBackLinearGlobalMemEfficient
Memory-optimized 8-bit linear layer:
- Further reduces memory usage
- Slightly slower than standard version
- Best for very large models or limited memory
- Useful when training huge models (ViT-H, ViT-g)
Performance Benefits
Training Speed
ViT-Huge model:
- Standard training: baseline
- Int8 training: ~10% faster (a 1.1x speedup)
Memory Usage
- Reduced weight storage (8-bit vs. 16/32-bit)
- Lower activation memory
- Enables larger batch sizes
- Can train larger models on same hardware
Accuracy
Int8 training maintains accuracy:
- No significant accuracy degradation observed
- Contrastive learning is robust to quantization
- Zero-shot performance remains comparable
- Fine-tuning results are preserved
Examples
Training ViT-B-32 with Int8
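A command sketch (hypothetical dataset path and hyperparameters; adjust the entrypoint to your OpenCLIP version):

```shell
python -m open_clip_train.main \
    --model ViT-B-32 \
    --train-data ./data/train.csv \
    --batch-size 256 \
    --precision amp \
    --use-bnb-linear SwitchBackLinearGlobal
```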
Training ViT-L-14 with Int8
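The same sketch for the larger model, typically with a smaller per-GPU batch size (values here are illustrative):

```shell
python -m open_clip_train.main \
    --model ViT-L-14 \
    --train-data ./data/train.csv \
    --batch-size 128 \
    --precision amp \
    --use-bnb-linear SwitchBackLinearGlobal
```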
Training ViT-H-14 with Memory-Efficient Int8
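For the largest models, the memory-efficient variant pairs naturally with gradient checkpointing (paths and batch size are illustrative):

```shell
python -m open_clip_train.main \
    --model ViT-H-14 \
    --train-data ./data/train.csv \
    --batch-size 64 \
    --precision amp \
    --grad-checkpointing \
    --use-bnb-linear SwitchBackLinearGlobalMemEfficient
```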
Combining with Other Optimizations
Int8 training works well with other memory and speed optimizations.

With Mixed Precision
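A sketch (`...` stands for your existing flags):

```shell
python -m open_clip_train.main ... --precision amp --use-bnb-linear SwitchBackLinearGlobal
```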
With Gradient Checkpointing
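Assuming OpenCLIP's standard `--grad-checkpointing` flag:

```shell
python -m open_clip_train.main ... --grad-checkpointing --use-bnb-linear SwitchBackLinearGlobal
```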
With Gradient Accumulation
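Assuming OpenCLIP's standard `--accum-freq` flag:

```shell
# --accum-freq 4 accumulates gradients over 4 steps (4x effective batch size)
python -m open_clip_train.main ... --accum-freq 4 --use-bnb-linear SwitchBackLinearGlobal
```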
With Distributed Training
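Int8 layers need no special handling under data-parallel training; a single-node multi-GPU sketch via `torchrun`:

```shell
# Launch 4 data-parallel workers on one node; flags as before.
torchrun --nproc_per_node 4 -m open_clip_train.main \
    ... --precision amp --use-bnb-linear SwitchBackLinearGlobal
```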
Int8 Inference
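bitsandbytes handles the details, but conceptually int8 inference keeps weights as int8 codes plus per-row scales and rescales each integer dot product back to floating point. An illustrative pure-Python sketch of this weight-only scheme (not the library's actual kernels):

```python
def quantize_rowwise(W):
    # Per-row symmetric absmax quantization: int8 codes + one fp scale per row.
    scales = [max(abs(v) for v in row) / 127.0 or 1.0 for row in W]
    codes = [[round(v / s) for v in row] for row, s in zip(W, scales)]
    return codes, scales

def int8_matvec(codes, scales, x):
    # Integer dot products, rescaled back to floating point per output row.
    return [s * sum(c * xi for c, xi in zip(row, x))
            for row, s in zip(codes, scales)]

W = [[0.5, -1.0], [0.25, 0.75]]  # toy weight matrix
x = [2.0, 4.0]                   # toy activation vector
exact = [sum(w * xi for w, xi in zip(row, x)) for row in W]
codes, scales = quantize_rowwise(W)
approx = int8_matvec(codes, scales, x)
print(exact, approx)  # the two outputs agree up to the quantization error
```

Because only int8 codes are stored for the weights, the matrix takes roughly a quarter of its fp32 footprint, plus one scale per row.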
You can also load and use int8 models for inference.

Tutorial Notebook
For a detailed walkthrough of int8 training and inference, see the tutorial notebook, which covers:
- Setting up int8 training
- Comparing performance with standard training
- Memory usage analysis
- Accuracy evaluation
- Inference optimization
- Best practices
Current Limitations
Attention Layers
Currently, only linear layers are replaced with int8 versions. Attention layers still use standard precision. Future improvements will include:
- Int8 attention layers (coming soon)
- Further speedups when attention is refactored
- Full model quantization
Platform Support
- Supported: NVIDIA GPUs with CUDA
- Not Supported: CPU, AMD GPUs, Apple Silicon
- Requires CUDA-compatible bitsandbytes installation
Optimizer State
Optimizer states (Adam, AdamW) still use higher precision:
- Int8 only applies to model weights
- Gradients are computed in higher precision
- Optimizer momentum and variance use fp32
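A rough per-parameter byte count shows why optimizer state dominates once weights are int8. The dtypes below (int8 weights, fp16 gradients, two fp32 Adam moments) are illustrative assumptions; exact numbers depend on your configuration:

```python
def bytes_per_param(weight_bits=8, grad_bits=16, optim_bits=2 * 32):
    # int8 weight + fp16 gradient + two fp32 Adam moments (m and v).
    return (weight_bits + grad_bits + optim_bits) // 8

n_params = 1_000_000_000  # e.g. a ~1B-parameter model
total_gb = n_params * bytes_per_param() / 1e9
print(f"{total_gb:.0f} GB")  # 11 GB -- the fp32 Adam moments are the largest slice
```

Of the 11 bytes per parameter, 8 belong to the Adam moments alone, which is why int8 weights shrink total training memory less than the 4x weight compression might suggest.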
When to Use Int8
Recommended For:
- Large Models
  - ViT-Huge and larger
  - Models that are close to memory limits
  - When you want to increase batch size
- Limited GPU Memory
  - Training on consumer GPUs (RTX 3090, 4090)
  - Maximizing model size on available hardware
  - Enabling larger experiments
- Speed-Critical Training
  - When a ~10% speedup matters
  - Large-scale training runs
  - Cost-sensitive training
Not Necessary For:
- Small Models (ViT-B-32, ResNet-50)
  - Limited benefit for smaller models
  - Standard training is already fast enough
- Abundant Memory
  - If memory is not a constraint
  - When using small batch sizes
- Maximum Precision Needed
  - Research requiring exact reproducibility
  - When numerical precision is critical
Best Practices
- Start with SwitchBackLinearGlobal
  - Good default choice for most use cases
  - Balances speed and memory
- Use with Mixed Precision
  - Combine `--use-bnb-linear` with `--precision amp`
  - Maximizes speed benefits
- Monitor Accuracy
  - Run regular zero-shot evaluations
  - Compare with baseline runs
  - Check final model performance
- Test Before Large Runs
  - Validate int8 training on a small dataset first
  - Ensure stability and convergence
  - Measure actual speedup on your hardware
- Enable for Large Models
  - Most beneficial for ViT-L and larger
  - Use SwitchBackLinearGlobalMemEfficient for ViT-H/ViT-g
Troubleshooting
Import Error
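If `import bitsandbytes` fails, it is usually a missing or mismatched install. Reinstalling is the typical fix, and recent bitsandbytes releases ship a diagnostic entrypoint (verify against your installed version):

```shell
pip install --upgrade bitsandbytes
# Print bitsandbytes' environment and CUDA-setup diagnostics:
python -m bitsandbytes
```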
CUDA Error
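CUDA errors usually indicate a mismatch between the bitsandbytes build and the local PyTorch/CUDA installation; checking both versions is a reasonable first step:

```shell
# Report PyTorch's version and the CUDA runtime it was built against,
# then the installed driver:
python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvidia-smi
```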
Slower Than Expected
- Ensure CUDA is properly installed
- Check GPU utilization (should be high)
- Verify mixed precision is enabled (`--precision amp`)
- Some models benefit more than others
Numerical Issues
- Increase warmup: `--warmup 5000`
- Reduce learning rate: `--lr 5e-4`
- Enable gradient clipping: `--grad-clip-norm 1.0`
- Try SwitchBackLinearGlobal instead of the MemEfficient version
