Overview
OpenCLIP supports two main data formats for training:

- CSV Format: simple, a good fit for small to medium datasets (<10M samples)
- WebDataset Format: efficient streaming format, required for large-scale datasets (>10M samples)
CSV Format
CSV format is the simplest way to specify image-text pairs for training.
Basic CSV Structure
Create a CSV or TSV file with image paths and corresponding captions.
CSV Parameters
Control CSV parsing with these flags:

- `--csv-separator`: column delimiter (default: `\t`, i.e. tab)
- `--csv-img-key`: column name for image paths (default: `filepath`)
- `--csv-caption-key`: column name for captions (default: `title`)
CSV Format Examples
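As a minimal sketch, a tab-separated training file using the default column names (`filepath`, `title`) can be generated with Python's `csv` module; the paths and captions below are made-up placeholders:

```python
import csv

# Rows of (image path, caption); values here are placeholders.
rows = [
    ("/data/images/000001.jpg", "a cat sitting on a windowsill"),
    ("/data/images/000002.jpg", "a dog running on the beach"),
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")   # matches the --csv-separator default
    writer.writerow(["filepath", "title"])   # matches --csv-img-key / --csv-caption-key defaults
    writer.writerows(rows)
```

The writer automatically quotes any field containing the delimiter or a quote character, so captions with embedded tabs stay parseable.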
Tab-separated files are the default.
CSV Best Practices
Good practices:
- Use absolute paths for image files
- Quote captions containing commas or special characters
- Keep CSV files on fast storage (SSD)
- Split large CSVs into train/val sets
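The last point can be sketched as a shuffled partition; the 98/2 ratio and the synthetic rows below are arbitrary illustrative choices, not OpenCLIP defaults:

```python
import csv
import random

# Synthetic (filepath, caption) rows standing in for a real dataset.
rows = [(f"/data/images/{i:06d}.jpg", f"caption number {i}") for i in range(1000)]

random.seed(0)
random.shuffle(rows)
cut = int(0.98 * len(rows))  # 98% train / 2% validation

for name, subset in (("train.csv", rows[:cut]), ("val.csv", rows[cut:])):
    with open(name, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["filepath", "title"])
        w.writerows(subset)
```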
Validation Data
Provide a separate CSV for validation via the `--val-data` flag.
WebDataset Format
WebDataset is a streaming dataset format optimized for large-scale training. It stores data in `.tar` archives with efficient sequential access.
WebDataset Structure
A WebDataset consists of multiple `.tar` files, each containing paired image and text files:
- Each sample consists of an image file and a text file with the same name
- Files are grouped into `.tar` archives (typically 1,000-10,000 samples each)
- Archives can be stored locally or accessed remotely (S3, HTTP)
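As a sketch, a small shard with three samples can be assembled with Python's `tarfile` module. The JPEG bytes here are placeholders; a real pipeline writes actual encoded images:

```python
import io
import tarfile

# Placeholder bytes standing in for real JPEG data.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9"

with tarfile.open("00000.tar", "w") as tar:
    for i in range(3):
        # Image and caption share the same basename -- that basename is the sample key.
        for name, payload in (
            (f"{i:09d}.jpg", fake_jpeg),
            (f"{i:09d}.txt", f"caption for sample {i}".encode("utf-8")),
        ):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```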
Creating WebDatasets with img2dataset
img2dataset is the recommended tool for converting image-text datasets to WebDataset format.
Installation
Basic Usage
Input: a Parquet or CSV file with image URLs and captions.
Conceptual Captions 3M Example
See the CC3M img2dataset example for a complete walkthrough.
Training with WebDataset
Use the `--dataset-type webdataset` flag:

- `--train-data`: path with a glob pattern matching the `.tar` files
- `--train-num-samples`: total number of samples (required)
- `--dataset-type webdataset`: specify the WebDataset format
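The `--train-data` pattern is expanded into a concrete shard list before loading. A stdlib-only sketch of the numeric brace expansion (real loaders typically rely on the `braceexpand` package, which also handles nested and comma forms):

```python
import re

def expand_braces(pattern):
    """Expand one zero-padded numeric range like {00000..00004}."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if m is None:
        return [pattern]
    lo, hi, width = int(m.group(1)), int(m.group(2)), len(m.group(1))
    return [pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
            for i in range(lo, hi + 1)]

shards = expand_braces("/data/laion/{00000..00004}.tar")
```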
Glob Patterns for WebDataset
WebDataset supports bash-style glob patterns, including brace ranges such as `{00000..09999}`.
Dataset Resampling
For large datasets, enable sampling with replacement via `--dataset-resampled`:

- Enables efficient epoch-based training on streaming datasets
- Allows training for fewer than one full epoch (via `--train-num-samples`)
- Required when using multiple data sources with upsampling
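Conceptually, resampled loading draws shards with replacement until the sample budget set by `--train-num-samples` is met. A toy sketch of that idea (shard names and counts below are made up, and real loaders interleave at the sample level):

```python
import random

shards = [f"{i:05d}.tar" for i in range(10)]
samples_per_shard = 1000   # assumed uniform for simplicity
train_num_samples = 2500   # the per-"epoch" sample budget

random.seed(0)
drawn, seen = [], 0
while seen < train_num_samples:
    drawn.append(random.choice(shards))  # with replacement: repeats allowed
    seen += samples_per_shard

# Some shards may appear twice in an "epoch", others not at all.
```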
Partial Epochs for Large Datasets
For very large datasets (e.g. LAION-2B), train on a fraction of an epoch at a time.
Multiple Data Sources
Combine multiple datasets by joining their paths with the `::` separator.
Data Upsampling Factors
By default, samples are drawn from each source in proportion to dataset size. Use `--train-data-upsampling-factors` to control the weighting.
Combining LAION and CC12M Example
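As a sketch of the arithmetic, the expected per-source sampling probability is proportional to dataset size times its upsampling factor. The sizes below are approximate and the 5x factor is an arbitrary illustrative choice:

```python
# size = number of samples, factor = the --train-data-upsampling-factors entry.
sources = {
    "laion400m": {"size": 400_000_000, "factor": 1.0},
    "cc12m":     {"size": 12_000_000,  "factor": 5.0},  # upsample the small source
}

total = sum(s["size"] * s["factor"] for s in sources.values())
probs = {name: s["size"] * s["factor"] / total for name, s in sources.items()}
# Without the factor, CC12M would contribute only ~3% of samples;
# with factor 5 it contributes ~13%.
```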
Remote Data Loading
WebDataset supports loading data from remote sources:
AWS S3
HTTP/HTTPS
Google Cloud Storage
Remote data loading is limited by network bandwidth. For best performance:
- Use local storage when possible
- Ensure high-bandwidth network connection
- Monitor network utilization with `iftop` or `nethogs`
Data Format Specifications
Image Formats
Supported image formats:

- JPEG (.jpg, .jpeg)
- PNG (.png)
- WebP (.webp)
- Any format supported by PIL/Pillow

Recommendations:

- Use JPEG for photographs (smaller file size)
- Use PNG for images with transparency or text
- Store images at a reasonable resolution (224-512px for most models)
Text Formats
WebDataset stores captions as plain text files (`.txt`). Caption guidelines:

- Keep captions concise but descriptive
- Typical length: 5-20 words
- Remove special characters that may cause parsing issues
- Use UTF-8 encoding for international text
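The last two points can be sketched as a small normalizer that replaces control characters and collapses whitespace; the NFC normalization step is an extra choice here, not an OpenCLIP requirement:

```python
import unicodedata

def clean_caption(text):
    """Normalize to NFC, replace control chars with spaces, collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in text)
    return " ".join(text.split())

cleaned = clean_caption("a  photo\tof a\x00 dog\n")
```

Legitimate non-ASCII text (accents, CJK) passes through untouched; only the Unicode "other" (control) categories are stripped.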
Data Quality Considerations
Image Quality
Good practices:
- Filter out corrupted or unreadable images
- Remove duplicates
- Ensure minimum resolution (e.g., 224x224)
- Verify aspect ratios are reasonable
- Check for NSFW content if needed
Caption Quality
Good practices:
- Remove empty or very short captions (<3 words)
- Filter out captions with excessive special characters
- Remove duplicate captions
- Consider language filtering for multilingual datasets
- Remove personally identifiable information (PII)
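The first three checks above can be combined into a simple streaming filter (the 3-word threshold is the one suggested above):

```python
def caption_filter(captions):
    """Yield captions that are >= 3 words long and not exact duplicates."""
    seen = set()
    for caption in captions:
        if len(caption.split()) < 3:
            continue  # empty or very short caption
        if caption in seen:
            continue  # exact duplicate
        seen.add(caption)
        yield caption

kept = list(caption_filter([
    "", "a dog", "a dog on a beach", "a dog on a beach", "red car parked outside",
]))
```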
Dataset Size Recommendations
| Use Case | Dataset Size | Format | Example |
|---|---|---|---|
| Experimentation | 1M-10M | CSV or WebDataset | CC3M, CC12M |
| Small-scale training | 10M-100M | WebDataset | YFCC15M |
| Medium-scale | 100M-500M | WebDataset | LAION-400M |
| Large-scale | 500M+ | WebDataset | LAION-2B, DataComp-1B |
Example Datasets
Conceptual Captions 3M (CC3M)
Conceptual Captions 12M (CC12M)
LAION Datasets
LAION datasets are available as pre-processed WebDatasets; see the LAION documentation for download instructions.
Preprocessing and Augmentation
Image Preprocessing
Images are automatically preprocessed during training:

- Resize to model input size (default: 224x224)
- Normalize with ImageNet statistics
- Data augmentation (optional)
Data Augmentation
Custom augmentation can be configured via the `--aug-cfg` flag.
Storage Optimization
Tar File Size
Recommended tar file sizes:

- 1,000-10,000 samples per `.tar` for local storage
- Smaller tars (1,000-5,000) for remote/networked storage
- Balance between random access and I/O efficiency
Compression
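Shards can be gzip-compressed after creation. Whether the training loader accepts `.tar.gz` directly depends on your WebDataset setup, so treat this as a storage-side sketch; Python's `tarfile` reads both forms:

```python
import gzip
import io
import shutil
import tarfile

# Build a small shard of compressible text members.
payload = b"caption text " * 100
with tarfile.open("shard.tar", "w") as tar:
    for i in range(5):
        info = tarfile.TarInfo(f"{i:09d}.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Compress the whole archive to shard.tar.gz.
with open("shard.tar", "rb") as src, gzip.open("shard.tar.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```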
WebDataset `.tar` files can be compressed, trading smaller storage for extra CPU during loading.
SSD vs HDD
| Storage | Random Access | Sequential | Best For |
|---|---|---|---|
| NVMe SSD | Excellent | Excellent | CSV, small WebDataset |
| SATA SSD | Good | Good | CSV, WebDataset |
| HDD | Poor | Good | Large WebDataset (sequential reads) |
| Network | Variable | Variable | Remote WebDataset |
Troubleshooting
Slow Data Loading
Symptom: low GPU utilization (<80%)

Solutions:

- Increase `--workers` (more data-loading processes)
- Use faster storage (SSD instead of HDD)
- Convert CSV to WebDataset for large datasets
- Reduce image resolution in preprocessing
- Check network bandwidth for remote data
Corrupted Images
Symptom: training crashes with PIL errors

Solutions:

- Filter corrupted images during preprocessing
- Add error handling in custom data pipeline
- Validate all images before creating WebDataset
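A lightweight stdlib-only integrity check can look at the JPEG start/end markers; it catches truncated downloads but not all corruption, so for thorough validation the usual choice is Pillow's `Image.open(...).verify()`:

```python
def looks_like_complete_jpeg(path):
    """Cheap check: file starts with the JPEG SOI marker and ends with EOI."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError:
        return False
    return data[:3] == b"\xff\xd8\xff" and data.endswith(b"\xff\xd9")

# Demo files: one structurally complete, one truncated.
with open("good.jpg", "wb") as f:
    f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 32 + b"\xff\xd9")
with open("bad.jpg", "wb") as f:
    f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 32)  # missing the EOI marker
```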
Missing Files
Symptom: FileNotFoundError during training

Solutions:

- Use absolute paths in CSV files
- Verify file permissions
- Check glob patterns for WebDataset
- Ensure all nodes have access to shared storage (multi-node)
Unbalanced Data Sources
Symptom: the model overfits to the larger dataset

Solutions:

- Use `--train-data-upsampling-factors` to balance sampling
- Create a weighted mixture of datasets
- Train on larger dataset first, then fine-tune on smaller
Data Validation Script
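Since the exact validation script is not shown here, the following is a minimal pairing check using the stdlib `tarfile` module: it verifies that every sample key in a shard has both an image and a `.txt` caption.

```python
import io
import os
import tarfile
from collections import defaultdict

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def validate_shard(path):
    """Return (sample count, keys missing an image or a caption)."""
    samples = defaultdict(set)
    with tarfile.open(path) as tar:
        for name in tar.getnames():
            key, ext = os.path.splitext(name)
            samples[key].add(ext.lower())
    incomplete = [key for key, exts in sorted(samples.items())
                  if not (exts & IMAGE_EXTS and ".txt" in exts)]
    return len(samples), incomplete

# Demo shard: sample 0 is complete, sample 1 is missing its caption.
with tarfile.open("check-00000.tar", "w") as tar:
    for name in ("000000000.jpg", "000000000.txt", "000000001.jpg"):
        tar.addfile(tarfile.TarInfo(name), io.BytesIO(b""))

count, incomplete = validate_shard("check-00000.tar")
```

Run this over every shard (e.g. with `glob`) before launching a long training job.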
Validate your WebDataset before training.
Next Steps
Single-Node Training
Train on prepared data on a single machine
Configuration
Configure training parameters and data augmentation
Distributed Training
Optimize data loading for distributed training
Training Overview
Return to training overview
