
Overview

OpenCLIP supports two main data formats for training:
  1. CSV Format: Simple, good for small to medium datasets (<10M samples)
  2. WebDataset Format: Efficient, required for large-scale datasets (>10M samples)
This guide covers how to prepare, format, and optimize your training data.

CSV Format

CSV format is the simplest way to specify image-text pairs for training.

Basic CSV Structure

Create a CSV or TSV file with image paths and corresponding captions:
filepath,title
/data/images/img001.jpg,"A photo of a cat sitting on a windowsill"
/data/images/img002.jpg,"A dog playing fetch in the park"
/data/images/img003.jpg,"Sunset over the ocean with sailboats"
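A file like the one above can be generated programmatically. A minimal sketch using Python's csv module (the paths and captions are illustrative; note that OpenCLIP's default separator is tab, so pass --csv-separator "," when training on a comma-separated file):

```python
import csv

# Illustrative image paths and captions; replace with your own dataset.
rows = [
    ("/data/images/img001.jpg", "A photo of a cat sitting on a windowsill"),
    ("/data/images/img002.jpg", "A dog playing fetch in the park"),
]

with open("train_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # automatically quotes captions containing commas
    writer.writerow(["filepath", "title"])
    writer.writerows(rows)
```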

CSV Parameters

Control CSV parsing with these flags:
python -m open_clip_train.main \
    --train-data "/path/to/train_data.csv" \
    --dataset-type csv \
    --csv-separator "\t" \
    --csv-img-key filepath \
    --csv-caption-key title \
    # ... other arguments
Parameters:
  • --csv-separator: Column delimiter (default: \t for tab)
  • --csv-img-key: Column name for image paths (default: filepath)
  • --csv-caption-key: Column name for captions (default: title)

CSV Format Examples

Tab-separated (default):
filepath\ttitle
/data/img1.jpg\tA cat
/data/img2.jpg\tA dog
Comma-separated:
image_path,caption
/data/img1.jpg,"A cat playing with yarn"
/data/img2.jpg,"A dog in the snow"
--csv-separator "," \
--csv-img-key image_path \
--csv-caption-key caption
Custom columns:
id,url,description,tags
1,/img/001.png,"Red sports car",vehicle
2,/img/002.png,"Mountain landscape",nature
--csv-img-key url \
--csv-caption-key description

CSV Best Practices

Good practices:
  • Use absolute paths for image files
  • Quote captions containing commas or special characters
  • Keep CSV files on fast storage (SSD)
  • Split large CSVs into train/val sets
Limitations:
  • CSV format is slow for datasets >10M samples
  • Random access is inefficient for shuffling
  • No built-in data sharding for distributed training
  • Recommendation: Use WebDataset for datasets >10M samples
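A quick pre-flight check along these lines can catch relative or missing paths before a long run. A sketch assuming the default filepath column name and a comma-separated file (the demo path is hypothetical):

```python
import csv
import os

def check_csv(path, img_key="filepath", sep=","):
    """Return (missing, relative) image paths listed in a training CSV."""
    missing, relative = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=sep):
            p = row[img_key]
            if not os.path.isabs(p):
                relative.append(p)   # relative paths break when cwd changes
            elif not os.path.exists(p):
                missing.append(p)    # absolute but absent on this machine
    return missing, relative

# Demo with a throwaway CSV (hypothetical path, intentionally nonexistent).
with open("demo.csv", "w", encoding="utf-8") as f:
    f.write("filepath,title\n/definitely/not/there.jpg,A cat\n")
missing, relative = check_csv("demo.csv")
print(missing)  # paths that do not exist on disk
```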

Validation Data

Provide a separate CSV for validation:
python -m open_clip_train.main \
    --train-data "/data/train.csv" \
    --val-data "/data/val.csv" \
    --dataset-type csv \
    # ... other arguments

WebDataset Format

WebDataset is a streaming dataset format optimized for large-scale training. It stores data in .tar archives with efficient sequential access.

WebDataset Structure

A WebDataset consists of multiple .tar files, each containing paired image and text files:
cc12m-train-0000.tar
├── 000000.jpg
├── 000000.txt
├── 000001.jpg
├── 000001.txt
├── 000002.jpg
└── 000002.txt

cc12m-train-0001.tar
├── 001000.jpg
├── 001000.txt
└── ...
Key points:
  • Each sample consists of an image file and a text file with the same name
  • Files are grouped into .tar archives (typically 1,000-10,000 samples each)
  • Archives can be stored locally or accessed remotely (S3, HTTP)
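Shards with this layout can be written with nothing more than the standard library. A minimal sketch using tarfile (the image bytes here are placeholders, not real JPEG data; in practice you would read actual image files or use img2dataset as shown below):

```python
import io
import tarfile

# Placeholder samples: (key, image bytes, caption).
samples = [
    ("000000", b"<jpeg bytes>", "A photo of a cat"),
    ("000001", b"<jpeg bytes>", "A dog in the park"),
]

with tarfile.open("train-0000.tar", "w") as tar:
    for key, jpg, caption in samples:
        # Each sample is a .jpg/.txt pair sharing the same basename.
        for name, payload in ((f"{key}.jpg", jpg), (f"{key}.txt", caption.encode("utf-8"))):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```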

Creating WebDatasets with img2dataset

img2dataset is the recommended tool for converting image-text datasets to WebDataset format.

Installation

pip install img2dataset

Basic Usage

Input: Parquet or CSV file with image URLs and captions
# input.parquet (contents shown as CSV for readability)
url,caption
https://example.com/img1.jpg,"A cat"
https://example.com/img2.jpg,"A dog"
Convert to WebDataset:
img2dataset --url_list input.parquet \
    --input_format "parquet" \
    --url_col "url" \
    --caption_col "caption" \
    --output_folder dataset \
    --output_format webdataset \
    --processes_count 16 \
    --thread_count 64 \
    --image_size 256 \
    --resize_mode "shortest" \
    --resize_only_if_bigger True \
    --encode_format "jpg" \
    --encode_quality 95
Output:
dataset/
├── 00000.tar
├── 00001.tar
├── 00002.tar
└── ...

Conceptual Captions 3M Example

See the CC3M img2dataset example for a complete walkthrough:
# Download CC3M metadata
wget https://ai.google.com/research/ConceptualCaptions/download

# Convert to WebDataset
img2dataset --url_list cc3m.tsv \
    --input_format "tsv" \
    --url_col "url" \
    --caption_col "caption" \
    --output_folder cc3m_webdataset \
    --output_format webdataset \
    --processes_count 16 \
    --thread_count 64 \
    --image_size 224 \
    --resize_mode "center_crop"

Training with WebDataset

Use the --dataset-type webdataset flag:
python -m open_clip_train.main \
    --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 256 \
    # ... other arguments
Key parameters:
  • --train-data: Path with glob pattern for .tar files
  • --train-num-samples: Total number of samples (required)
  • --dataset-type webdataset: Specify WebDataset format

Glob Patterns for WebDataset

WebDataset supports bash-style glob patterns:
# Range of files: 0000 to 2175
--train-data "/data/train-{0000..2175}.tar"

# Range with padding: 00000 to 41455
--train-data "/data/train-{00000..41455}.tar"

# Multiple directories
--train-data "/data/shard1/train-{0000..1000}.tar" \
             "/data/shard2/train-{0000..1000}.tar"
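The {0000..2175} form is a numeric brace-expansion range with zero-padding preserved; the webdataset library expands it internally. A minimal pure-Python sketch of the single-range case, to make the semantics concrete:

```python
import re

def expand_brace_range(pattern):
    """Expand one {start..end} numeric range, preserving zero-padding."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if not m:
        return [pattern]  # no range present: pattern names a single file
    start, end = m.group(1), m.group(2)
    width = len(start)    # e.g. "0000" -> pad every index to 4 digits
    return [
        pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
        for i in range(int(start), int(end) + 1)
    ]

print(expand_brace_range("/data/train-{0000..0002}.tar"))
```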

Dataset Resampling

For large datasets, enable sampling with replacement:
python -m open_clip_train.main \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    # ... other arguments
Benefits of --dataset-resampled:
  • Enables efficient epoch-based training on streaming datasets
  • Allows training for fewer than one full epoch (via --train-num-samples)
  • Required when using multiple data sources with upsampling

Partial Epochs for Large Datasets

For very large datasets (LAION-2B), train on a fraction of an epoch:
python -m open_clip_train.main \
    --train-data "/data/laion2b/{00000..100000}.tar" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --dataset-resampled \
    --epochs 32 \
    # ... other arguments
# train-num-samples = 2.17B / 16 ≈ 135M (1/16 of full dataset per "epoch")
# More frequent checkpoints and evaluation

Multiple Data Sources

Combine multiple datasets using the :: separator:
python -m open_clip_train.main \
    --train-data "/data/cc12m/train-{0000..2175}.tar::/data/laion400m/{00000..41455}.tar" \
    --dataset-type webdataset \
    --dataset-resampled \
    # ... other arguments

Data Upsampling Factors

By default, samples from each source are proportional to dataset size. Use --train-data-upsampling-factors to control weighting:
# Equal sampling from both datasets
python -m open_clip_train.main \
    --train-data "/data/cc12m/train.tar::/data/cc3m/train.tar" \
    --train-data-upsampling-factors "1::4" \
    --dataset-resampled \
    # ... other arguments
# CC12M: ~12M samples, weight 1
# CC3M: ~3M samples, weight 4
# Effective sampling ratio: 12M:12M (equal)
Examples:
# Default behavior (proportional to size)
--train-data "dataset_a.tar::dataset_b.tar"
# Dataset A: 1000 samples, Dataset B: 100 samples
# Sampling probability: A ≈ 91%, B ≈ 9%

# Equal sampling
--train-data "dataset_a.tar::dataset_b.tar" \
--train-data-upsampling-factors "0.001::0.01"
# or equivalently: --train-data-upsampling-factors "1::10"
# Sampling probability: A=50%, B=50%

# Custom weighting
--train-data "high_quality.tar::medium_quality.tar::low_quality.tar" \
--train-data-upsampling-factors "2::1::0.5"
# High quality: 2x weight, Medium: 1x, Low: 0.5x
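The weighting rule behind these examples (sampling probability proportional to dataset size times its upsampling factor) can be sketched as:

```python
def sampling_probs(sizes, factors=None):
    """Per-source sampling probability: proportional to size * upsampling factor."""
    if factors is None:
        factors = [1.0] * len(sizes)  # default: proportional to dataset size
    weights = [s * f for s, f in zip(sizes, factors)]
    total = sum(weights)
    return [w / total for w in weights]

# CC12M (~12M) with factor 1 vs CC3M (~3M) with factor 4 -> equal sampling
print(sampling_probs([12_000_000, 3_000_000], [1, 4]))  # → [0.5, 0.5]
```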

Combining LAION and CC12M Example

python -m open_clip_train.main \
    --train-data "/data/cc12m/cc12m-{0000..2175}.tar::/data/laion400m/{00000..41455}.tar" \
    --train-data-upsampling-factors "1::1" \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 256 \
    --model ViT-B-32
# Both datasets sampled equally despite different sizes

Remote Data Loading

WebDataset supports loading data from remote sources:

AWS S3

python -m open_clip_train.main \
    --train-data "pipe:aws s3 cp s3://my-bucket/train/train-{0000..2175}.tar -" \
    --dataset-type webdataset \
    # ... other arguments

HTTP/HTTPS

python -m open_clip_train.main \
    --train-data "https://example.com/datasets/train-{0000..2175}.tar" \
    --dataset-type webdataset \
    # ... other arguments

Google Cloud Storage

python -m open_clip_train.main \
    --train-data "pipe:gsutil cat gs://my-bucket/train-{0000..2175}.tar" \
    --dataset-type webdataset \
    # ... other arguments
Remote data loading is limited by network bandwidth. For best performance:
  • Use local storage when possible
  • Ensure high-bandwidth network connection
  • Monitor network utilization with iftop or nethogs

Data Format Specifications

Image Formats

Supported image formats:
  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • WebP (.webp)
  • Any format supported by PIL/Pillow
Recommendations:
  • Use JPEG for photographs (smaller file size)
  • Use PNG for images with transparency or text
  • Store at reasonable resolution (224-512px for most models)

Text Formats

WebDataset: Plain text files (.txt)
A photo of a cat sitting on a windowsill
CSV: Quoted strings
/path/to/image.jpg,"A photo of a cat sitting on a windowsill"
Caption guidelines:
  • Keep captions concise but descriptive
  • Typical length: 5-20 words
  • Remove special characters that may cause parsing issues
  • Use UTF-8 encoding for international text

Data Quality Considerations

Image Quality

Good practices:
  • Filter out corrupted or unreadable images
  • Remove duplicates
  • Ensure minimum resolution (e.g., 224x224)
  • Verify aspect ratios are reasonable
  • Check for NSFW content if needed
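The image checks above can be scripted with Pillow. A sketch with illustrative thresholds (not OpenCLIP defaults):

```python
from PIL import Image

def image_ok(path, min_size=224, max_aspect=3.0):
    """True if the image decodes cleanly and meets size/aspect constraints."""
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without a full decode
        with Image.open(path) as img:
            width, height = img.size
    except Exception:
        return False      # corrupted or unreadable
    if min(width, height) < min_size:
        return False      # below minimum resolution
    return max(width, height) / min(width, height) <= max_aspect
```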

Caption Quality

Good practices:
  • Remove empty or very short captions (<3 words)
  • Filter out captions with excessive special characters
  • Remove duplicate captions
  • Consider language filtering for multilingual datasets
  • Remove personally identifiable information (PII)
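A heuristic filter implementing the caption practices above might look like this (the word count and special-character thresholds are illustrative):

```python
import re

def caption_ok(caption, min_words=3, max_special_ratio=0.2):
    """Reject empty/very short captions and caption text dominated by symbols."""
    caption = caption.strip()
    words = caption.split()
    if len(words) < min_words:
        return False
    # Ratio of non-alphanumeric, non-whitespace characters to total length.
    specials = len(re.findall(r"[^\w\s]", caption))
    return specials / max(len(caption), 1) <= max_special_ratio

print(caption_ok("A cat"))                # False: too short
print(caption_ok("A dog playing fetch"))  # True
```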

Dataset Size Recommendations

| Use Case | Dataset Size | Format | Example |
| --- | --- | --- | --- |
| Experimentation | 1M-10M | CSV or WebDataset | CC3M, CC12M |
| Small-scale training | 10M-100M | WebDataset | YFCC15M |
| Medium-scale | 100M-500M | WebDataset | LAION-400M |
| Large-scale | 500M+ | WebDataset | LAION-2B, DataComp-1B |

Example Datasets

Conceptual Captions 3M (CC3M)

# Download and prepare CC3M
wget https://ai.google.com/research/ConceptualCaptions/download

# Convert to WebDataset using img2dataset
img2dataset --url_list Train_GCC-training.tsv \
    --input_format "tsv" \
    --url_col "url" \
    --caption_col "caption" \
    --output_folder cc3m \
    --output_format webdataset \
    --processes_count 16 \
    --thread_count 64

# Train
python -m open_clip_train.main \
    --train-data "cc3m/{00000..00331}.tar" \
    --train-num-samples 3000000 \
    --dataset-type webdataset \
    --model ViT-B-32

Conceptual Captions 12M (CC12M)

# Download CC12M
wget https://github.com/google-research-datasets/conceptual-12m/blob/master/cc12m.tsv

# Convert to WebDataset
img2dataset --url_list cc12m.tsv \
    --input_format "tsv" \
    --url_col "url" \
    --caption_col "caption" \
    --output_folder cc12m \
    --output_format webdataset \
    --processes_count 32 \
    --thread_count 128

# Train
python -m open_clip_train.main \
    --train-data "cc12m/cc12m-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --dataset-resampled \
    --model ViT-B-32

LAION Datasets

LAION datasets are available as pre-processed WebDatasets: See LAION documentation for download instructions.

Preprocessing and Augmentation

Image Preprocessing

Images are automatically preprocessed during training:
  1. Resize to model input size (default: 224x224)
  2. Normalize with the configured mean/std statistics
  3. Data augmentation (optional)
Control preprocessing with these flags:
--image-mean 0.485 0.456 0.406 \
--image-std 0.229 0.224 0.225 \
--image-interpolation bicubic \
--image-resize-mode shortest
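The normalization step is channel-wise standardization with the mean/std values passed via the flags above. A NumPy sketch using the ImageNet values shown:

```python
import numpy as np

# Values matching the --image-mean / --image-std flags above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """image: float array in [0, 1] with shape (H, W, 3). Returns standardized array."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```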

Data Augmentation

Custom augmentation via --aug-cfg:
--aug-cfg scale_range=0.08::1.0 ratio_range=0.75::1.33
See Configuration for all augmentation options.

Storage Optimization

Tar File Size

Recommended tar file sizes:
  • 1,000-10,000 samples per .tar for local storage
  • Smaller tars (1,000-5,000) for remote/networked storage
  • Balance between random access and I/O efficiency

Compression

WebDataset .tar files can be compressed:
# Create compressed tars
tar czf train-0000.tar.gz *.jpg *.txt

# Use in training (automatic decompression)
--train-data "train-{0000..1000}.tar.gz"
Tradeoff: Smaller storage vs. CPU overhead for decompression

SSD vs HDD

| Storage | Random Access | Sequential | Best For |
| --- | --- | --- | --- |
| NVMe SSD | Excellent | Excellent | CSV, small WebDataset |
| SATA SSD | Good | Good | CSV, WebDataset |
| HDD | Poor | Good | Large WebDataset (sequential reads) |
| Network | Variable | Variable | Remote WebDataset |

Troubleshooting

Slow Data Loading

Symptom: Low GPU utilization (<80%)
Solutions:
  1. Increase --workers (more data loading processes)
  2. Use faster storage (SSD instead of HDD)
  3. Convert CSV to WebDataset for large datasets
  4. Reduce image resolution in preprocessing
  5. Check network bandwidth for remote data

Corrupted Images

Symptom: Training crashes with PIL errors
Solutions:
  1. Filter corrupted images during preprocessing:
    img2dataset --skip_reencode False
    
  2. Add error handling in custom data pipeline
  3. Validate all images before creating WebDataset

Missing Files

Symptom: FileNotFoundError during training
Solutions:
  1. Use absolute paths in CSV files
  2. Verify file permissions
  3. Check glob patterns for WebDataset
  4. Ensure all nodes have access to shared storage (multi-node)

Unbalanced Data Sources

Symptom: Model overfits to larger dataset
Solutions:
  1. Use --train-data-upsampling-factors to balance sampling
  2. Create weighted mixture of datasets
  3. Train on larger dataset first, then fine-tune on smaller

Data Validation Script

Validate your WebDataset before training:
import webdataset as wds
from PIL import Image
import io

# Test loading from WebDataset
dataset = wds.WebDataset("/data/train-0000.tar")
for i, sample in enumerate(dataset):
    if i >= 10:  # Check first 10 samples
        break
    
    # Verify image can be loaded
    img = Image.open(io.BytesIO(sample["jpg"]))
    caption = sample["txt"].decode("utf-8")
    print(f"Sample {i}: {img.size}, caption: {caption[:50]}...")
    
print("Validation complete!")

Next Steps

Single-Node Training

Train on prepared data with single machine

Configuration

Configure training parameters and data augmentation

Distributed Training

Optimize data loading for distributed training

Training Overview

Return to training overview
