## Overview

Remote training features in OpenCLIP:

- Resume from remote paths: Load checkpoints directly from S3 or other remote storage
- Automatic backup: Continuously sync training checkpoints to remote storage
- fsspec support: Work with any filesystem supported by fsspec
- Checkpoint management: Automatically clean up old checkpoints to save space
## Resuming from Remote Checkpoints

You can resume training directly from a remote checkpoint without downloading it first.

### Resume from S3

Use the S3 URI directly in the `--resume` flag:
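For example, a minimal sketch (the `open_clip_train.main` entry point, model choice, and all paths here are illustrative and may differ in your setup):

```shell
# Resume directly from a checkpoint stored in S3 (illustrative paths)
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --resume "s3://my-bucket/experiments/run1/checkpoints/epoch_12.pt"
```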
### Resume from Other Remote Storage

OpenCLIP uses fsspec to support various storage backends:
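For example, a Google Cloud Storage URI can be used the same way, assuming the matching fsspec backend package is installed (e.g. `gcsfs` for `gs://`, `adlfs` for `az://`); the paths below are illustrative:

```shell
# Resume from Google Cloud Storage via fsspec (requires gcsfs)
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --resume "gs://my-bucket/experiments/run1/checkpoints/epoch_12.pt"
```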
### Complete Resume Example
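A fuller sketch of resuming a run from S3; the entry point, dataset shards, and bucket names are placeholders:

```shell
python -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/openclip \
    --name my-experiment \
    --resume "s3://my-bucket/experiments/my-experiment/checkpoints/epoch_12.pt"
```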
## Automatic Remote Backup

Continuously back up training checkpoints to remote storage during training. This prevents data loss and enables easy resume from any point.

### Basic Remote Sync Setup

Use `--remote-sync` to specify the remote destination:
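A sketch of such a run (entry point, dataset, and bucket names are placeholders):

```shell
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/training \
    --name my-experiment \
    --remote-sync "s3://my-bucket/training-runs"
```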
- Save checkpoints locally to `/scratch/training/my-experiment/`
- Sync to `s3://my-bucket/training-runs/my-experiment/`
- Run sync in the background every 5 minutes (default)
### Remote Sync Parameters

#### `--remote-sync`

Specify the remote path for backup:

- S3: `s3://bucket-name/path`
- S3 with credentials: `s3://bucket-name/path` (uses AWS credentials)
- Other fsspec backends: `gs://`, `az://`, etc.
#### `--remote-sync-frequency`

How often to sync (in seconds):

- Fast storage: 300 seconds (5 minutes)
- Slow storage: 900-1800 seconds (15-30 minutes)
- Large checkpoints: 600+ seconds
- Small checkpoints: 300 seconds
#### `--remote-sync-protocol`

Specify the sync protocol. The `fsspec` protocol is currently experimental and very slow; use `s3` for production workloads.
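For example, selecting the `s3` protocol explicitly alongside `--remote-sync` (bucket name is a placeholder):

```shell
--remote-sync "s3://my-bucket/training-runs" \
--remote-sync-protocol s3
```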
## Complete Remote Training Examples

### Example 1: S3 Training with Backup

Train with local SSD and automatic S3 backup:

- Local checkpoints: `/scratch/openclip/vitb32-cc12m/checkpoints/`
- Remote backup: `s3://my-training-bucket/experiments/vitb32-cc12m/checkpoints/`
- Syncs every 5 minutes
- Final sync when training completes
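The setup above might be launched like this (entry point and dataset shards are illustrative):

```shell
python -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/openclip \
    --name vitb32-cc12m \
    --remote-sync "s3://my-training-bucket/experiments" \
    --remote-sync-frequency 300
```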
### Example 2: Multi-GPU with Remote Sync
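A sketch using `torchrun` for 8 GPUs on one node (names and paths are placeholders):

```shell
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256 \
    --logs /scratch/openclip \
    --name vitb32-multigpu \
    --remote-sync "s3://my-training-bucket/experiments"
```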
### Example 3: Resume from S3 and Continue Syncing
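A sketch combining `--resume` from S3 with ongoing sync back to the same location (the checkpoint name is hypothetical):

```shell
python -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/openclip \
    --name vitb32-cc12m \
    --resume "s3://my-training-bucket/experiments/vitb32-cc12m/checkpoints/epoch_12.pt" \
    --remote-sync "s3://my-training-bucket/experiments"
```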
## Checkpoint Management

### Delete Previous Checkpoints

Save disk space by automatically deleting old checkpoints:

- Keep only the most recent checkpoint locally
- Delete previous checkpoints after saving a new one
- Remote backups are unaffected (all checkpoints synced)
- Useful when local storage is limited
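A sketch combining local cleanup with remote backup (paths and names are placeholders):

```shell
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/training \
    --name my-experiment \
    --delete-previous-checkpoint \
    --remote-sync "s3://my-bucket/training-runs"
```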
With remote sync enabled:

- Local: Only `epoch_latest.pt` is kept
- Remote: All epochs synced to S3 (`epoch_1.pt`, `epoch_2.pt`, etc.)
### Resume Latest from Remote

When using `--resume latest` with remote sync:

- Only works with `--remote-sync-protocol s3`
- Does not work with `--save-most-recent`
- Checks remote storage for the latest checkpoint
- May not find a checkpoint if sync is still in progress
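A sketch (placeholders as before):

```shell
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/training \
    --name my-experiment \
    --resume latest \
    --remote-sync "s3://my-bucket/training-runs" \
    --remote-sync-protocol s3
```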
## AWS S3 Configuration

### AWS Credentials

Ensure AWS credentials are configured, e.g. via `aws configure` or the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables.

### S3 Bucket Setup
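One way to create a bucket and verify write access with the AWS CLI (the bucket name is a placeholder):

```shell
# Create the bucket once, then confirm you can write to and delete from it
aws s3 mb s3://my-training-bucket
echo "sync test" > /tmp/sync-test.txt
aws s3 cp /tmp/sync-test.txt s3://my-training-bucket/sync-test.txt
aws s3 rm s3://my-training-bucket/sync-test.txt
```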
## Workflow Patterns

### Pattern 1: Fast Local Storage + S3 Backup

Best for: Training on cloud instances with local SSDs

### Pattern 2: Resume After Interruption

Best for: Spot instances, preemptible VMs

### Pattern 3: Centralized Checkpoint Storage

Best for: Team collaboration, multiple training nodes

## Sync Process Details
### How Sync Works

1. Training starts, and a background sync process is launched
2. Every `--remote-sync-frequency` seconds:
   - All files are synced from the local logs directory to remote
   - Only changed/new files are uploaded
   - Sync happens in the background and doesn't block training
3. When training completes:
   - A final sync ensures all checkpoints are uploaded
   - The process waits for the final sync to complete
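The loop above is conceptually similar to this shell sketch (not the actual implementation, which runs as a Python background process; paths are placeholders):

```shell
# Conceptual sketch of the background sync loop
LOGS_DIR=/scratch/training/my-experiment
REMOTE=s3://my-bucket/training-runs/my-experiment
FREQ=300

while true; do
    sleep "$FREQ"
    # aws s3 sync only uploads new or changed files
    aws s3 sync "$LOGS_DIR" "$REMOTE"
done
```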
### What Gets Synced

All files in the local logs directory:

- Checkpoints (`epoch_*.pt`)
- TensorBoard logs
- Training logs
- Configuration files
- Any other files in the directory
### Sync Command (S3)

Under the hood, the `s3` protocol shells out to the AWS CLI (`aws s3 sync <local-logs-dir> <remote-path>`).

## Performance Considerations
### Sync Frequency

Too frequent:

- Wastes bandwidth
- May impact training performance
- Unnecessary for large checkpoints

Too infrequent:

- Risk losing more progress on failure
- Longer wait for the final sync

Recommended starting points:

- Small models: 300-600 seconds
- Large models: 600-1800 seconds
- Fast networks: 300 seconds
- Slow networks: 900+ seconds
### Network Impact

- Sync runs in a background process
- Minimal impact on training throughput
- May affect data loading if sharing bandwidth
- Use local data loading when possible
## Troubleshooting

### Sync Failing

Check that AWS credentials are valid (e.g. `aws sts get-caller-identity`) and that you have write access to the bucket.

### Resume from S3 Failing

Verify that the checkpoint exists, e.g. with `aws s3 ls` on the checkpoint path.

### Slow Sync
- Use `--remote-sync-protocol s3` (not `fsspec`)
- Increase `--remote-sync-frequency`
- Check network bandwidth
- Consider using S3 Transfer Acceleration
- Reduce checkpoint size if possible
### "Resume latest" Not Finding Checkpoint

- Ensure sync completed before trying to resume
- Check that the remote path matches expectations
- Use an explicit checkpoint path instead of `latest`
- Verify that `--remote-sync-protocol s3` is set
## Best Practices

- **Always use remote sync for long training runs**
  - Prevents data loss from hardware failures
  - Enables easy resume from any point
- **Use local fast storage with remote backup**
  - Local SSD for training speed
  - S3 for durability and sharing
- **Delete old local checkpoints**
  - Use `--delete-previous-checkpoint`
  - Keep all checkpoints in remote storage
- **Set an appropriate sync frequency**
  - Balance safety against performance
  - Consider checkpoint size and network speed
- **Test sync before long runs**
  - Verify credentials and permissions
  - Test a manual sync first
  - Monitor the first few syncs
- **Use unique experiment names**
  - Prevents conflicts in shared storage
  - Makes checkpoints easy to find
  - Include a timestamp or identifier
- **Monitor the sync process**
  - Check logs for sync errors
  - Verify files are appearing in remote storage
  - Test resume before you need it
