## Overview

Remote training features in OpenCLIP:

- Resume from remote paths: Load checkpoints directly from S3 or other remote storage
- Automatic backup: Continuously sync training checkpoints to remote storage
- fsspec support: Work with any filesystem supported by fsspec
- Checkpoint management: Automatically clean up old checkpoints to save space
## Resuming from Remote Checkpoints

You can resume training directly from a remote checkpoint without downloading it first.

### Resume from S3

Use the S3 URI directly in the `--resume` flag:
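For example, a minimal sketch (the `open_clip_train.main` entry point, model choice, and all paths here are illustrative and may differ in your setup):

```shell
# Resume directly from a checkpoint stored in S3 (illustrative paths)
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --resume "s3://my-bucket/experiments/run1/checkpoints/epoch_12.pt"
```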
### Resume from Other Remote Storage

OpenCLIP uses fsspec to support various storage backends:
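For example, a Google Cloud Storage URI can be used the same way, assuming the matching fsspec backend package is installed (e.g. `gcsfs` for `gs://`, `adlfs` for `az://`); the paths below are illustrative:

```shell
# Resume from Google Cloud Storage via fsspec (requires gcsfs)
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --resume "gs://my-bucket/experiments/run1/checkpoints/epoch_12.pt"
```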
### Complete Resume Example
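A fuller sketch of resuming a run from S3; the entry point, dataset shards, and bucket names are placeholders:

```shell
python -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/openclip \
    --name my-experiment \
    --resume "s3://my-bucket/experiments/my-experiment/checkpoints/epoch_12.pt"
```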
## Automatic Remote Backup

Continuously back up training checkpoints to remote storage during training. This prevents data loss and enables easy resume from any point.

### Basic Remote Sync Setup

Use `--remote-sync` to specify the remote destination:
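A sketch of such a run (entry point, dataset, and bucket names are placeholders):

```shell
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/training \
    --name my-experiment \
    --remote-sync "s3://my-bucket/training-runs"
```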
- Save checkpoints locally to `/scratch/training/my-experiment/`
- Sync to `s3://my-bucket/training-runs/my-experiment/`
- Run sync in the background every 5 minutes (default)
### Remote Sync Parameters

#### `--remote-sync`

Specify the remote path for backup:

- S3: `s3://bucket-name/path`
- S3 with credentials: `s3://bucket-name/path` (uses AWS credentials)
- Other fsspec backends: `gs://`, `az://`, etc.
#### `--remote-sync-frequency`

How often to sync (in seconds):

- Fast storage: 300 seconds (5 minutes)
- Slow storage: 900-1800 seconds (15-30 minutes)
- Large checkpoints: 600+ seconds
- Small checkpoints: 300 seconds
#### `--remote-sync-protocol`

Specify the sync protocol. The `fsspec` protocol is currently experimental and very slow; use `s3` for production workloads.
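For example, selecting the `s3` protocol explicitly alongside `--remote-sync` (bucket name is a placeholder):

```shell
--remote-sync "s3://my-bucket/training-runs" \
--remote-sync-protocol s3
```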
## Complete Remote Training Examples

### Example 1: S3 Training with Backup

Train with local SSD and automatic S3 backup:

- Local checkpoints: `/scratch/openclip/vitb32-cc12m/checkpoints/`
- Remote backup: `s3://my-training-bucket/experiments/vitb32-cc12m/checkpoints/`
- Syncs every 5 minutes
- Final sync when training completes
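The setup above might be launched like this (entry point and dataset shards are illustrative):

```shell
python -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/openclip \
    --name vitb32-cc12m \
    --remote-sync "s3://my-training-bucket/experiments" \
    --remote-sync-frequency 300
```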
### Example 2: Multi-GPU with Remote Sync
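A sketch using `torchrun` for 8 GPUs on one node (names and paths are placeholders):

```shell
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256 \
    --logs /scratch/openclip \
    --name vitb32-multigpu \
    --remote-sync "s3://my-training-bucket/experiments"
```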
### Example 3: Resume from S3 and Continue Syncing
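A sketch combining `--resume` from S3 with ongoing sync back to the same location (the checkpoint name is hypothetical):

```shell
python -m open_clip_train.main \
    --train-data "/data/cc12m/{00000..01242}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/openclip \
    --name vitb32-cc12m \
    --resume "s3://my-training-bucket/experiments/vitb32-cc12m/checkpoints/epoch_12.pt" \
    --remote-sync "s3://my-training-bucket/experiments"
```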
## Checkpoint Management

### Delete Previous Checkpoints

Save disk space by automatically deleting old checkpoints:

- Keep only the most recent checkpoint locally
- Delete previous checkpoints after saving a new one
- Remote backups are unaffected (all checkpoints synced)
- Useful when local storage is limited
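A sketch combining local cleanup with remote backup (paths and names are placeholders):

```shell
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/training \
    --name my-experiment \
    --delete-previous-checkpoint \
    --remote-sync "s3://my-bucket/training-runs"
```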
With remote sync enabled:

- Local: Only `epoch_latest.pt` is kept
- Remote: All epochs synced to S3 (`epoch_1.pt`, `epoch_2.pt`, etc.)
### Resume Latest from Remote

When using `--resume latest` with remote sync:

- Only works with `--remote-sync-protocol s3`
- Does not work with `--save-most-recent`
- Checks remote storage for the latest checkpoint
- May not find a checkpoint if sync is still in progress
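A sketch (placeholders as before):

```shell
python -m open_clip_train.main \
    --train-data "/data/train-{0000..0999}.tar" \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --logs /scratch/training \
    --name my-experiment \
    --resume latest \
    --remote-sync "s3://my-bucket/training-runs" \
    --remote-sync-protocol s3
```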
## AWS S3 Configuration

### AWS Credentials

Ensure AWS credentials are configured, e.g. via `aws configure` or the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables.

### S3 Bucket Setup
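One way to create a bucket and verify write access with the AWS CLI (the bucket name is a placeholder):

```shell
# Create the bucket once, then confirm you can write to and delete from it
aws s3 mb s3://my-training-bucket
echo "sync test" > /tmp/sync-test.txt
aws s3 cp /tmp/sync-test.txt s3://my-training-bucket/sync-test.txt
aws s3 rm s3://my-training-bucket/sync-test.txt
```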
## Workflow Patterns

### Pattern 1: Fast Local Storage + S3 Backup

Best for: Training on cloud instances with local SSDs

### Pattern 2: Resume After Interruption

Best for: Spot instances, preemptible VMs

### Pattern 3: Centralized Checkpoint Storage

Best for: Team collaboration, multiple training nodes

## Sync Process Details
### How Sync Works

1. Training starts, and a background sync process is launched
2. Every `--remote-sync-frequency` seconds:
   - All files are synced from the local logs directory to remote
   - Only changed/new files are uploaded
   - Sync happens in the background and doesn't block training
3. When training completes:
   - A final sync ensures all checkpoints are uploaded
   - The process waits for the final sync to complete
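The loop above is conceptually similar to this shell sketch (not the actual implementation, which runs as a Python background process; paths are placeholders):

```shell
# Conceptual sketch of the background sync loop
LOGS_DIR=/scratch/training/my-experiment
REMOTE=s3://my-bucket/training-runs/my-experiment
FREQ=300

while true; do
    sleep "$FREQ"
    # aws s3 sync only uploads new or changed files
    aws s3 sync "$LOGS_DIR" "$REMOTE"
done
```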
### What Gets Synced

All files in the local logs directory:

- Checkpoints (`epoch_*.pt`)
- TensorBoard logs
- Training logs
- Configuration files
- Any other files in the directory
### Sync Command (S3)

Under the hood, the `s3` protocol shells out to the AWS CLI (`aws s3 sync <local-logs-dir> <remote-path>`).

## Performance Considerations
### Sync Frequency

Too frequent:

- Wastes bandwidth
- May impact training performance
- Unnecessary for large checkpoints

Too infrequent:

- Risk losing more progress on failure
- Longer wait for the final sync

Recommended starting points:

- Small models: 300-600 seconds
- Large models: 600-1800 seconds
- Fast networks: 300 seconds
- Slow networks: 900+ seconds
### Network Impact

- Sync runs in a background process
- Minimal impact on training throughput
- May affect data loading if sharing bandwidth
- Use local data loading when possible
## Troubleshooting

### Sync Failing

Check that AWS credentials are valid (e.g. `aws sts get-caller-identity`) and that you have write access to the bucket.

### Resume from S3 Failing

Verify that the checkpoint exists, e.g. with `aws s3 ls` on the checkpoint path.

### Slow Sync
- Use `--remote-sync-protocol s3` (not `fsspec`)
- Increase `--remote-sync-frequency`
- Check network bandwidth
- Consider using S3 Transfer Acceleration
- Reduce checkpoint size if possible
### "Resume latest" Not Finding Checkpoint

- Ensure sync completed before trying to resume
- Check that the remote path matches expectations
- Use an explicit checkpoint path instead of `latest`
- Verify that `--remote-sync-protocol s3` is set
## Best Practices

- **Always use remote sync for long training runs**
  - Prevents data loss from hardware failures
  - Enables easy resume from any point
- **Use local fast storage with remote backup**
  - Local SSD for training speed
  - S3 for durability and sharing
- **Delete old local checkpoints**
  - Use `--delete-previous-checkpoint`
  - Keep all checkpoints in remote storage
- **Set an appropriate sync frequency**
  - Balance safety against performance
  - Consider checkpoint size and network speed
- **Test sync before long runs**
  - Verify credentials and permissions
  - Test a manual sync first
  - Monitor the first few syncs
- **Use unique experiment names**
  - Prevents conflicts in shared storage
  - Makes checkpoints easy to find
  - Include a timestamp or identifier
- **Monitor the sync process**
  - Check logs for sync errors
  - Verify files are appearing in remote storage
  - Test resume before you need it
