Supported Platforms
Amazon S3
AWS object storage with global availability
Google Cloud Storage
GCP object storage and data lakes
Azure Blob Storage
Azure cloud storage platform
Amazon S3
Extract data from S3 buckets with support for CSV, Parquet, and multiple file patterns.

Configuration
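A minimal configuration sketch for the S3 source. The key names below (`bucket`, `prefix`, `file_type`, and the credential fields) are assumptions for illustration, not documented fields; check the source's configuration reference for the exact schema.

```yaml
# Illustrative S3 source configuration -- key names are assumptions
bucket: my-data-bucket        # bucket to scan
prefix: exports/users/        # only objects under this prefix are listed
file_type: csv                # csv or parquet
aws_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
aws_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
aws_region: us-east-1
```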
File Types
Supported file formats:
- `csv` - Comma-separated values
- `parquet` - Columnar storage format
IAM Role Authentication
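On AWS infrastructure the static credential keys can typically be dropped so that the attached role's credentials are picked up; sketched below under the same assumed schema (the `role_arn` field in particular is an assumption, and the ARN is a placeholder).

```yaml
# No access keys: the runtime's IAM role supplies credentials (sketch)
bucket: my-data-bucket
prefix: exports/users/
file_type: parquet
role_arn: arn:aws:iam::123456789012:role/s3-reader   # assumed field, placeholder ARN
```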
Use IAM roles instead of access keys when running inside AWS.

Multiple Tables Configuration
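One plausible shape for mapping prefixes to separate tables; the `tables` and `table_name` keys are assumptions for illustration.

```yaml
# Each prefix loads into its own table (illustrative structure)
bucket: my-data-bucket
tables:
  - table_name: users
    prefix: exports/users/
  - table_name: orders
    prefix: exports/orders/
```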
Extract different tables from different prefixes.

Custom S3-Compatible Endpoints
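Targeting an S3-compatible service usually comes down to overriding the endpoint URL; a MinIO-flavored sketch (the `aws_endpoint` key name is an assumption, and the credentials are placeholders).

```yaml
# S3-compatible endpoint override (illustrative)
bucket: my-data-bucket
aws_endpoint: http://minio.internal:9000   # assumed key; MinIO, Spaces, etc.
aws_access_key_id: minio_user
aws_secret_access_key: minio_password
```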
Connect to MinIO, DigitalOcean Spaces, or other S3-compatible services.

Google Cloud Storage
Load data from GCS buckets with service account authentication.

Configuration
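A GCS configuration sketch, under the same caveat that the key names are assumptions rather than documented fields.

```yaml
# Illustrative GCS source configuration -- key names are assumptions
bucket: my-gcs-bucket
prefix: exports/users/
file_type: csv
path_to_credentials_json_file: /secrets/gcs-service-account.json
```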
Credentials Info (Alternative)
Pass credentials directly instead of a file path:
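A sketch of inlining the service-account JSON fields under an assumed `credentials_info` key (all values are placeholders):

```yaml
# Inline service-account credentials (illustrative)
bucket: my-gcs-bucket
credentials_info:
  type: service_account
  project_id: my-project
  private_key_id: abc123
  private_key: "-----BEGIN PRIVATE KEY-----\n..."
  client_email: loader@my-project.iam.gserviceaccount.com
```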
Setup GCS Service Account
- Go to Google Cloud Console
- Navigate to IAM & Admin > Service Accounts
- Create a new service account
- Grant “Storage Object Viewer” role
- Create and download JSON key
- Grant service account access to the bucket
Azure Blob Storage
Extract data from Azure Blob Storage containers.

Configuration
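An Azure configuration sketch; `connection_string`, `container_name`, and `prefix` are assumed key names.

```yaml
# Illustrative Azure Blob Storage source configuration -- key names are assumptions
connection_string: "{{ env_var('AZURE_STORAGE_CONNECTION_STRING') }}"
container_name: raw-data
prefix: exports/users/
file_type: csv
```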
Connection String Format
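Azure storage connection strings follow a standard semicolon-delimited layout; the account name and key below are placeholders.

```yaml
connection_string: "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
```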
Get Azure Connection String
- Go to Azure Portal
- Navigate to Storage Accounts
- Select your storage account
- Go to Access keys
- Copy the connection string
File Format Support
CSV Files
Automatically detect encoding and delimiters:
- Various encodings (UTF-8, Latin-1, etc.)
- Different delimiters (comma, tab, pipe)
- Headers and data types
- Null values
Parquet Files
Native Parquet support for efficient columnar storage:
- Compressed storage
- Column pruning
- Predicate pushdown
- Schema preservation
Excel Files (API Source)
For Excel files, use the generic API source.

Search Patterns
Use regex patterns to filter files.
Match specific file names
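For instance, assuming a `search_pattern` key with regex semantics:

```yaml
search_pattern: "users_.*\\.csv"   # matches users_2024.csv, users_01.csv
```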
Matches `users_2024.csv`, `users_01.csv`
Match by date
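A date-constrained pattern under the same assumed `search_pattern` key:

```yaml
search_pattern: "data_2024-03-\\d{2}\\.parquet"   # matches data_2024-03-01.parquet, data_2024-03-15.parquet
```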
Matches `data_2024-03-01.parquet`, `data_2024-03-15.parquet`
Exclude files
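If the regex engine supports lookaheads, a negative lookahead can exclude files (same assumed key):

```yaml
search_pattern: "^(?!.*backup).*\\.csv$"   # any .csv whose name does not contain "backup"
```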
Multiple extensions
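An alternation covers several extensions at once (same assumed key):

```yaml
search_pattern: ".*\\.(csv|parquet)$"   # picks up both CSV and Parquet files
```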
Incremental Loading
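A sketch of an incremental setup; the `start_date` key and the bookmark behavior described in the comment are assumptions.

```yaml
# Incremental sketch: only objects with LastModified after the bookmark are read
bucket: my-data-bucket
prefix: exports/users/
start_date: "2024-03-01T00:00:00Z"   # assumed key; initial lower bound
```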
Load only new or modified files, based on the LastModified timestamp.

Schema Discovery
Mage automatically infers the schema from the first file it reads.

Partitioned Data
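Partitioned layouts can typically be targeted with the prefix alone; an illustrative sketch (key names are assumptions):

```yaml
# Hive-style partition addressed via prefix
bucket: my-data-bucket
prefix: events/year=2024/month=03/   # loads every file under this partition
file_type: parquet
```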
Load data organized in partitioned folders.

SFTP Support
For SFTP file transfers, use the SFTP source.

Installation
Best Practices
- Use IAM roles instead of access keys when running in cloud environments
- Organize files by prefix for easier management and partitioning
- Use Parquet format for large datasets to reduce storage and transfer costs
- Implement file naming conventions that include timestamps
- Enable versioning on buckets for data recovery
- Set lifecycle policies to archive old files
- Monitor storage costs and set up alerts
- Use incremental loading to avoid reprocessing unchanged files
- Test with small file sets before full production runs
- Compress CSV files (gzip) to reduce transfer time
Performance Tips
Large Files
For very large files, consider:
- Using Parquet instead of CSV
- Splitting files into smaller chunks
- Using columnar formats for analytics
Many Small Files
For many small files:
- Combine files before loading
- Use batching in your pipeline
- Consider using streaming sources instead
Cross-Region Transfers
For cross-region data:
- Use the same region for the source bucket and the Mage deployment
- Enable transfer acceleration (S3)
- Consider data replication
Troubleshooting
Access Denied Error
Check IAM permissions:
- S3: `s3:GetObject`, `s3:ListBucket`
- GCS: `storage.objects.get`, `storage.objects.list`
- Azure: Storage Blob Data Reader role
Schema Mismatch
Ensure all files have consistent schema:
- Same column names
- Same data types
- Same delimiter (for CSV)
Empty Results
Verify:
- Prefix path is correct
- Search pattern matches files
- Files are not empty
- Credentials have access
Encoding Issues
For CSV files with special characters:
- Ensure UTF-8 encoding
- Check for BOM characters
- Verify delimiter characters
Next Steps
Streaming Sources
Real-time data from Kafka, Kinesis, and Pub/Sub
Database Sources
Connect to PostgreSQL, MySQL, Snowflake, and more