Supported Cloud Storage
- Amazon S3 - AWS object storage with S3 API compatibility
- Google Cloud Storage - Google Cloud’s scalable object storage
- Delta Lake (S3) - Open table format on Amazon S3
- Delta Lake (Azure) - Open table format on Azure Blob Storage
Amazon S3
Configuration
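A minimal destination config might look like the following sketch. Every key name here is an illustrative assumption, not the destination's actual schema:

```yaml
# Illustrative sketch - key names are assumptions
bucket: my-data-bucket
prefix: exports/users
file_format: parquet   # or csv
aws_access_key_id: <AWS_ACCESS_KEY_ID>
aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>
aws_region: us-east-1
# For S3-compatible storage (MinIO, Wasabi, ...):
# endpoint_url: http://localhost:9000
```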
Features
- Multiple file formats - Parquet (recommended) and CSV
- IAM role support - Secure credential-less authentication
- Date partitioning - Automatic folder organization by date
- Custom endpoints - Support for MinIO, Wasabi, and other S3-compatible storage
- Column header formatting - Lowercase or uppercase column names
- Automatic compression - Built-in Parquet compression
File Naming Convention
Files are automatically named with timestamps.

Parquet Format (Recommended)
Parquet provides:
- Columnar storage - Efficient compression and query performance
- Schema preservation - Maintains data types
- Fast reads - Optimized for analytics
- Small file size - 5-10x smaller than CSV
CSV Format
Use CSV for compatibility with tools that cannot read Parquet.

Date Partitioning
Organize data by date for efficient querying.

S3-Compatible Storage
Connect to MinIO, Wasabi, DigitalOcean Spaces, and other S3-compatible services by setting a custom endpoint.

Google Cloud Storage (GCS)
Configuration
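A minimal GCS destination config might look like the following sketch; key names other than google_application_credentials (which appears later in this doc) are illustrative assumptions:

```yaml
# Illustrative sketch - key names are assumptions
bucket: my-gcs-bucket
prefix: exports/users
file_format: parquet
google_application_credentials: /path/to/service-account-key.json
```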
Features
- Service account authentication - Secure access with JSON key files
- Application default credentials - Use GCE/GKE service accounts
- Parquet and CSV - Multiple file format support
- Date partitioning - Organize data by date
- Automatic retries - Built-in error handling
File Structure
GCS files follow the same convention as S3.

Authentication Methods
Service Account JSON Key
Create a service account with the Storage Object Creator role.

Reference in config:
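For example (the path is a placeholder):

```yaml
google_application_credentials: /path/to/service-account-key.json
```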
Application Default Credentials
When running on GCP (GCE, GKE, Cloud Run):
- Attach a service account to your compute instance
- Grant Storage Object Creator role to the service account
- Omit google_application_credentials from the config
Delta Lake on S3
Configuration
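A Delta Lake on S3 config might look like the following sketch; aws_region appears later in this doc, but the other key names are illustrative assumptions:

```yaml
# Illustrative sketch - key names are assumptions
table_uri: s3://my-data-bucket/delta/users
mode: append        # or overwrite
partition_cols: [date]
aws_access_key_id: <AWS_ACCESS_KEY_ID>
aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>
aws_region: us-east-1
```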
Features
- ACID transactions - Reliable writes with transaction log
- Schema evolution - Add/modify columns safely
- Time travel - Query historical versions
- Partition management - Automatic partition handling
- Overwrite mode - Replace specific partitions
- Data versioning - Track all changes with _delta_log
Delta Lake Structure
Delta Lake creates a table directory containing Parquet data files and a _delta_log transaction log.

Write Modes
Append Mode (Default)
Add new data without modifying existing records:
- Fastest write mode
- Always creates new files
- Ideal for immutable data
Overwrite Mode
Replace data, optionally by partition.

When partitioned:
- Only replaces affected partitions
- Other partitions remain unchanged
- Useful for daily/hourly updates

When not partitioned:
- Replaces entire table
- Use with caution
Partition Overwrite
When using overwrite mode with partitions, Mage:
- Writes new data to the Delta table
- Identifies affected partitions
- Removes old files from those partitions
- Updates Delta transaction log
Querying Delta Tables
Delta tables are compatible with:
- Apache Spark - Native support
- Databricks - Full Delta Lake features
- Trino/Presto - Via Delta Lake connector
- AWS Athena - Query Delta tables directly
- Delta-RS - Rust/Python library
Delta Lake on Azure
Configuration
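A Delta Lake on Azure config might look like the following sketch; every key name here is an illustrative assumption:

```yaml
# Illustrative sketch - key names are assumptions
table_uri: abfss://my-container@myaccount.dfs.core.windows.net/delta/users
mode: append
azure_storage_account_name: myaccount
azure_storage_access_key: <AZURE_STORAGE_ACCESS_KEY>
```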
Features
- Azure Blob Storage - Integration with Azure Data Lake Gen2
- ACID transactions - Same Delta Lake guarantees
- Schema evolution - Automatic schema management
- Partition support - Organize data efficiently
Azure Authentication
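As a hedged sketch, two common Azure Blob authentication options (key names are illustrative assumptions, not this destination's documented schema):

```yaml
# Option A: storage account name + access key (illustrative key names)
azure_storage_account_name: myaccount
azure_storage_access_key: <AZURE_STORAGE_ACCESS_KEY>
# Option B: a full connection string
# azure_storage_connection_string: <AZURE_STORAGE_CONNECTION_STRING>
```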
Data Type Handling
Parquet Schema Preservation
Parquet automatically preserves data types:

| Python Type | Parquet Type |
|---|---|
| str | STRING |
| int | INT64 |
| float | DOUBLE |
| bool | BOOLEAN |
| datetime | TIMESTAMP |
| date | DATE32 |
| list | LIST |
| dict | STRUCT |
CSV Limitations
CSV files lose type information:
- All columns are strings
- Datetime formatting may vary
- Arrays/objects become JSON strings
Internal Columns
All exports include tracking columns:
- _mage_created_at - ISO 8601 timestamp of creation
- _mage_updated_at - ISO 8601 timestamp of last update
Performance Optimization
Parquet Optimization
Use Parquet for Best Performance

Parquet provides:
- Columnar compression (5-10x smaller files)
- Predicate pushdown for faster queries
- Schema evolution support
- Native type preservation
- Good balance of speed and compression
- Fast decompression for queries
- ~2-4x compression ratio
Partition Strategy
Choose Partition Keys Wisely

Good partition keys:
- High cardinality, but not too high (100s to 1000s of partitions)
- Frequently used in WHERE clauses
- Evenly distributed data

Partition Size Guidelines
- Target 100MB - 1GB per partition
- Avoid small files (less than 10MB)
- Use date partitioning for time-series data
Delta Lake Optimization
Optimize Table
Periodically compact small files.

Vacuum Old Files
Remove old file versions no longer referenced by the transaction log.

Z-Order Clustering
Cluster data by frequently filtered columns.
Example: S3 Export with Partitioning
Testing Connections
Common Issues
S3 Permission Errors
Ensure IAM user/role has:
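A typical minimal IAM policy looks like the following sketch; the bucket name is a placeholder and the exact action list required by this destination is an assumption:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
    "Resource": ["arn:aws:s3:::my-data-bucket", "arn:aws:s3:::my-data-bucket/*"]
  }]
}
```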
GCS Permission Errors
Grant service account roles:
- roles/storage.objectCreator - Write access
- roles/storage.objectViewer - Read access (for testing)

Required permissions:
- storage.objects.create
- storage.objects.get
- storage.buckets.get
Delta Lake Region Errors
Error: "Received redirect without LOCATION"

Cause: AWS region mismatch

Solution: Ensure aws_region matches the S3 bucket region.

Next Steps
Streaming Destinations
Learn about Kafka and real-time data export
Data Warehouses
Configure BigQuery, Snowflake, and Redshift