Supported Cloud Storage
- Amazon S3 - AWS object storage with S3 API compatibility
- Google Cloud Storage - Google Cloud’s scalable object storage
- Delta Lake (S3) - Open table format on Amazon S3
- Delta Lake (Azure) - Open table format on Azure Blob Storage
Amazon S3
Configuration
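A minimal destination config might look like the following sketch. Every key name here is an illustrative assumption, not the destination's actual schema:

```yaml
# Illustrative sketch - key names are assumptions
bucket: my-data-bucket
prefix: exports/users
file_format: parquet   # or csv
aws_access_key_id: <AWS_ACCESS_KEY_ID>
aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>
aws_region: us-east-1
# For S3-compatible storage (MinIO, Wasabi, ...):
# endpoint_url: http://localhost:9000
```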
Features
- Multiple file formats - Parquet (recommended) and CSV
- IAM role support - Secure credential-less authentication
- Date partitioning - Automatic folder organization by date
- Custom endpoints - Support for MinIO, Wasabi, and other S3-compatible storage
- Column header formatting - Lowercase or uppercase column names
- Automatic compression - Built-in Parquet compression
File Naming Convention
Files are automatically named with timestamps.

Parquet Format (Recommended)
Parquet provides:
- Columnar storage - Efficient compression and query performance
- Schema preservation - Maintains data types
- Fast reads - Optimized for analytics
- Small file size - 5-10x smaller than CSV
CSV Format
Use CSV for compatibility with tools that cannot read Parquet.

Date Partitioning
Organize data by date for efficient querying.

S3-Compatible Storage
Connect to MinIO, Wasabi, DigitalOcean Spaces, and other S3-compatible services by setting a custom endpoint.

Google Cloud Storage (GCS)
Configuration
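A minimal GCS destination config might look like the following sketch; key names other than google_application_credentials (which appears later in this doc) are illustrative assumptions:

```yaml
# Illustrative sketch - key names are assumptions
bucket: my-gcs-bucket
prefix: exports/users
file_format: parquet
google_application_credentials: /path/to/service-account-key.json
```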
Features
- Service account authentication - Secure access with JSON key files
- Application default credentials - Use GCE/GKE service accounts
- Parquet and CSV - Multiple file format support
- Date partitioning - Organize data by date
- Automatic retries - Built-in error handling
File Structure
GCS files follow the same convention as S3.

Authentication Methods
Service Account JSON Key
Create a service account with the Storage Object Creator role.

Reference in config:
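For example (the path is a placeholder):

```yaml
google_application_credentials: /path/to/service-account-key.json
```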
Application Default Credentials
When running on GCP (GCE, GKE, Cloud Run):
- Attach a service account to your compute instance
- Grant Storage Object Creator role to the service account
- Omit google_application_credentials from the config
Delta Lake on S3
Configuration
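A Delta Lake on S3 config might look like the following sketch; aws_region appears later in this doc, but the other key names are illustrative assumptions:

```yaml
# Illustrative sketch - key names are assumptions
table_uri: s3://my-data-bucket/delta/users
mode: append        # or overwrite
partition_cols: [date]
aws_access_key_id: <AWS_ACCESS_KEY_ID>
aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>
aws_region: us-east-1
```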
Features
- ACID transactions - Reliable writes with transaction log
- Schema evolution - Add/modify columns safely
- Time travel - Query historical versions
- Partition management - Automatic partition handling
- Overwrite mode - Replace specific partitions
- Data versioning - Track all changes with _delta_log
Delta Lake Structure
Delta Lake creates a table directory containing Parquet data files and a _delta_log transaction log.

Write Modes
Append Mode (Default)
Add new data without modifying existing records:
- Fastest write mode
- Always creates new files
- Ideal for immutable data
Overwrite Mode
Replace data, optionally by partition.

When partitioned:
- Only replaces affected partitions
- Other partitions remain unchanged
- Useful for daily/hourly updates

When not partitioned:
- Replaces entire table
- Use with caution
Partition Overwrite
When using overwrite mode with partitions, Mage:
- Writes new data to the Delta table
- Identifies affected partitions
- Removes old files from those partitions
- Updates Delta transaction log
Querying Delta Tables
Delta tables are compatible with:
- Apache Spark - Native support
- Databricks - Full Delta Lake features
- Trino/Presto - Via Delta Lake connector
- AWS Athena - Query Delta tables directly
- Delta-RS - Rust/Python library
Delta Lake on Azure
Configuration
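A Delta Lake on Azure config might look like the following sketch; every key name here is an illustrative assumption:

```yaml
# Illustrative sketch - key names are assumptions
table_uri: abfss://my-container@myaccount.dfs.core.windows.net/delta/users
mode: append
azure_storage_account_name: myaccount
azure_storage_access_key: <AZURE_STORAGE_ACCESS_KEY>
```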
Features
- Azure Blob Storage - Integration with Azure Data Lake Gen2
- ACID transactions - Same Delta Lake guarantees
- Schema evolution - Automatic schema management
- Partition support - Organize data efficiently
Azure Authentication
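As a hedged sketch, two common Azure Blob authentication options (key names are illustrative assumptions, not this destination's documented schema):

```yaml
# Option A: storage account name + access key (illustrative key names)
azure_storage_account_name: myaccount
azure_storage_access_key: <AZURE_STORAGE_ACCESS_KEY>
# Option B: a full connection string
# azure_storage_connection_string: <AZURE_STORAGE_CONNECTION_STRING>
```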
Data Type Handling
Parquet Schema Preservation
Parquet automatically preserves data types:

| Python Type | Parquet Type |
|---|---|
| str | STRING |
| int | INT64 |
| float | DOUBLE |
| bool | BOOLEAN |
| datetime | TIMESTAMP |
| date | DATE32 |
| list | LIST |
| dict | STRUCT |
CSV Limitations
CSV files lose type information:
- All columns are strings
- Datetime formatting may vary
- Arrays/objects become JSON strings
Internal Columns
All exports include tracking columns:
- _mage_created_at - ISO 8601 timestamp of creation
- _mage_updated_at - ISO 8601 timestamp of last update
Performance Optimization
Parquet Optimization
Use Parquet for Best Performance

Parquet provides:
- Columnar compression (5-10x smaller files)
- Predicate pushdown for faster queries
- Schema evolution support
- Native type preservation
- Good balance of speed and compression
- Fast decompression for queries
- ~2-4x compression ratio
Partition Strategy
Choose Partition Keys Wisely

Good partition keys:
- High cardinality, but not too high (100s to 1000s of partitions)
- Frequently used in WHERE clauses
- Evenly distributed data

Partition Size Guidelines
- Target 100MB - 1GB per partition
- Avoid small files (less than 10MB)
- Use date partitioning for time-series data
Delta Lake Optimization
Optimize Table
Periodically compact small files.

Vacuum Old Files
Remove old file versions no longer referenced by the transaction log.

Z-Order Clustering
Cluster data by frequently filtered columns.
Example: S3 Export with Partitioning
Testing Connections
Common Issues
S3 Permission Errors
Ensure IAM user/role has:
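A typical minimal IAM policy looks like the following sketch; the bucket name is a placeholder and the exact action list required by this destination is an assumption:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
    "Resource": ["arn:aws:s3:::my-data-bucket", "arn:aws:s3:::my-data-bucket/*"]
  }]
}
```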
GCS Permission Errors
Grant service account roles:
- roles/storage.objectCreator - Write access
- roles/storage.objectViewer - Read access (for testing)

Required permissions:
- storage.objects.create
- storage.objects.get
- storage.buckets.get
Delta Lake Region Errors
Error: "Received redirect without LOCATION"

Cause: AWS region mismatch

Solution: Ensure aws_region matches the S3 bucket region.

Next Steps
Streaming Destinations
Learn about Kafka and real-time data export
Data Warehouses
Configure BigQuery, Snowflake, and Redshift