Dask integration is currently not actively supported in Mage. The implementation exists as commented code in the codebase but is not enabled in the current release.

Current Status

Source code analysis shows that Dask integration has been explored but is not currently active:
mage_ai/data_preparation/models/utils.py
# def dask_from_pandas(df: pd.DataFrame) -> dd:
#     # Dask DataFrame conversion logic (commented out)
References exist in:
  • mage_ai/data_preparation/models/variable.py (commented imports)
  • Variable serialization logic (disabled)

Why Dask?

Dask would provide several benefits for Mage users:
  1. Parallel Processing: Scale Pandas workloads across multiple cores or machines
  2. Larger-than-Memory: Process datasets that don’t fit in RAM
  3. Familiar API: Use Pandas-like syntax with distributed execution
  4. Dynamic Task Graphs: Lazy evaluation and optimized execution plans

Alternative Solutions

While native Dask integration is not available, you can still use Dask in Mage:

Option 1: Manual Dask Session

Create and manage Dask clients directly in your blocks:
from dask.distributed import Client
import dask.dataframe as dd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

@data_loader
def load_data(*args, **kwargs):
    # Create Dask client
    client = Client(n_workers=4, threads_per_worker=2)
    
    # Store in context for other blocks
    kwargs['context']['dask_client'] = client
    
    # Read data with Dask
    ddf = dd.read_csv('s3://bucket/data/*.csv')
    
    return ddf

Option 2: Dask on Kubernetes

Deploy a Dask cluster on Kubernetes and connect from Mage:
Step 1: Deploy Dask Cluster

Use Helm to deploy Dask:
helm repo add dask https://helm.dask.org/
helm install my-dask dask/dask
Step 2: Get Scheduler Address

Look up the scheduler service's address:

kubectl get service my-dask-scheduler
Step 3: Connect from Mage

Point a Dask client at the scheduler from within a Mage block:

from dask.distributed import Client

client = Client('tcp://my-dask-scheduler:8786')
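Once connected, a quick smoke test confirms the client can ship work to the cluster. The sketch below substitutes an in-process client for the `tcp://` address so it is self-contained; the submit/result round trip is the same either way:

```python
from dask.distributed import Client

# In production, point at the scheduler service instead:
#   client = Client('tcp://my-dask-scheduler:8786')
client = Client(processes=False)  # in-process stand-in for this sketch

# Smoke test: submit a task to the scheduler and fetch the result
future = client.submit(sum, range(10))
result = future.result()
print(result)  # 45

print(client.dashboard_link)  # diagnostics dashboard for tasks and memory
client.close()
```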

Option 3: Use Spark Instead

For production-grade distributed computing, consider using Spark integration:

Spark Integration

Full PySpark support with AWS EMR

Kubernetes Executor

Run distributed workloads on K8s

Feature Request

Interested in native Dask integration? Open a feature request on GitHub Issues to let the team know.

Comparison: Dask vs Spark

Feature             Dask                         Spark (Supported)
API Familiarity     Pandas-like                  SQL + DataFrame API
Setup Complexity    Low                          Medium (requires EMR/cluster)
Ecosystem           Python-focused               Multi-language (Python, Scala, Java)
Performance         Good for Python workloads    Excellent for JVM workloads
Mage Integration    Manual setup required        Native integration
Cloud Support       Self-managed                 AWS EMR (managed)

Best Practices for Manual Dask Usage

If you choose to use Dask manually in Mage:
  1. Client Management: Always create clients at the beginning and close them at the end
  2. Context Sharing: Store the Dask client in kwargs['context'] to share across blocks
  3. Compute Strategically: Use .compute() only when necessary to trigger execution
  4. Memory Monitoring: Monitor Dask dashboard for memory usage and task graphs
  5. Chunking: Partition data appropriately with blocksize parameter
  6. Persistence: Use .persist() for intermediate results used multiple times

Dask Documentation

Official Dask documentation

Dask on Kubernetes

Deploy Dask clusters on Kubernetes

Spark Integration

Alternative distributed computing with Spark

GitHub Issues

Request native Dask support
