Dask integration is not actively supported in Mage. The implementation exists as commented-out code in the codebase and is not enabled in the current release.
## Current Status

Based on source code analysis, Dask integration has been explored but is not currently active:

- `mage_ai/data_preparation/models/utils.py`
- `mage_ai/data_preparation/models/variable.py`: commented imports; variable serialization logic (disabled)
## Why Dask?

Dask would provide several benefits for Mage users:

- **Parallel Processing**: Scale Pandas workloads across multiple cores or machines
- **Larger-than-Memory**: Process datasets that don't fit in RAM
- **Familiar API**: Use Pandas-like syntax with distributed execution
- **Dynamic Task Graphs**: Lazy evaluation and optimized execution plans
## Alternative Solutions

While native Dask integration is not available, you can still use Dask in Mage.

### Option 1: Manual Dask Session

Create and manage Dask clients directly in your blocks.

### Option 2: Dask on Kubernetes

Deploy a Dask cluster on Kubernetes and connect from Mage.

### Option 3: Use Spark Instead
For production-grade distributed computing, consider the supported Spark integration:

- **Spark Integration**: Full PySpark support with AWS EMR
- **Kubernetes Executor**: Run distributed workloads on K8s
## Feature Request

Interested in native Dask integration? We'd love to hear from you:

- Vote on the feature request in GitHub Issues
- Join the discussion in Mage Slack
- Contribute to the implementation
## Comparison: Dask vs Spark
| Feature | Dask | Spark (Supported) |
|---|---|---|
| API Familiarity | Pandas-like | SQL + DataFrame API |
| Setup Complexity | Low | Medium (requires EMR/cluster) |
| Ecosystem | Python-focused | Multi-language (Python, Scala, Java) |
| Performance | Good for Python workloads | Excellent for JVM workloads |
| Mage Integration | Manual setup required | Native integration |
| Cloud Support | Self-managed | AWS EMR (managed) |
## Best Practices for Manual Dask Usage

If you choose to use Dask manually in Mage:

- **Client Management**: Always create clients at the beginning and close them at the end
- **Context Sharing**: Store the Dask client in `kwargs['context']` to share it across blocks
- **Compute Strategically**: Call `.compute()` only when necessary to trigger execution
- **Memory Monitoring**: Watch the Dask dashboard for memory usage and task graphs
- **Chunking**: Partition data appropriately with the `blocksize` parameter
- **Persistence**: Use `.persist()` for intermediate results that are used multiple times
## Related Resources

- **Dask Documentation**: Official Dask documentation
- **Dask on Kubernetes**: Deploy Dask clusters on Kubernetes
- **Spark Integration**: Alternative distributed computing with Spark
- **GitHub Issues**: Request native Dask support