Overview
The Spark integration uses the `pyspark` executor type to run pipeline blocks on AWS EMR clusters. Mage automatically:
- Uploads execution scripts to S3
- Provisions or connects to EMR clusters
- Submits Spark jobs with custom configurations
- Manages cluster lifecycle and resource allocation
The PySpark executor is implemented in `mage_ai/data_preparation/executors/pyspark_block_executor.py` and requires AWS credentials and an S3 bucket.
Prerequisites
AWS Account and Credentials
Ensure you have AWS credentials configured with permissions for:
- S3 bucket access (read/write)
- EMR cluster management
- EC2 instance creation
S3 Bucket
Create an S3 bucket for storing:
- Execution scripts
- Bootstrap scripts
- Spark logs and output
- Remote variables
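The bucket can be created with the AWS CLI; the bucket name and region below are placeholders:

```shell
# Create a dedicated bucket for Mage's Spark artifacts
aws s3 mb s3://your-mage-bucket --region us-west-2
```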
Configuration
Project-Level Configuration
Configure Spark in your project’s `metadata.yaml` file:
metadata.yaml
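A hedged example of an EMR configuration block — key names such as `emr_config`, `remote_variables_dir`, and `slave_instance_count` follow Mage's documented schema, but verify them against your version:

```yaml
# Project metadata.yaml (values are placeholders)
remote_variables_dir: 's3://your-mage-bucket/prod'

emr_config:
  master_instance_type: 'r5.4xlarge'
  slave_instance_type: 'r5.4xlarge'
  slave_instance_count: 2
  # Optional extras, both read from S3
  spark_jars: ['s3://your-mage-bucket/jars/custom.jar']
  bootstrap_script_path: 's3://your-mage-bucket/scripts/bootstrap.sh'
```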
Default Instance Memory Mapping
Mage automatically configures driver memory based on the master instance type (see `mage_ai/services/aws/emr/config.py` for the mapping).
Pipeline-Level Configuration
Create a PySpark pipeline or configure an existing pipeline.
Execution Flow
When a PySpark block executes, Mage follows this workflow (from `pyspark_block_executor.py:38-55`):
Upload Execution Script
Generate and upload the block execution script to S3. The script includes:
- Block code
- Global variables
- Pipeline configuration
- Execution partition information
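The upload step can be sketched with boto3; the helper name, bucket, and key below are illustrative, not Mage's actual implementation:

```python
def upload_execution_script(s3_client, bucket: str, key: str, script: str) -> str:
    """Upload a generated block-execution script to S3 and return its URI.

    `s3_client` is a boto3 S3 client; injecting it keeps the helper
    testable without AWS credentials.
    """
    s3_client.put_object(Bucket=bucket, Key=key, Body=script.encode('utf-8'))
    return f's3://{bucket}/{key}'
```

The EMR spark-submit step would then reference the returned `s3://` URI.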
Docker Integration
Mage provides a Docker image with Spark pre-installed.
Using the Official Spark Dockerfile
integrations/spark/Dockerfile
Build and Run
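A hedged sketch of the build-and-run commands, assuming you build from the repository root; the image tag, port mapping, and project name are illustrative:

```shell
# Build the Spark-enabled Mage image from the bundled Dockerfile
docker build -t mage_spark -f integrations/spark/Dockerfile .

# Run Mage with the current directory mounted and the UI on port 6789
docker run -it -p 6789:6789 -v "$(pwd):/home/src" mage_spark \
  mage start my_project
```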
Custom Spark Cluster with Helm
Mage includes a custom Spark cluster configuration for Kubernetes deployments.
Setup Instructions
From `integrations/custom_spark/README.md`:
Connect Mage to Spark Cluster
Set the Spark master URL in your environment or in `metadata.yaml`:
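A hedged sketch of the `metadata.yaml` option — the `spark_config` key and its fields follow Mage's documented schema, but verify them for your version; the host is a placeholder:

```yaml
# Project metadata.yaml — point Mage at a standalone Spark master
spark_config:
  app_name: 'mage_pipeline'
  spark_master: 'spark://<master-host>:7077'
```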
Writing PySpark Code
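A minimal block sketch, assuming Mage's convention of injecting the live SparkSession into the block as `kwargs['spark']`:

```python
# Hypothetical transformer block in a PySpark pipeline.
# The pyspark executor injects the SparkSession; the block never creates one.
def transform(df, **kwargs):
    spark = kwargs['spark']  # SparkSession provided by the executor
    df.createOrReplaceTempView('events')
    # Run an aggregation with Spark SQL and return the resulting DataFrame
    return spark.sql('SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id')
```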
Access the Spark session in your blocks through the `spark` keyword argument that the executor passes to block functions.
Security Configuration
EMR Security Groups
To access EMR clusters from Mage, configure security group inbound rules:

| Type | Protocol | Port Range | Source |
|---|---|---|---|
| Custom TCP | TCP | 8998 | Your IP/Security Group |
| SSH | TCP | 22 | Your IP (if using SSH tunneling) |
IAM Permissions
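A hedged starting-point policy covering the S3, EMR, and EC2 access described under Prerequisites — tighten the resources and action lists for production:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["elasticmapreduce:*"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["ec2:RunInstances", "ec2:DescribeInstances", "ec2:DescribeSecurityGroups"],
      "Resource": "*"
    }
  ]
}
```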
Your Mage instance needs IAM permissions covering S3 read/write, EMR cluster management, and EC2 instance creation.
Troubleshooting
Cluster provisioning fails
- Verify AWS credentials are configured correctly
- Check EC2 instance quota limits for requested instance types
- Ensure S3 bucket exists and is accessible
- Verify security group IDs are correct
Job submission hangs or fails
- Check security group rules allow access from Mage to EMR (port 8998)
- Verify S3 bucket permissions for script upload
- Review EMR cluster logs in S3
- Ensure JAR files in `spark_jars` are accessible
Out of memory errors
- Increase instance types (e.g., r5.4xlarge → r5.8xlarge)
- Adjust `spark.driver.memory` and `spark.executor.memory` settings
- Increase `slave_instance_count` for more distributed memory
- Configure auto-scaling with `scaling_policy`
Best Practices
- Resource Sizing: Start with smaller instances and scale up based on actual usage
- Auto-Scaling: Use `scaling_policy` to optimize costs and handle variable workloads
- S3 Optimization: Store intermediate results in S3 with partitioning for better performance
- Custom JARs: Pre-upload frequently used JAR files to S3 and reference them in `spark_jars`
- Bootstrap Scripts: Use bootstrap scripts for cluster-wide dependencies and configurations
- Monitoring: Enable CloudWatch metrics and configure log aggregation in S3
Related Resources
PySpark Executor
Legacy Spark executor documentation
Compute Overview
Overview of all compute integrations
AWS EMR
AWS EMR official documentation
PySpark API
Apache Spark PySpark API reference