The Kubernetes executor enables running Mage pipelines and individual blocks as Kubernetes Jobs, providing resource isolation, scalability, and fine-grained control over compute resources.
Overview
Mage’s Kubernetes integration creates dynamic Kubernetes Jobs for pipeline runs and block executions. Each job runs in an isolated container with configurable resources, allowing you to:
Scale pipeline execution across your Kubernetes cluster
Isolate resource-intensive blocks
Configure CPU, memory, and GPU resources per block
Use node selectors, affinities, and tolerations
Manage secrets and environment variables
The K8s executor is implemented in mage_ai/data_preparation/executors/k8s_block_executor.py and k8s_pipeline_executor.py.
Prerequisites
Kubernetes Cluster
You need a running Kubernetes cluster with:
kubectl configured and authenticated
Sufficient resources for job execution
Network access between Mage pods and worker jobs
Mage Deployed on Kubernetes
Mage must be running as a pod in the Kubernetes cluster with:
Service account with job creation permissions
RBAC roles configured
Access to shared storage (PVC or volume mounts)
Install Dependencies
The Kubernetes client library is included in the `all` extras:

```bash
pip install "mage-ai[all]"
```

This installs `kubernetes==33.1.0`.
Kubernetes Setup
RBAC Configuration
Create the necessary RBAC resources (from kube/app.yaml):
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mage-user
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: job-manager
rules:
- apiGroups: [""]  # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs", "jobs/status"]
  verbs: ["create", "delete", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mage-job-manager
  namespace: default
subjects:
- kind: ServiceAccount
  name: mage-user
  namespace: default
roleRef:
  kind: Role
  name: job-manager
  apiGroup: rbac.authorization.k8s.io
```
Apply the configuration:

```bash
kubectl apply -f kube/app.yaml
```
Deploy Mage Pod
kube/app.yaml (continued)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mage-server
spec:
  containers:
  - name: mage-server
    image: mageai/mageai:latest
    ports:
    - containerPort: 6789
    volumeMounts:
    - name: mage-fs
      mountPath: /home/src
    env:
    - name: KUBE_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
  volumes:
  - name: mage-fs
    hostPath:
      path: /path/to/mage_project
  serviceAccountName: mage-user
```
Using hostPath is suitable for single-node development clusters. For production, use PersistentVolumeClaims (PVC) or cloud storage.
Configuration
Project-Level Configuration
Configure Kubernetes executor defaults in your project’s metadata.yaml:
```yaml
# Set K8s as the default executor
executor_type: k8s

# Kubernetes executor configuration
k8s_executor_config:
  # Job name prefix (supports template variables)
  job_name_prefix: "mage-job"

  # Namespace for jobs
  namespace: "default"

  # Service account for jobs
  service_account_name: "mage-user"

  # Resource requests (minimum guaranteed resources)
  resource_requests:
    cpu: "500m"
    memory: "512Mi"

  # Resource limits (maximum allowed resources)
  resource_limits:
    cpu: "2000m"
    memory: "2Gi"

  # Container configuration
  container_config:
    name: "mage-job-container"
    image: "mageai/mageai:latest"
    image_pull_policy: "IfNotPresent"
```
Advanced Configuration
For more granular control, use the detailed configuration format (from mage_ai/services/k8s/config.py):
Full Configuration
```yaml
k8s_executor_config:
  # Job metadata
  metadata:
    namespace: "default"
    labels:
      app: "mage"
      environment: "production"
    annotations:
      description: "Mage pipeline execution job"

  # Container spec
  container:
    name: "mage-job-container"
    image: "mageai/mageai:latest"
    image_pull_policy: "IfNotPresent"

    # Environment variables
    env:
    - name: "ENV_VAR_NAME"
      value: "value"
    - name: "SECRET_VAR"
      valueFrom:
        secretKeyRef:
          name: "mage-secrets"
          key: "api-key"

    # Volume mounts
    volume_mounts:
    - name: "mage-data"
      mountPath: "/home/src"
    - name: "config"
      mountPath: "/app/config"
      readOnly: true

    # Resource requirements
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "4000m"
        memory: "8Gi"

  # Pod spec
  pod:
    service_account_name: "mage-user"

    # Node selector
    node_selector:
      workload-type: "data-processing"
      instance-type: "memory-optimized"

    # Tolerations
    tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "data-jobs"
      effect: "NoSchedule"

    # Affinity rules
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: "kubernetes.io/arch"
              operator: "In"
              values:
              - "amd64"
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: "app"
                operator: "In"
                values:
                - "mage"
            topologyKey: "kubernetes.io/hostname"

    # Scheduler name
    scheduler_name: "default-scheduler"

    # Volumes
    volumes:
    - name: "mage-data"
      persistentVolumeClaim:
        claimName: "mage-pvc"
    - name: "config"
      configMap:
        name: "mage-config"

    # Image pull secrets
    image_pull_secrets: "docker-registry-secret"

  # Job spec
  job:
    # Maximum time for the job to complete (in seconds)
    active_deadline_seconds: 3600

    # Number of retries before marking the job as failed
    backoff_limit: 2

    # TTL for job cleanup after completion (in seconds)
    ttl_seconds_after_finished: 86400
```
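Most deployments do not need every field above. A minimal sketch that relies on defaults for everything else (the image and resource values here are illustrative, not required):

```yaml
k8s_executor_config:
  namespace: "default"
  container:
    image: "mageai/mageai:latest"
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
```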
Pipeline-Level Configuration
Override configuration for specific pipelines:
pipelines/my_pipeline/metadata.yaml
```yaml
executor_type: k8s
executor_config:
  resource_requests:
    cpu: "2000m"
    memory: "4Gi"
  resource_limits:
    cpu: "8000m"
    memory: "16Gi"

  # Use a template variable for the namespace
  namespace: "{{ env_var('K8S_NAMESPACE') }}"

  # Dynamic job name based on the trigger
  job_name_prefix: "pipeline-{trigger_name}"
```
The `job_name_prefix` supports the template variable `{trigger_name}`, which is replaced with the name of the trigger at runtime.
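The substitution can be sketched in a few lines; a plain `str.replace` stands in here for Mage's actual rendering, and the trigger name is an example value:

```python
def render_job_name_prefix(prefix: str, trigger_name: str) -> str:
    # Substitute the {trigger_name} placeholder; a simplified stand-in
    # for Mage's actual template rendering.
    return prefix.replace('{trigger_name}', trigger_name)

print(render_job_name_prefix('pipeline-{trigger_name}', 'daily_etl'))
# pipeline-daily_etl
```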
Block-Level Configuration
Configure individual blocks with specific resource requirements:
```yaml
executor_type: k8s
executor_config:
  # A heavy computation block needs more resources
  resource_requests:
    cpu: "4000m"
    memory: "8Gi"
  resource_limits:
    cpu: "8000m"
    memory: "16Gi"

  # Run on GPU nodes
  container:
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
  pod:
    node_selector:
      node-type: "gpu"
```
Execution Flow
When a block executes with the K8s executor, the following happens (from k8s_block_executor.py:30-60):
Load Configuration
Merge configurations from the project, pipeline, and block levels:

```python
executor_config_dict = pipeline.repo_config.k8s_executor_config or dict()
if block.executor_config is not None:
    executor_config_dict = merge_dict(
        executor_config_dict,
        block.executor_config,
    )
```
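The merge gives more specific levels precedence, so block-level keys win over project-level ones. A minimal sketch of that behavior, assuming a shallow merge (this is not Mage's actual implementation):

```python
def merge_dict(base: dict, override: dict) -> dict:
    # Later (block-level) values override earlier (project-level) ones;
    # a shallow merge, assumed here for illustration.
    merged = dict(base)
    merged.update(override)
    return merged

project_config = {'namespace': 'default', 'resource_requests': {'cpu': '500m'}}
block_config = {'resource_requests': {'cpu': '4000m', 'memory': '8Gi'}}
print(merge_dict(project_config, block_config))
# {'namespace': 'default', 'resource_requests': {'cpu': '4000m', 'memory': '8Gi'}}
```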
Create Job Name
Generate a unique job name:

```python
job_name = f'{MAGE_CLUSTER_UUID}-{job_name_prefix}-block-{block_run_id}'
```
Build Job Spec
Create Kubernetes Job specification with:
Pod template from Mage server pod
Command to execute the block
Resource requirements
Environment variables
Volume mounts
Submit Job
Create the job in Kubernetes:

```python
batch_api_client.create_namespaced_job(
    body=job,
    namespace=namespace,
)
```
Monitor Execution
Poll the job status every 5 seconds until completion:

```python
while not job_completed:
    api_response = batch_api_client.read_namespaced_job(
        name=job_name,
        namespace=namespace,
    )
    if api_response.status.succeeded or api_response.status.failed:
        job_completed = True
    time.sleep(5)
```
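A deadline can be layered on top of this loop so a hung job eventually errors instead of blocking forever. A self-contained sketch, where `read_status` is a stand-in for the `read_namespaced_job` call and the injected `sleep` makes it testable:

```python
import time

def wait_for_job(read_status, poll_interval=5, timeout=3600, sleep=time.sleep):
    """Poll until the job succeeds, fails, or the timeout elapses."""
    waited = 0
    while waited < timeout:
        status = read_status()
        if status.get('succeeded'):
            return 'succeeded'
        if status.get('failed'):
            return 'failed'
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError('job did not finish within the deadline')

# Simulate a job that succeeds on the third poll.
polls = iter([{}, {}, {'succeeded': 1}])
print(wait_for_job(lambda: next(polls), sleep=lambda _: None))
# succeeded
```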
Cleanup
Delete the job after completion (controlled by ttl_seconds_after_finished)
Job Manager
The K8s Job Manager handles job lifecycle (from mage_ai/services/k8s/job_manager.py):
Configuration Inheritance
Jobs automatically inherit configuration from the Mage server pod:
```python
def merge_pod_spec(pod_spec, command):
    # Inherit from the Mage server pod:
    # - Environment variables
    # - Volume mounts
    # - Tolerations (if not overridden)
    # - Node selector (if not overridden)
    # - Image pull secrets
    # - Scheduler name
    mage_server_pod_spec = pod_config.spec
    pod_spec.volumes.extend(mage_server_pod_spec.volumes)
    if not pod_spec.tolerations:
        pod_spec.tolerations = mage_server_pod_spec.tolerations
```
Volume Optimization
The job manager automatically filters volumes to only include those referenced by volume mounts:
```python
def filter_used_volumes(pod_spec):
    used_volume_names = {vm.name for vm in pod_spec.containers[0].volume_mounts}
    pod_spec.volumes = [vol for vol in pod_spec.volumes if vol.name in used_volume_names]
```
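The same filtering logic can be exercised with plain objects; in this sketch the dataclasses are stand-ins for the Kubernetes client models, not the real `V1PodSpec` types:

```python
from dataclasses import dataclass

@dataclass
class Mount:
    name: str

@dataclass
class Container:
    volume_mounts: list

@dataclass
class Volume:
    name: str

@dataclass
class PodSpec:
    containers: list
    volumes: list

def filter_used_volumes(pod_spec):
    # Keep only the volumes referenced by the first container's mounts.
    used = {vm.name for vm in pod_spec.containers[0].volume_mounts}
    pod_spec.volumes = [v for v in pod_spec.volumes if v.name in used]

spec = PodSpec(
    containers=[Container(volume_mounts=[Mount('mage-data')])],
    volumes=[Volume('mage-data'), Volume('unused-config')],
)
filter_used_volumes(spec)
print([v.name for v in spec.volumes])
# ['mage-data']
```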
Environment Variables
The Mage server pod must have these environment variables:
```yaml
env:
- name: KUBE_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: KUBE_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
```
These are used to discover the current pod configuration and inherit settings.
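Inside the container, those downward-API fields surface as ordinary environment variables. A sketch of how a job manager might read them (the fallback defaults here are illustrative, not Mage's actual behavior):

```python
import os

def current_pod_identity():
    # Values injected by the Kubernetes downward API; the defaults
    # are illustrative fallbacks for local runs outside a cluster.
    return (
        os.environ.get('KUBE_POD_NAME', 'unknown-pod'),
        os.environ.get('KUBE_NAMESPACE', 'default'),
    )

os.environ['KUBE_NAMESPACE'] = 'data-pipelines'  # simulate the injected value
pod_name, namespace = current_pod_identity()
print(namespace)
# data-pipelines
```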
Storage Configuration
For production deployments, use PersistentVolumeClaims:
PVC Definition
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mage-pvc
spec:
  accessModes:
  - ReadWriteMany  # Required for multiple pods
  resources:
    requests:
      storage: 100Gi
  storageClassName: "standard"  # Or your preferred storage class
```
The PVC must use ReadWriteMany (RWX) access mode since both the Mage server pod and job pods need to access the same files.
Resource Management
CPU and Memory
Configure resources using Kubernetes resource syntax:
```yaml
resources:
  requests:
    cpu: "1000m"   # 1 CPU core (1000 millicores)
    memory: "2Gi"  # 2 GiB of memory
  limits:
    cpu: "4000m"   # Maximum of 4 CPU cores
    memory: "8Gi"  # Maximum of 8 GiB of memory
```
Units:
CPU: millicores (m) or cores (e.g., 500m = 0.5 cores, 2 = 2 cores)
Memory: Mi (mebibytes), Gi (gibibytes), M (megabytes), G (gigabytes)
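A small helper makes the unit arithmetic concrete; this sketch converts CPU strings to cores and memory strings to bytes, covering only the suffixes mentioned above:

```python
def parse_cpu(value: str) -> float:
    # "500m" -> 0.5 cores; "2" -> 2.0 cores
    if value.endswith('m'):
        return int(value[:-1]) / 1000
    return float(value)

def parse_memory(value: str) -> int:
    # Binary (Mi, Gi) and decimal (M, G) suffixes, converted to bytes.
    factors = {'Mi': 1024**2, 'Gi': 1024**3, 'M': 10**6, 'G': 10**9}
    for suffix, factor in factors.items():
        if value.endswith(suffix):
            return int(value[:-len(suffix)]) * factor
    return int(value)

print(parse_cpu('500m'))    # 0.5
print(parse_memory('2Gi'))  # 2147483648
```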
GPU Resources
For GPU-accelerated workloads:
```yaml
k8s_executor_config:
  container:
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
  pod:
    node_selector:
      accelerator: "nvidia-tesla-v100"
```
Node Selection
Control which nodes run your jobs:
Node Selector
```yaml
pod:
  node_selector:
    workload-type: "data-processing"
    instance-type: "memory-optimized"
```
Namespace Configuration
The namespace can be configured with template variables:
```yaml
k8s_executor_config:
  # Use an environment variable
  namespace: "{{ env_var('K8S_NAMESPACE') }}"

  # Or use global variables at runtime;
  # the namespace is rendered with the pipeline's global_vars
```
The namespace is resolved at runtime (from k8s_block_executor.py:38-44):
```python
if self.executor_config.namespace:
    namespace = Template(self.executor_config.namespace).render(
        variables=lambda x: global_vars.get(x) if global_vars else None,
        **get_template_vars(),
    )
else:
    namespace = DEFAULT_NAMESPACE
```
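The effect of `{{ env_var('K8S_NAMESPACE') }}` can be shown without the template engine; this is a simplified regex stand-in, not Mage's actual Jinja rendering:

```python
import os
import re

def render_namespace(template: str, default: str = 'default') -> str:
    # Replace {{ env_var('NAME') }} with the environment variable's value;
    # a simplified stand-in for Mage's actual template rendering.
    def substitute(match):
        return os.environ.get(match.group(1), '')
    rendered = re.sub(r"\{\{\s*env_var\('([^']+)'\)\s*\}\}", substitute, template)
    return rendered or default

os.environ['K8S_NAMESPACE'] = 'data-pipelines'
print(render_namespace("{{ env_var('K8S_NAMESPACE') }}"))
# data-pipelines
```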
Secrets Management
Pass secrets to jobs using Kubernetes Secrets:
Create Secret
```bash
kubectl create secret generic mage-secrets \
  --from-literal=api-key=your-secret-key \
  --from-literal=db-password=your-password
```
Reference in Configuration
```yaml
k8s_executor_config:
  container:
    env:
    - name: API_KEY
      valueFrom:
        secretKeyRef:
          name: mage-secrets
          key: api-key
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mage-secrets
          key: db-password
```
Monitoring and Debugging
View Job Status
```bash
# List all jobs
kubectl get jobs -n default

# Describe a specific job
kubectl describe job <job-name> -n default

# View job logs
kubectl logs job/<job-name> -n default
```
Common Issues
Job creation fails with permission denied
Verify the RBAC configuration:

```bash
# Check the service account
kubectl get serviceaccount mage-user -n default

# Check the role binding
kubectl get rolebinding mage-job-manager -n default

# Verify permissions
kubectl auth can-i create jobs --as=system:serviceaccount:default:mage-user
```
Job pods stuck in Pending state
Common causes:
Insufficient cluster resources
Node selector not matching any nodes
PVC not bound
Image pull errors
Debug with:

```bash
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
```
For volume-related failures, ensure:
PVC exists and is bound
Access mode is ReadWriteMany
Storage class supports RWX
Volume mounts reference existing volumes
Job succeeds but block fails
Check:
Job logs: `kubectl logs job/<job-name>`
Exit code: `kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'`
Mage server logs for job manager errors
Best Practices
Resource Requests vs Limits
Set requests to typical usage for better scheduling
Set limits higher to handle peaks without OOM kills
Monitor actual usage and adjust accordingly
Job Cleanup
Configure ttl_seconds_after_finished to auto-delete completed jobs
Set to 86400 (24 hours) for debugging, lower for production
Use backoff_limit: 0 to disable automatic retries when they aren't wanted
Storage Strategy
Use PVCs with RWX for shared storage
Consider NFS, EFS (AWS), or Filestore (GCP) for multi-pod access
Use separate PVCs for data vs. code if possible
Node Selection
Use node selectors for predictable placement
Use taints and tolerations for dedicated node pools
Consider pod affinity/anti-affinity for performance
Security
Use separate service accounts for different pipeline types
Implement RBAC with least privilege
Store credentials in Kubernetes Secrets
Use Pod Security Policies or Pod Security Standards
Monitoring
Enable resource metrics: kubectl top pods
Set up alerts for job failures
Monitor job completion times
Track resource utilization vs. requests/limits
Production Deployment Example
Complete production-ready configuration:
```yaml
executor_type: k8s
k8s_executor_config:
  job_name_prefix: "mage-{trigger_name}"

  metadata:
    namespace: "data-pipelines"
    labels:
      app: "mage"
      environment: "production"
      team: "data"

  container:
    image: "your-registry/mageai:v1.0.0"
    image_pull_policy: "Always"
    env:
    - name: "ENVIRONMENT"
      value: "production"
    - name: "DB_PASSWORD"
      valueFrom:
        secretKeyRef:
          name: "mage-db-secrets"
          key: "password"
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "4000m"
        memory: "8Gi"
    volume_mounts:
    - name: "mage-data"
      mountPath: "/home/src"

  pod:
    service_account_name: "mage-production"
    node_selector:
      workload-type: "data-processing"
    tolerations:
    - key: "data-pipelines"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
    volumes:
    - name: "mage-data"
      persistentVolumeClaim:
        claimName: "mage-data-pvc"
    image_pull_secrets: "docker-registry"

  job:
    active_deadline_seconds: 7200
    backoff_limit: 1
    ttl_seconds_after_finished: 3600
```