
Why Orchestration Matters

ML systems are more than just training scripts. Production pipelines involve:
  • Data ingestion from multiple sources
  • Preprocessing and feature engineering
  • Training (often with hyperparameter sweeps)
  • Evaluation and model comparison
  • Deployment to staging/production
  • Monitoring and retraining triggers
Doing this manually is error-prone and doesn’t scale. Orchestration platforms automate these workflows, handle failures, and provide observability.

What is a DAG?

Orchestration tools represent workflows as Directed Acyclic Graphs (DAGs):
load_data → preprocess → train ──→ evaluate → deploy
                            ↘         ↗
                      hyperparameter_search
Each node is a task. Edges represent dependencies. The scheduler ensures tasks run in the right order and retries failures.
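The ordering guarantee can be sketched in plain Python with the standard library's graphlib; the task names below mirror the diagram above (a toy model of a scheduler, not any particular platform's implementation):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline matching the diagram: each key maps a task to its dependencies.
dag = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "train": {"preprocess"},
    "hyperparameter_search": {"train"},
    "evaluate": {"train", "hyperparameter_search"},
    "deploy": {"evaluate"},
}

# A topological sort yields each task only after all of its dependencies,
# which is exactly the guarantee a scheduler provides.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because the graph is acyclic, such an ordering always exists; a cycle (e.g. `deploy` depending on itself through `load_data`) would make the workflow unschedulable, which is why orchestrators reject cyclic graphs.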

Platform Comparison

Airflow

Best for: General-purpose workflows, batch jobs
Pros: Battle-tested, huge ecosystem, Python-based
Cons: UI can be clunky, requires careful resource management

Kubeflow Pipelines

Best for: Kubernetes-native ML workflows
Pros: Strong artifact tracking, integrates with K8s, UI for pipeline visualization
Cons: Complex setup, Kubernetes required

Dagster

Best for: Data pipelines, modern developer experience
Pros: Software-defined assets, type system, great UI, testing support
Cons: Newer (less mature), smaller community
For greenfield ML projects, Dagster offers the best developer experience. For existing systems, Airflow is the safest bet. If you’re all-in on Kubernetes, Kubeflow provides deep integration.

Apache Airflow

Airflow is the most popular orchestration tool. DAGs are Python code:
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from datetime import datetime

with DAG(
    'training_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',  # 'schedule_interval' is deprecated since Airflow 2.4
    catchup=False,      # don't auto-run every missed interval since start_date
) as dag:
    
    load_data = KubernetesPodOperator(
        task_id='load_data',
        image='myorg/data-loader:v1',
        namespace='default',
    )
    
    train_model = KubernetesPodOperator(
        task_id='train_model',
        image='myorg/trainer:v1',
        namespace='default',
    )
    
    load_data >> train_model  # Define dependency
Key features:
  • KubernetesPodOperator: Run each task as a K8s pod (full isolation)
  • Scheduling: Cron-like schedules or event triggers
  • Retries: Automatic retry with exponential backoff
  • Backfilling: Rerun past intervals easily
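The retry behavior is worth internalizing: with exponential backoff enabled, each retry roughly doubles the wait, capped at a maximum delay. A standalone sketch of that schedule (illustrative, not Airflow's exact internal formula):

```python
def backoff_delays(base_seconds: int, max_seconds: int, retries: int) -> list:
    """Wait (in seconds) before each retry: doubles per attempt, capped at max_seconds."""
    return [min(base_seconds * 2 ** attempt, max_seconds) for attempt in range(retries)]

# With a 60s base delay, a 600s cap, and 5 retries:
print(backoff_delays(60, 600, 5))  # [60, 120, 240, 480, 600]
```

In Airflow itself this corresponds to setting `retries`, `retry_delay`, `retry_exponential_backoff=True`, and `max_retry_delay` on a task or in `default_args`.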
For production, use Astronomer (managed Airflow) or MWAA (AWS Managed Workflows for Apache Airflow) to avoid infrastructure headaches.

Passing Data Between Tasks

Airflow has two approaches:
  1. XComs (small data): Pass JSON-serializable values
    task_instance.xcom_push(key='model_id', value='bert-v2')
    model_id = task_instance.xcom_pull(key='model_id')
    
  2. Artifacts (large data): Write to S3/GCS, pass path via XCom
    # In first task
    save_to_s3(df, 's3://bucket/processed.parquet')
    task_instance.xcom_push(key='data_path', value='s3://bucket/processed.parquet')
    
    # In second task
    path = task_instance.xcom_pull(key='data_path')
    df = load_from_s3(path)
    
Never pass large datasets through XComs. Always use object storage (S3/GCS/MinIO) for intermediate data.
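The shape of that pattern is easy to see without running Airflow at all. In this sketch, two dicts stand in for the object store and the XCom table, and the helper names (`first_task`, `second_task`) are hypothetical:

```python
object_store = {}  # stands in for S3/GCS/MinIO
xcom = {}          # stands in for Airflow's XCom table

def first_task():
    processed = [1, 2, 3]                       # pretend this is a large DataFrame
    path = "s3://bucket/processed.parquet"
    object_store[path] = processed              # save_to_s3(df, path): big data to storage
    xcom["data_path"] = path                    # xcom_push: only the small path crosses tasks

def second_task():
    path = xcom["data_path"]                    # xcom_pull
    return object_store[path]                   # load_from_s3(path)

first_task()
print(second_task())  # [1, 2, 3]
```

The key property: the XCom side only ever holds a short string, no matter how large the dataset grows.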

Kubeflow Pipelines

Kubeflow is Kubernetes-native and provides strong artifact tracking:
from kfp import dsl
from kfp.dsl import Input, Output, Dataset, Model

@dsl.component
def preprocess_data(output_data: Output[Dataset]):
    # Load raw data and write the processed dataset to output_data.path
    pass

@dsl.component
def train_model(input_data: Input[Dataset], model: Output[Model]):
    # Training logic
    pass

@dsl.pipeline(name='Training Pipeline')
def training_pipeline():
    preprocess = preprocess_data()
    train = train_model(input_data=preprocess.outputs['output_data'])

from kfp import compiler
compiler.Compiler().compile(training_pipeline, 'pipeline.yaml')
Advantages:
  • Artifact tracking: Inputs/outputs are first-class citizens
  • Lineage: Track which data produced which model
  • UI: Visualize DAG and inspect artifacts
  • Vertex AI: Google Cloud offers managed Kubeflow
Kubeflow uses Argo Workflows under the hood for scheduling. Each component runs in its own container.

Dagster

Dagster treats data as assets (tables, models, reports) instead of tasks:
from dagster import asset, AssetExecutionContext
import pandas as pd

@asset
def raw_data() -> pd.DataFrame:
    return pd.read_csv('s3://bucket/data.csv')

@asset
def processed_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    return raw_data.dropna()

@asset
def trained_model(processed_data: pd.DataFrame) -> str:
    model = train(processed_data)
    model_path = 's3://bucket/model.pkl'
    save(model, model_path)
    return model_path
Why assets?
  • Dependencies are inferred from function signatures
  • Type checking catches errors at compile time
  • Easier to test (just call the function!)
  • Built-in data quality checks
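"Easier to test" is literal: an asset body is a plain function, so a unit test calls it with an in-memory DataFrame. A sketch mirroring the `processed_data` asset above (pandas only, no Dagster runtime needed; the test name is hypothetical):

```python
import pandas as pd

def processed_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    # Same body as the @asset above; decoration doesn't change how you call it in tests.
    return raw_data.dropna()

def test_processed_data_drops_nulls():
    raw = pd.DataFrame({"x": [1.0, None, 3.0]})
    out = processed_data(raw)
    assert out["x"].tolist() == [1.0, 3.0]

test_processed_data_drops_nulls()
```

Compare this with testing an Airflow operator, where you typically need a DagBag, a task instance, and an execution context just to exercise one transformation.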
Asset checks:
from dagster import asset_check, AssetCheckResult

@asset_check(asset=processed_data)
def no_nulls(processed_data: pd.DataFrame) -> AssetCheckResult:
    num_nulls = int(processed_data.isnull().sum().sum())  # cast numpy int for metadata
    return AssetCheckResult(passed=num_nulls == 0, metadata={'num_nulls': num_nulls})
Dagster’s software-defined assets are a paradigm shift. Instead of “run this task”, you think “materialize this asset”. This makes pipelines more declarative and testable.

Choosing a Platform

Criteria            Airflow      Kubeflow           Dagster
Maturity            Very high    High               Medium
Learning curve      Medium       High               Low
K8s required        No           Yes                No
Artifact tracking   Manual       Built-in           Built-in
Testing             Hard         Medium             Easy
Best for            Batch ETL    K8s ML workflows   Data pipelines
Don’t over-engineer early. Start with a simple Makefile or shell script. Graduate to orchestration when you have >5 tasks or need scheduling.
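At the "simple script" stage, a pipeline can be a list of functions run in order, failing fast. A minimal sketch (step names are hypothetical; `log` replaces real side effects so the run is observable):

```python
import sys

log = []

def load_data():  log.append("load_data")
def preprocess(): log.append("preprocess")
def train():      log.append("train")

STEPS = [load_data, preprocess, train]

def run_pipeline():
    for step in STEPS:
        try:
            step()
        except Exception as exc:
            # Fail fast with the step name; retries, scheduling, and
            # observability are exactly what an orchestrator adds later.
            sys.exit(f"step {step.__name__} failed: {exc}")

run_pipeline()
print(log)
```

When this script grows conditional branches, parallel steps, or a cron entry, that is the signal to graduate to one of the platforms above.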

Integration with Training

All platforms integrate with experiment tracking:
# In your training task/component
import wandb

wandb.init(project='my-project', tags=['pipeline', 'production'])
train_model()
# run_id would come from the orchestrator's context; log_artifact expects a
# saved file path (or a wandb.Artifact), not the in-memory model object
wandb.log_artifact('model.pt', name=f'model-{run_id}', type='model')
The orchestrator provides context (run ID, pipeline name) that you can pass to W&B/MLflow for lineage. For simpler workflows, Modal offers serverless functions:
import modal

app = modal.App('training-pipeline')  # modal.Stub was renamed to modal.App

@app.function(gpu='A10G', timeout=3600)
def train():
    # Training code
    pass

@app.local_entrypoint()
def main():
    train.remote()  # Runs on Modal's infrastructure
Modal handles containers, scheduling, and scaling. Great for prototypes and small teams.

Hands-On Examples

Explore orchestration in Module 4:
  • Airflow DAGs for training and inference
  • Kubeflow Pipelines with artifact tracking
  • Dagster asset-based workflows
  • Deploying Modal functions from pipelines

Next Steps

Model Serving

Deploy models from pipelines

Monitoring

Track pipeline health
