Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines. It provides a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends.

What is Apache Beam?

Apache Beam evolved from several internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. The model provides a general approach to expressing embarrassingly parallel data processing pipelines that work seamlessly across both batch and streaming data sources.

Quickstart

Get started with Apache Beam in minutes with hands-on examples in Java, Python, and Go

Core concepts

Learn about PCollections, PTransforms, Pipelines, and PipelineRunners

SDKs

Explore language-specific SDKs for Java, Python, Go, and TypeScript

Runners

Execute pipelines on Flink, Spark, Dataflow, and other distributed backends

Examples

Browse comprehensive examples including WordCount, streaming, and ML pipelines

API Reference

Detailed API documentation for Java, Python, and Go SDKs

I/O Connectors

Connect to various data sources and sinks with built-in I/O transforms

Key features

Unified batch and streaming

Write your pipeline logic once and run it on both batch and streaming data sources. Beam’s unified model eliminates the need to maintain separate codebases for batch and streaming processing.

Multi-language SDKs

Beam currently provides SDKs for:
  • Java: Full-featured SDK with extensive ecosystem support
  • Python: Pythonic API with support for data science workflows
  • Go: Idiomatic Go SDK for high-performance pipelines
  • TypeScript: JavaScript/TypeScript SDK for web and Node.js environments

Portable pipelines

Run the same pipeline on multiple execution engines without code changes. Beam supports:
  • DirectRunner: Execute locally for development and testing
  • DataflowRunner: Run on Google Cloud Dataflow
  • FlinkRunner: Execute on Apache Flink clusters
  • SparkRunner: Run on Apache Spark clusters
  • PrismRunner: Local execution using Beam Portability

Core programming model

Beam pipelines are built using four key concepts:
1. PCollection

Represents a distributed dataset that can be bounded (batch) or unbounded (streaming). PCollections are immutable and can contain elements of any type.
2. PTransform

A data processing operation that takes one or more PCollections as input and produces one or more PCollections as output. Common transforms include ParDo, GroupByKey, and Combine.
3. Pipeline

A directed acyclic graph (DAG) of PTransforms and PCollections that defines your entire data processing workflow. Pipelines are constructed programmatically using the SDK.
4. PipelineRunner

Executes your Pipeline on a specific distributed processing backend. The runner translates your Beam pipeline into the appropriate API calls for the target execution engine.

Getting help

The Apache Beam community is active and helpful.

Next steps

Try the quickstart

Build and run your first Apache Beam pipeline in less than 5 minutes
