This quickstart will guide you through installing Apache Beam and running your first WordCount pipeline. WordCount is a classic example that demonstrates core Beam concepts: reading data, applying transformations, and writing results.

Install Python SDK

Apache Beam requires Python 3.8 or newer. Install the Beam SDK using pip:
pip install apache-beam
For Google Cloud Dataflow support, install with the gcp extra:
pip install apache-beam[gcp]
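Since Beam's setup will fail on older interpreters, a quick check of the running interpreter can save a confusing install error. This is a convenience sketch using only the standard library, not part of the Beam docs:

```python
import sys

# Apache Beam requires Python 3.8 or newer; fail fast with a clear message.
version = ".".join(map(str, sys.version_info[:3]))
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {version}"
print(f"Python {version} meets the Beam requirement")
```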

Create your pipeline

Create a file named wordcount.py with the following code:
import re
import argparse
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        default='gs://dataflow-samples/shakespeare/kinglear.txt',
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        default='output.txt',
        help='Output file to write results to.')

    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file into a PCollection of lines
        lines = p | ReadFromText(known_args.input)

        # Count the occurrences of each word
        counts = (
            lines
            | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum))

        # Format the counts into strings
        output = counts | 'Format' >> beam.Map(
            lambda word_count: '%s: %s' % (word_count[0], word_count[1]))

        # Write the output
        output | WriteToText(known_args.output)

if __name__ == '__main__':
    main()

Run the pipeline

Execute the pipeline locally using the DirectRunner:
python wordcount.py --output output.txt
This will process Shakespeare’s King Lear and write word counts to output.txt-00000-of-00001.

What’s happening?

1. Read input: ReadFromText reads lines from the input file into a PCollection of strings.
2. Split into words: FlatMap applies a regex to extract individual words from each line.
3. Create key-value pairs: Map transforms each word into a tuple of (word, 1).
4. Count occurrences: CombinePerKey groups by word and sums the counts.
5. Format output: another Map formats each word-count pair as a string.
6. Write results: WriteToText writes the formatted results to output files.
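The six steps above can be sketched in plain Python (no Beam involved, purely illustrative) to show what the transform chain computes on a small input:

```python
import re
from collections import Counter

def wordcount(lines):
    """Plain-Python equivalent of the Beam transform chain above."""
    # Split: extract words from each line (the FlatMap step).
    words = [w for line in lines for w in re.findall(r"[A-Za-z']+", line)]
    # PairWithOne + GroupAndSum: count occurrences per word (CombinePerKey).
    counts = Counter(words)
    # Format: render each pair as "word: count" (the final Map step).
    return [f"{word}: {count}" for word, count in counts.items()]
```

For example, `wordcount(["to be or not to be"])` yields the lines `to: 2`, `be: 2`, `or: 1`, and `not: 1`. The Beam version does the same work, but distributed across workers.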

Running on other runners

By default, pipelines run locally using the DirectRunner. To run on a distributed backend:
python wordcount.py \
  --runner=DataflowRunner \
  --project=YOUR_PROJECT_ID \
  --region=us-central1 \
  --temp_location=gs://YOUR_BUCKET/temp \
  --output=gs://YOUR_BUCKET/output
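The same flags can also be assembled in code and handed to `main(argv)` from wordcount.py, since it uses `parse_known_args`. The values below are placeholders, and actually launching requires apache-beam[gcp] and Google Cloud credentials:

```python
# Placeholder values; substitute your own project, region, and bucket.
flags = [
    "--runner=DataflowRunner",
    "--project=YOUR_PROJECT_ID",
    "--region=us-central1",
    "--temp_location=gs://YOUR_BUCKET/temp",
    "--output=gs://YOUR_BUCKET/output",
]
# main(flags)  # uncomment to launch; needs apache-beam[gcp] and credentials
```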

Next steps

Programming guide

Learn about advanced transforms, windowing, and triggers

Pipeline I/O

Connect to databases, message queues, and cloud storage

Testing pipelines

Write unit and integration tests for your pipelines

Tour of Beam

Interactive tutorials covering all Beam concepts
