Introduction

Apache Beam describes itself as follows:

Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org

The aim of this book is to dissect that description and shed more light on Beam's internals.

The key concepts in the Beam programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.

  • PTransform: represents a computation that transforms input PCollections into output PCollections.

  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.

  • PipelineRunner: specifies where and how the pipeline should execute.
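To make the division of labour between these four concepts concrete, here is a minimal sketch in plain Python. It is a toy model and not the Beam SDK: the names ToyPCollection, ToyPTransform, ToyPipeline and ToyDirectRunner are hypothetical, invented for this illustration, and the graph is simplified to a linear chain (a real pipeline is a general directed acyclic graph in which transforms may have several inputs and outputs). The point it tries to show is that applying a transform only describes work, and that nothing is computed until a runner executes the graph.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# NOTE: every class below is a hypothetical toy, not part of Apache Beam.


@dataclass
class ToyPTransform:
    """A computation that turns an input collection into an output collection."""
    label: str
    fn: Callable[[Iterable], Iterable]


@dataclass
class ToyPCollection:
    """A graph node: a dataset that is described but not yet computed."""
    pipeline: "ToyPipeline"
    producer: ToyPTransform                      # transform that produces this collection
    parent: Optional["ToyPCollection"] = None    # input of that transform, if any

    def apply(self, transform: ToyPTransform) -> "ToyPCollection":
        # Applying a transform only extends the graph; nothing runs here.
        return ToyPCollection(self.pipeline, producer=transform, parent=self)


class ToyPipeline:
    """Owns the graph; in this sketch it only creates root collections."""

    def create(self, elements: List) -> ToyPCollection:
        source = ToyPTransform("Create", lambda _unused: iter(elements))
        return ToyPCollection(self, producer=source)


class ToyDirectRunner:
    """Decides where and how to execute: here, eagerly and in local memory."""

    def run(self, output: ToyPCollection) -> List:
        # Walk back to the root, then evaluate each transform in order.
        upstream = self.run(output.parent) if output.parent is not None else []
        return list(output.producer.fn(upstream))


# Build the graph first (no data is processed yet) ...
pipeline = ToyPipeline()
words = pipeline.create(["beam", "runner", "beam"])
upper = words.apply(ToyPTransform("Upper", lambda xs: (x.upper() for x in xs)))

# ... then hand it to a runner, which performs the actual computation.
result = ToyDirectRunner().run(upper)
print(result)  # ['BEAM', 'RUNNER', 'BEAM']
```

Because the graph is kept separate from the runner, the same description could in principle be handed to a different runner that executes it elsewhere, which is the idea behind the PipelineRunner concept.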

Let’s start with PCollection.