Beam Capability Matrix

Apache Beam provides a portable API layer for building sophisticated data-parallel processing pipelines that may be executed across a diversity of execution engines, or runners. The core concepts of this layer are based upon the Beam Model (formerly referred to as the Dataflow Model), and implemented to varying degrees in each Beam runner. To help clarify the capabilities of individual runners, we’ve created the capability matrix below.

Individual capabilities have been grouped by their corresponding What / Where / When / How question:

For more details on the What / Where / When / How breakdown of concepts, we recommend reading through the Streaming 102 post on O’Reilly Radar.

Note that in the future, we intend to add additional tables beyond the current set, for things like runtime characterstics (e.g. at-least-once vs exactly-once), performance, etc.

Beam ModelGoogle Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache Hadoop MapReduceJStormIBM StreamsApache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
ParDo
~
GroupByKey
~
~
Flatten
~
Combine
~
Composite Transforms
~
~
~
~
~
~
~
~
Side Inputs
~
~
Source API
~
~
Metrics
~
~
~
~
~
~
~
~
~
~
Stateful Processing
~
~
~
~
~
~
~
~
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache Hadoop MapReduceJStormIBM StreamsApache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
Base
~
~
Side Inputs
~
~
Splittable DoFn Initiated Checkpointing
~
~
Dynamic Splitting
~
Bundle Finalization
~
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache Hadoop MapReduceJStormIBM StreamsApache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
Base
Side Inputs
~
~
Splittable DoFn Initiated Checkpointing
Dynamic Splitting
Bundle Finalization
~
~
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache Hadoop MapReduceJStormIBM StreamsApache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
Global windows
~
Fixed windows
~
Sliding windows
~
Session windows
~
Custom windows
~
Custom merging windows
~
Timestamp control
~
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache Hadoop MapReduceJStormIBM StreamsApache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
Configurable triggering
~
Event-time triggers
~
Processing-time triggers
~
Count triggers
~
[Meta]data driven triggers

(BEAM-101)
~
Composite triggers
~
~
Allowed lateness
~
Timers
~
~
~
~
~
~
~
~
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache Hadoop MapReduceJStormIBM StreamsApache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
Discarding
~
Accumulating
Accumulating & Retracting

(BEAM-91)
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache Hadoop MapReduceJStormIBM StreamsApache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
Drain
~
~
~
Checkpoint
~
~
~
~