Clarifying & Formalizing Runner Capabilities

With initial code drops complete (Dataflow SDK and Runner, Flink Runner, Spark Runner) and expressed interest in runner implementations for Storm, Hadoop, and Gearpump (amongst others), we wanted to start addressing a big question in the Apache Beam (incubating) community: what capabilities will each runner be able to support?

While we’d love to have a world where all runners support the full suite of semantics included in the Beam Model (formerly referred to as the Dataflow Model), practically speaking, there will always be certain features that some runners can’t provide. For example, a Hadoop-based runner would be inherently batch-based and may be unable to (easily) implement support for unbounded collections. However, that doesn’t prevent it from being extremely useful for a large set of uses. In other cases, the implementations provided by one runner may have slightly different semantics that those provided by another (e.g. even though the current suite of runners all support exactly-once delivery guarantees, an Apache Samza runner, which would be a welcome addition, would currently only support at-least-once).

To help clarify things, we’ve been working on enumerating the key features of the Beam model in a capability matrix for all existing runners, categorized around the four key questions addressed by the model: What / Where / When / How (if you’re not familiar with those questions, you might want to read through Streaming 102 for an overview). This table will be maintained over time as the model evolves, our understanding grows, and runners are created or features added.

Included below is a summary snapshot of our current understanding of the capabilities of the existing runners (see the live version for full details, descriptions, and Jira links); since integration is still under way, the system as whole isn’t yet in a completely stable, usable state. But that should be changing in the near future, and we’ll be updating loud and clear on this blog when the first supported Beam 1.0 release happens.

In the meantime, these tables should help clarify where we expect to be in the very near term, and help guide expectations about what existing runners are capable of, and what features runner implementers will be tackling next.

Beam ModelGoogle Cloud DataflowApache FlinkApache Spark
ParDo
GroupByKey
~
Flatten
Combine
Composite Transforms
~
~
~
Side Inputs
~
~
Source API
~
Aggregators
~
~
~
~
Keyed State
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark
Global windows
Fixed windows
~
Sliding windows
Session windows
Custom windows
Custom merging windows
Timestamp control
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark
Configurable triggering
Event-time triggers
Processing-time triggers
Count triggers
[Meta]data driven triggers
Composite triggers
Allowed lateness
Timers
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark
Discarding
Accumulating
Accumulating & Retracting
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark
Base
~
~
Side Inputs
~
~
Splittable DoFn Initiated Checkpointing
~
~
Dynamic Splitting
~
Bundle Finalization
~
Beam ModelGoogle Cloud DataflowApache FlinkApache Spark
Base
~
Side Inputs
Splittable DoFn Initiated Checkpointing
~
Dynamic Splitting
Bundle Finalization
~