# Clarifying & Formalizing Runner Capabilities

2016/03/17

Frances Perry [@francesjperry] & Tyler Akidau [@takidau]
With initial code drops complete (Dataflow SDK and Runner, Flink Runner, Spark Runner) and expressed interest in runner implementations for Storm, Hadoop, and Gearpump (amongst others), we wanted to start addressing a big question in the Apache Beam (incubating) community: what capabilities will each runner be able to support?
While we’d love to have a world where all runners support the full suite of semantics included in the Beam Model (formerly referred to as the Dataflow Model), practically speaking, there will always be certain features that some runners can’t provide. For example, a Hadoop-based runner would be inherently batch-based and may be unable to (easily) implement support for unbounded collections. However, that doesn’t prevent it from being extremely useful for a large set of uses. In other cases, the implementations provided by one runner may have slightly different semantics than those provided by another (e.g., even though the current suite of runners all support exactly-once delivery guarantees, an Apache Samza runner, which would be a welcome addition, would currently only support at-least-once).
To help clarify things, we’ve been working on enumerating the key features of the Beam model in a capability matrix for all existing runners, categorized around the four key questions addressed by the model: What / Where / When / How (if you’re not familiar with those questions, you might want to read through Streaming 102 for an overview). This matrix will be maintained over time as the model evolves, our understanding grows, and runners are created or features added.
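To make those four questions concrete before diving into the matrix, here's a minimal sketch of how they surface in pipeline code, written against the Beam Java SDK (the `input` collection and the particular durations here are assumed, purely for illustration):

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// What: integer sums per key.
// Where: two-minute fixed windows in event time.
// When: when the watermark passes the end of the window, then once per late element.
// How: each firing emits the accumulated contents of the pane so far.
PCollection<KV<String, Integer>> sums = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
```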
Included below is a summary snapshot of our current understanding of the capabilities of the existing runners (see the live version for full details, descriptions, and Jira links); since integration is still under way, the system as a whole isn’t yet in a completely stable, usable state. But that should be changing in the near future, and we’ll announce it loud and clear on this blog when the first supported Beam 1.0 release happens.
In the meantime, these tables should help clarify where we expect to be in the very near term, guide expectations about what existing runners are capable of, and indicate which features runner implementers will be tackling next.
| What is being computed? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark |
|---|---|---|---|---|
| ParDo | | | | |
| GroupByKey | | | | |
| Flatten | | | | |
| Combine | | | | |
| Composite Transforms | | | | |
| Side Inputs | | | | |
| Source API | | | | |
| Aggregators | | | | |
| Keyed State | | | | |
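As a quick illustration of the first two rows: `ParDo` and `GroupByKey` are the primitives everything else builds on, and composite transforms like `Count` layer on top of them. A word-count sketch in the Beam Java SDK (the `lines` collection is assumed):

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// ParDo performs element-wise computation; Count.perElement() is a composite
// transform built from GroupByKey and Combine.
PCollection<KV<String, Long>> wordCounts = lines
    .apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (String word : c.element().split("\\s+")) {
          c.output(word);
        }
      }
    }))
    .apply(Count.perElement());
```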
| Where in event time? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark |
|---|---|---|---|---|
| Global windows | | | | |
| Fixed windows | | | | |
| Sliding windows | | | | |
| Session windows | | | | |
| Custom windows | | | | |
| Custom merging windows | | | | |
| Timestamp control | | | | |
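Each of these window types corresponds to a `WindowFn` passed to `Window.into` in the Java SDK; a sketch with illustrative durations, assuming `input` is a `PCollection<String>`:

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Fixed windows: non-overlapping, one hour wide.
input.apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))));

// Sliding windows: one hour wide, with a new window starting every five minutes.
input.apply(Window.<String>into(
    SlidingWindows.of(Duration.standardHours(1)).every(Duration.standardMinutes(5))));

// Session windows: per-key merging windows that close after a ten-minute gap in activity.
input.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));
```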
| When in processing time? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark |
|---|---|---|---|---|
| Configurable triggering | | | | |
| Event-time triggers | | | | |
| Processing-time triggers | | | | |
| Count triggers | | | | |
| [Meta]data driven triggers | | | | |
| Composite triggers | | | | |
| Allowed lateness | | | | |
| Timers | | | | |
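Triggers compose: the sketch below (again in the Beam Java SDK, with illustrative durations) combines an event-time trigger for the on-time pane with processing-time early firings and count-based late firings:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Early firings: speculative results one minute (processing time) after the
// first element in the pane. On-time firing: when the event-time watermark
// passes the end of the window. Late firings: once per late element, for up
// to one day of allowed lateness.
Window.<String>into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardDays(1))
    .discardingFiredPanes();  // accumulation mode: the "How" question, below
```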
| How do refinements relate? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark |
|---|---|---|---|---|
| Discarding | | | | |
| Accumulating | | | | |
| Accumulating & Retracting | | | | |
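The first two modes map directly onto the accumulation-mode setting of `Window` in the Java SDK (retractions are not yet surfaced in the API, per the matrix); a sketch with illustrative durations:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Discarding: each firing emits only the delta since the previous firing.
Window.<String>into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardMinutes(30))
    .discardingFiredPanes();

// Accumulating: each firing emits the full contents of the pane so far.
Window.<String>into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardMinutes(30))
    .accumulatingFiredPanes();
```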
| Bounded Splittable DoFn Support Status | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark |
|---|---|---|---|---|
| Base | | | | |
| Side Inputs | | | | |
| Splittable DoFn Initiated Checkpointing | | | | |
| Dynamic Splitting | | | | |
| Bundle Finalization | | | | |
| Unbounded Splittable DoFn Support Status | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark |
|---|---|---|---|---|
| Base | | | | |
| Side Inputs | | | | |
| Splittable DoFn Initiated Checkpointing | | | | |
| Dynamic Splitting | | | | |
| Bundle Finalization | | | | |
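For readers unfamiliar with the feature these last two tables track: Splittable DoFn generalizes the Source API by expressing each element's work as a restriction that the runner can split and checkpoint. A minimal bounded sketch, assuming the Java SDK's built-in `OffsetRange` restriction (the element type and emitted values here are purely illustrative):

```java
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Emits one output per position in the element's offset range. Because the
// work is expressed as a claimable restriction, a runner can split the range
// or checkpoint mid-element instead of treating the element as atomic.
@DoFn.BoundedPerElement
class EmitRangePositions extends DoFn<Long, Long> {
  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element Long numPositions) {
    return new OffsetRange(0, numPositions);
  }

  @ProcessElement
  public void processElement(
      ProcessContext c, RestrictionTracker<OffsetRange, Long> tracker) {
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
      c.output(i);
    }
  }
}
```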