back to collapsed details

What is being computed?

ParDo
GroupByKey
Flatten
Combine
Composite Transforms
Side Inputs
Source API
Metrics
Stateful Processing
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner

Yes : fully supported


Batch mode uses large bundle sizes. Streaming uses smaller bundle sizes.

Yes : fully supported


ParDo itself, as per-element transformation with UDFs, is fully supported by Flink for both batch and streaming.

Yes : fully supported


ParDo applies per-element transformations as Spark FlatMapFunction.

Partially : fully supported in batch mode


ParDo applies per-element transformations as Spark FlatMapFunction.

Yes : fully supported


Supported with per-element transformation.

Yes : fully supported


Yes : fully supported


Yes : fully supported




Yes : fully supported


Yes : fully supported


Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes) the Flink runner uses the Beam code. This guarantees support for all windowing and triggering mechanisms.

Partially : fully supported in batch mode


Using Spark's <tt>groupByKey</tt>. GroupByKey with multiple trigger firings in streaming mode is a work in progress.

Partially : fully supported in batch mode


Using Spark's <tt>groupByKey</tt>.

Yes : fully supported


Uses Samza's partitionBy for key grouping and Beam's logic for window aggregation and triggering.

Yes : fully supported


Yes : fully supported


Yes : fully supported




Yes : fully supported


Yes : fully supported


Yes : fully supported


Partially : fully supported in batch mode


Some corner cases like flatten on empty collections are not yet supported.

Yes : fully supported


Yes : fully supported


Yes : fully supported


Yes : fully supported




Yes : efficient execution


Yes : fully supported


Uses a combiner for pre-aggregation for batch and streaming.

Yes : fully supported


Using Spark's <tt>combineByKey</tt> and <tt>aggregate</tt> functions.

Partially : fully supported in batch mode


Using Spark's <tt>Aggregator</tt> and agg function

Yes : fully supported


Use combiner for efficient pre-aggregation.

Yes : fully supported


Batch mode uses pre-aggregation

Yes : fully supported


Batch mode uses pre-aggregation

Yes : fully supported




Partially : supported via inlining


Currently composite transformations are inlined during execution. The structure is later recreated from the names, but other transform level information (if added to the model) will be lost.

Partially : supported via inlining


Partially : supported via inlining


Partially : supported via inlining only in batch mode


Partially : supported via inlining


Yes : fully supported


Partially : supported via inlining


Partially : supported via inlining




Yes : some size restrictions in streaming


Batch mode supports a distributed implementation, but streaming mode may force some size restrictions. Neither mode is able to push lookups directly up into key-based sources.

Yes : some size restrictions in streaming


Batch mode supports a distributed implementation, but streaming mode may force some size restrictions. Neither mode is able to push lookups directly up into key-based sources.

Yes : fully supported


Using Spark's broadcast variables. In streaming mode, side inputs may update but only between micro-batches.

Partially : fully supported in batch mode


Using Spark's broadcast variables.

Yes : fully supported


Uses Samza's broadcast operator to distribute the side inputs.

Yes : fully supported


Partially : with restrictions


Supported only when the side input source is bounded and windowing uses global window

Yes : fully supported




Yes : fully supported


Support includes autotuning features (https://cloud.google.com/dataflow/service/dataflow-service-desc#autotuning-features).

Yes : fully supported


Yes : fully supported


Partially : bounded source only


Using Spark's DatasourceV2 API in microbatch mode (Continuous streaming mode is tagged experimental in spark and does not support aggregation).

Yes : fully supported


Yes : fully supported


Yes : fully supported


Yes : fully supported




Partially


Gauge metrics are not supported. All other metric types are supported.

Partially : All metrics types are supported.


Only attempted values are supported. No committed values for metrics.

Partially : All metric types are supported.


Only attempted values are supported. No committed values for metrics.

Partially : All metric types are supported in batch mode.


Only attempted values are supported. No committed values for metrics.

Partially : Counter and Gauge are supported.


Only attempted values are supported. No committed values for metrics.

No : not implemented


Partially : All metrics types supported, both in batching and streaming mode.


Doesn't differentiate between committed and attempted values.

No : not implemented




Partially : non-merging windows


State is supported for non-merging windows. The MapState, SetState, and MultimapState state types are supported in the following scenarios: Java pipelines that don't use Streaming Engine; Java pipelines that use Streaming Engine and version 2.58.0 or later of the Java SDK. SetState, MapState, and MultimapState are not supported for pipelines that use Runner v2.

Partially : non-merging windows


State is supported for non-merging windows. SetState and MapState are not yet supported.

Partially : full support in batch mode


No : not implemented


Partially : non-merging windows


States are backed up by either rocksDb KV store or in-memory hash map, and persist using changelog.

No : not implemented


Partially : non-merging windows


No : not implemented




Last updated on 2024/11/14

Have you found everything you were looking for?

Was it all useful and clear? Is there anything that you would like to change? Let us know!