Splittable DoFn in Apache Beam is Ready to Use

blog

2020/12/14

Splittable DoFn in Apache Beam is Ready to Use

Boyuan Zhang

We are pleased to announce that Splittable DoFn (SDF) is ready for use in the Beam Python, Java, and Go SDKs for versions 2.25.0 and later.

In 2017, Splittable DoFn Blog Post proposed to build Splittable DoFn APIs as the new recommended way of building I/O connectors. Splittable DoFn is a generalization of DoFn that gives it the core capabilities of Source while retaining DoFn’s syntax, flexibility, modularity, and ease of coding. Thus, it becomes much easier to develop complex I/O connectors with simpler and reusable code.

SDF has three advantages over the existing UnboundedSource and BoundedSource:

SDF provides a unified set of APIs to handle both unbounded and bounded cases.
SDF enables reading from source descriptors dynamically.
- Taking KafkaIO as an example, within UnboundedSource/BoundedSource API, you must specify the topic and partition you want to read from during pipeline construction time. There is no way for UnboundedSource/BoundedSource to accept topics and partitions as inputs during execution time. But it’s built-in to SDF.
SDF fits in as any node on a pipeline freely with the ability of splitting.
- UnboundedSource/BoundedSource has to be the root node of the pipeline to gain performance benefits from splitting strategies, which limits many real-world usages. This is no longer a limit for an SDF.

As SDF is now ready to use with all the mentioned improvements, it is the recommended way to build the new I/O connectors. Try out building your own Splittable DoFn by following the programming guide. We have provided tonnes of common utility classes such as common types of RestrictionTracker and WatermarkEstimator in Beam SDK, which will help you onboard easily. As for the existing I/O connectors, we have wrapped UnboundedSource and BoundedSource implementations into Splittable DoFns, yet we still encourage developers to convert UnboundedSource/BoundedSource into actual Splittable DoFn implementation to gain more performance benefits.

Many thanks to every contributor who brought this highly anticipated design into the data processing world. We are really excited to see that users benefit from SDF.

Below are some real-world SDF examples for you to explore.

Real world Splittable DoFn examples

Java Examples

Kafka: An I/O connector for Apache Kafka (an open-source distributed event streaming platform).
Watch: Uses a polling function producing a growing set of outputs for each input until a per-input termination condition is met.
Parquet: An I/O connector for Apache Parquet (an open-source columnar storage format).
HL7v2: An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events that occur inside an organization) part of Google’s Cloud Healthcare API.
BoundedSource wrapper: A wrapper which converts an existing BoundedSource implementation to a splittable DoFn.
UnboundedSource wrapper: A wrapper which converts an existing UnboundedSource implementation to a splittable DoFn.

Python Examples

BoundedSourceWrapper: A wrapper which converts an existing BoundedSource implementation to a splittable DoFn.

Go Examples

textio.ReadSdf implements reading from text files using a splittable DoFn.

Real world Splittable DoFn examples

Latest from the blog