blog
2020/12/14
Splittable DoFn in Apache Beam is Ready to UseBoyuan Zhang
We are pleased to announce that Splittable DoFn (SDF) is ready for use in the Beam Python, Java, and Go SDKs for versions 2.25.0 and later.
In 2017, Splittable DoFn Blog Post proposed
to build Splittable DoFn APIs as the new recommended way of
building I/O connectors. Splittable DoFn is a generalization of DoFn
that gives it the core
capabilities of Source
while retaining DoFn
’s syntax, flexibility, modularity, and ease of
coding. Thus, it becomes much easier to develop complex I/O connectors with simpler and reusable
code.
SDF has three advantages over the existing UnboundedSource
and BoundedSource
:
- SDF provides a unified set of APIs to handle both unbounded and bounded cases.
- SDF enables reading from source descriptors dynamically.
- Taking KafkaIO as an example, within
UnboundedSource
/BoundedSource
API, you must specify the topic and partition you want to read from during pipeline construction time. There is no way forUnboundedSource
/BoundedSource
to accept topics and partitions as inputs during execution time. But it’s built-in to SDF.
- Taking KafkaIO as an example, within
- SDF fits in as any node on a pipeline freely with the ability of splitting.
UnboundedSource
/BoundedSource
has to be the root node of the pipeline to gain performance benefits from splitting strategies, which limits many real-world usages. This is no longer a limit for an SDF.
As SDF is now ready to use with all the mentioned improvements, it is the recommended
way to build the new I/O connectors. Try out building your own Splittable DoFn by following the
programming guide. We
have provided tonnes of common utility classes such as common types of RestrictionTracker
and
WatermarkEstimator
in Beam SDK, which will help you onboard easily. As for the existing I/O
connectors, we have wrapped UnboundedSource
and BoundedSource
implementations into Splittable
DoFns, yet we still encourage developers to convert UnboundedSource
/BoundedSource
into actual
Splittable DoFn implementation to gain more performance benefits.
Many thanks to every contributor who brought this highly anticipated design into the data processing world. We are really excited to see that users benefit from SDF.
Below are some real-world SDF examples for you to explore.
Real world Splittable DoFn examples
Java Examples
- Kafka: An I/O connector for Apache Kafka (an open-source distributed event streaming platform).
- Watch: Uses a polling function producing a growing set of outputs for each input until a per-input termination condition is met.
- Parquet: An I/O connector for Apache Parquet (an open-source columnar storage format).
- HL7v2: An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events that occur inside an organization) part of Google’s Cloud Healthcare API.
- BoundedSource wrapper: A wrapper which converts an existing BoundedSource implementation to a splittable DoFn.
- UnboundedSource wrapper: A wrapper which converts an existing UnboundedSource implementation to a splittable DoFn.
Python Examples
- BoundedSourceWrapper: A wrapper which converts an existing BoundedSource implementation to a splittable DoFn.
Go Examples
- textio.ReadSdf implements reading from text files using a splittable DoFn.