Overview: Developing a new I/O connector

A guide for users who need to connect to a data store that isn’t supported by the Built-in I/O connectors

To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector. A connector usually consists of a source and a sink. All Beam sources and sinks are composite transforms; however, the implementation of your custom I/O depends on your use case. Here are the recommended steps to get started:

  1. Read this overview and choose your implementation. You can email the Beam dev mailing list with any questions you might have. In addition, you can check if anyone else is working on the same I/O connector.

  2. If you plan to contribute your I/O connector to the Beam community, see the Apache Beam contribution guide.

  3. Read the PTransform style guide for additional style guide recommendations.

Sources

For bounded (batch) sources, there are currently two options for creating a Beam source:

  1. Use Splittable DoFn.

  2. Use ParDo and GroupByKey.

Splittable DoFn is the recommended option: it is the most recent source framework for both bounded and unbounded sources, and it is intended to replace the older Source APIs (BoundedSource and UnboundedSource). Read the Splittable DoFn Programming Guide to learn how to write one. For more information, see the roadmap for multi-SDK connector efforts.

For Java and Python unbounded (streaming) sources, you must use Splittable DoFn, which supports features that are useful for streaming pipelines, including checkpointing, watermark control, and backlog tracking.

When to use the Splittable DoFn interface

If you are not sure whether to use Splittable DoFn, feel free to email the Beam dev mailing list and we can discuss the specific pros and cons of your case.

In some cases, implementing a Splittable DoFn might be necessary or can result in better performance. For example, you might want to read from a new file format that contains many records per file, or from a key-value store that supports read operations in sorted key order.
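
To give a sense of the shape of a Splittable DoFn, here is a minimal Python sketch of the file-format case, modeled on the file-reading example in the Splittable DoFn Programming Guide. The read_next_record helper is hypothetical; it stands in for whatever record-parsing logic your format needs.

    import os

    import apache_beam as beam
    from apache_beam.io.restriction_trackers import (
        OffsetRange, OffsetRestrictionTracker)

    class FileRestrictionProvider(beam.transforms.core.RestrictionProvider):
      def initial_restriction(self, file_name):
        # The initial restriction is the byte range of the whole file.
        return OffsetRange(0, os.path.getsize(file_name))

      def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

      def restriction_size(self, file_name, restriction):
        return restriction.size()

    class ReadFileFn(beam.DoFn):
      def process(
          self,
          file_name,
          tracker=beam.DoFn.RestrictionParam(FileRestrictionProvider())):
        with open(file_name, 'rb') as file_handle:
          file_handle.seek(tracker.current_restriction().start)
          while tracker.try_claim(file_handle.tell()):
            # read_next_record is a hypothetical helper that parses one
            # record at the current offset and advances the file handle.
            yield read_next_record(file_handle)

Because the restriction is an OffsetRange, the runner can split it and process sub-ranges on different workers, or checkpoint partway through a file.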

I/O examples using SDFs

Java Examples

Python Examples

Using ParDo and GroupByKey

For data stores or file types where the data can be read in parallel, you can think of the process as a mini-pipeline. This often consists of two steps:

  1. Splitting the data into parts to be read in parallel

  2. Reading from each of those parts

Each of those steps will be a ParDo, with a GroupByKey in between. The GroupByKey is an implementation detail, but for most runners it allows the runner to use different numbers of workers in some situations:

  1. Determining how to split up the data to be read into chunks, which usually needs few workers

  2. Reading the data, which often benefits from more workers

In addition, GroupByKey also allows dynamic work rebalancing to happen on runners that support the feature.
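
As an illustration, here is a minimal Python sketch of the mini-pipeline pattern for reading a file glob. The transform names and the line-per-record format are assumptions for the example:

    import glob

    import apache_beam as beam

    class ExpandGlobFn(beam.DoFn):
      def process(self, pattern):
        # Key each path by itself so the GroupByKey that follows can
        # redistribute the reads across workers.
        for path in glob.glob(pattern):
          yield (path, None)

    class ReadLinesFn(beam.DoFn):
      def process(self, keyed_path):
        path, _unused = keyed_path
        with open(path) as file_handle:
          for line in file_handle:
            yield line.rstrip('\n')

    with beam.Pipeline() as pipeline:
      records = (
          pipeline
          | 'Glob' >> beam.Create(['/tmp/data/*.txt'])
          | 'Split' >> beam.ParDo(ExpandGlobFn())
          | 'Redistribute' >> beam.GroupByKey()
          | 'Read' >> beam.ParDo(ReadLinesFn()))

The GroupByKey between the two ParDos is what lets the runner fan the reads out to more workers than the splitting step used.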

Here are some examples of read transform implementations that use the “reading as a mini-pipeline” model when data can be read in parallel:

  1. Reading from a file glob: for example, reading all files in “~/data/**”. A “get file paths” ParDo expands the glob into a PCollection of individual file paths, and a “reading” ParDo then reads each file, producing a PCollection of records.

  2. Reading from a NoSQL database (such as Apache HBase): these databases often allow reading from sub-ranges in parallel. A “determine key ranges” ParDo computes the key ranges, and a “read key range” ParDo then reads each range.

For data stores or files where reading cannot occur in parallel, reading is a simple task that can be accomplished with a single ParDo+GroupByKey. For example:

  1. Reading from a database query: traditional SQL database queries often can only be read sequentially.

  2. Reading from a gzip file: a gzip file must be read sequentially, so the read cannot be parallelized.
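
In the sequential case the whole read can live in one DoFn. Here is a hypothetical sketch, assuming a my_database client library (not a real package):

    import apache_beam as beam

    class ReadQueryFn(beam.DoFn):
      def setup(self):
        # my_database is a hypothetical client library standing in
        # for your store's real client.
        import my_database
        self._connection = my_database.connect()

      def process(self, query):
        # The query results can only be consumed sequentially, so a
        # single DoFn instance performs the whole read.
        for row in self._connection.execute(query):
          yield row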

Sinks

To create a Beam sink, we recommend that you use a ParDo that writes the received records to the data store. To develop more complex sinks (for example, to support data de-duplication when failures are retried by a runner), use ParDo, GroupByKey, and other available Beam transforms. Many data services are optimized to write batches of elements at a time, so it often makes sense to group elements into batches before writing. Likewise, persistent connections can be initialized in a DoFn’s setUp or startBundle method rather than on receipt of every element. Finally, note that in a large-scale distributed system, work can fail and be retried, so where possible it is preferable to make the external interactions idempotent.
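
For example, here is a minimal Python sketch of a batching sink DoFn. In the Python SDK the lifecycle methods are named setup and start_bundle; my_client is a hypothetical client library, and the batch size is an assumption for the example:

    import apache_beam as beam

    class WriteBatchesFn(beam.DoFn):
      MAX_BATCH_SIZE = 500

      def setup(self):
        # Expensive connection: create once per DoFn instance, not
        # per element. my_client is a hypothetical client library.
        import my_client
        self._client = my_client.connect()

      def start_bundle(self):
        self._batch = []

      def process(self, element):
        self._batch.append(element)
        if len(self._batch) >= self.MAX_BATCH_SIZE:
          self._flush()

      def finish_bundle(self):
        # Write any remaining elements before the bundle completes.
        if self._batch:
          self._flush()

      def _flush(self):
        # Writes should be idempotent (e.g. keyed upserts) so that
        # runner retries do not duplicate data.
        self._client.write(self._batch)
        self._batch = []

      def teardown(self):
        self._client.close()

Grouping into batches amortizes the per-request overhead, and initializing the connection in setup keeps one connection per DoFn instance instead of one per element.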

For file-based sinks, you can use the FileBasedSink abstraction that is provided by both the Java and Python SDKs. Beam’s FileSystems utility classes can also be useful for reading and writing files. See our language-specific implementation guides for more details: