Apache Beam Typescript SDK
The Typescript SDK for Apache Beam provides a simple, powerful API for building batch and streaming data processing pipelines.
Get started with the Typescript SDK
Get started with the Beam Typescript SDK quickstart to set up your development environment, get the Beam SDK for Typescript, and run an example pipeline. Then, read through the Beam programming guide to learn the basic concepts that apply to all SDKs in Beam.
Overview
We generally try to apply the concepts from the Beam API in a TypeScript-idiomatic way. In addition, there are some notable departures from the traditional SDKs:
We take a “relational foundations” approach, where schema’d data is the primary way to interact with data. We generally eschew transforms that require key-value pairs in favor of a more flexible approach of naming fields or expressions; for example, we favor the GroupBy PTransform over the traditional GroupByKey. JavaScript’s native Object is used as the row type.
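As an illustrative sketch (the groupBy factory, the import path, and the purchases collection are assumptions here rather than verbatim SDK documentation), grouping by a named field might look like:

import * as beam from "apache-beam";

// Rows are plain JavaScript objects; grouping names a field rather than
// requiring the data to be reshaped into key-value pairs.
// `purchases` is an assumed PCollection<{item: string, amount: number}>.
const byItem = purchases.apply(beam.groupBy("item"));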
As part of being schema-first, we also de-emphasize Coders as a first-class concept in the SDK, relegating them to an advanced feature used for interop. Though we can infer schemas from individual elements, it is still TBD whether and how we can leverage the type system and/or function introspection to regularly infer schemas at construction time. A fallback coder using BSON encoding is used when we don’t have sufficient type information.
We have added additional methods to the PCollection object, notably map and flatMap, rather than only allowing apply. In addition, apply can accept a function argument (PCollection) => ... as well as a PTransform subclass, which treats this callable as if it were a PTransform’s expand.
In the other direction, we have eliminated the problematic Pipeline object from the API, instead providing a Root PValue on which pipelines are built, and invoking run() on a Runner. We offer a less error-prone Runner.run, which finishes only when the pipeline is completely finished, as well as Runner.runAsync, which returns a handle to the running pipeline.
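A minimal sketch of this shape (the createRunner import and its default options are assumptions; consult the runner documentation for the options your runner needs):

import * as beam from "apache-beam";
import { createRunner } from "apache-beam/runners/runner";

async function main() {
  // Pipelines are built on the Root passed to the callback; run() resolves
  // only once the pipeline has completely finished.
  await createRunner().run((root: beam.Root) => {
    root.apply(beam.create(["a", "b", "c"])).map(console.log);
  });
}

main().catch(console.error);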
Rather than introduce PCollectionTuple, PCollectionList, etc., we let a PValue literally be an array or object with PValue values, which transforms can consume or produce. These are applied by wrapping them with the P operator, e.g. P([pc1, pc2, pc3]).apply(new Flatten()).
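Spelled out as a sketch (pc1, pc2, and pc3 are assumed PCollections of the same element type; the imports of P and Flatten are not shown):

// Merge several PCollections by wrapping them in the P operator.
const merged = P([pc1, pc2, pc3]).apply(new Flatten());

// An object of PValues works the same way for transforms that consume or
// produce named values, e.g. P({main: pc1, other: pc2}).apply(...).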
Like Python, flatMap and ParDo.process return multiple elements by yielding them from a generator, rather than invoking a passed-in callback. There is currently an operation to split a PCollection into multiple PCollections based on the properties of the elements, and we may consider using a callback for side outputs.
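For example, a sketch of yielding multiple outputs (lines is an assumed PCollection<string>):

// Each yielded value becomes an element of the resulting PCollection.
const words = lines.flatMap(function* (line: string) {
  yield* line.toLowerCase().split(/[^a-z]+/);
});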
The map, flatMap, and ParDo.process methods take an additional (optional) context argument, which is similar to the keyword arguments used in Python. These are JavaScript objects whose members may be constants (which are passed as is) or special DoFnParam objects which provide getters to element-specific information (such as the current timestamp, window, or side input) at runtime.
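A small sketch of a context with a constant member (maxLength is an illustrative name; the DoFnParam helpers for timestamps, windows, and side inputs are not shown here):

// The context object is passed to the callback at runtime; constant
// members arrive as-is.
const trimmed = words.map(
  (word: string, context) => word.slice(0, context.maxLength),
  { maxLength: 10 }
);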
Rather than introduce multiple-output complexity into the map/do operations themselves, producing multiple outputs is done by following them with a new Split primitive that takes a PCollection<{a?: AType, b?: BType, ... }> and produces an object {a: PCollection<AType>, b: PCollection<BType>, ...}.
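As a sketch only (beam.split below is a hypothetical spelling standing in for the Split primitive; check the SDK reference for the actual name and signature):

// Tag each element with exactly one of the output names...
const tagged = numbers.flatMap(function* (n: number) {
  if (n % 2 === 0) {
    yield { even: n };
  } else {
    yield { odd: n };
  }
});

// ...then split into one PCollection per tag. NOTE: beam.split(...) is a
// hypothetical spelling used only for illustration here.
const { even, odd } = tagged.apply(beam.split(["even", "odd"]));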
JavaScript supports (and encourages) an asynchronous programming model, with many libraries requiring use of the async/await paradigm. As there is no way (by design) to go from the asynchronous style back to the synchronous style, this needs to be taken into account when designing the API. We currently offer asynchronous variants of PValue.apply(...) (in addition to the synchronous ones, as they are easier to chain) as well as making Runner.run asynchronous. Whether to do this for all user callbacks as well is still TBD.
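For example, a sketch of the two run styles (the runner instance and the buildPipeline callback are assumed):

// run() resolves only when the pipeline has completely finished...
const result = await runner.run(buildPipeline);

// ...while runAsync() returns a handle to the still-running pipeline.
const handle = await runner.runAsync(buildPipeline);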
An example pipeline can be found at wordcount.ts, and more documentation is available in the Beam programming guide.
Pipeline I/O
See the Beam-provided I/O Transforms page for a list of the currently available I/O transforms.
Supported Features
The Typescript SDK is still under development and does not yet support every feature of the Beam model, but it already supports many of them, both batch and streaming. It also has extensive support for cross-language transforms, which can be leveraged to use more advanced features from Typescript pipelines.
Serialization
As Beam is designed to run in a distributed environment, all functions and data are required to be serializable.
By default, data is serialized using a BSON encoding, though this can be customized by applying the withRowCoder or withCoderInternal transforms to a PCollection.
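A sketch of overriding the fallback (the exemplar-based withRowCoder call and its import are assumptions; see the SDK reference for the exact form):

import * as beam from "apache-beam";

// Use a schema'd row coder inferred from an example element instead of
// the BSON fallback. `rows` is an assumed PCollection of such objects.
const withSchema = rows.apply(beam.withRowCoder({ word: "", count: 0 }));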
Functions that are used in transforms (such as map), including closures and their captured data, are serialized via ts-serialize-closures. While this handles most cases well, it still has limitations and, in its walk of the transitive closure of referenced objects, may capture objects that are better imported rather than serialized.
To avoid these limitations, one can explicitly register references with the requireForSerialization function as follows.
// in module my_package/module_to_be_required
import { requireForSerialization } from "apache-beam/serialization";
// define or import various objects_to_register here
requireForSerialization(
  "my_package/module_to_be_required", { objects_to_register });
The starter project has such an example.