apache_beam.transforms.core module

Core PTransform subclasses, such as FlatMap, GroupByKey, and Map.

class apache_beam.transforms.core.DoFn(*unused_args, **unused_kwargs)[source]

Bases: WithTypeHints, HasDisplayData, RunnerApiFn

A function object used by a transform with custom processing.

The ParDo transform is such a transform. The ParDo.apply method will take an object of type DoFn and apply it to all elements of a PCollection object.

In order to have concrete DoFn objects one has to subclass from DoFn and define the desired behavior (start_bundle/finish_bundle and process) or wrap a callable object using the CallableWrapperDoFn class.

ElementParam = ElementParam
SideInputParam = SideInputParam
TimestampParam = TimestampParam
WindowParam = WindowParam
WindowedValueParam = WindowedValueParam
PaneInfoParam = PaneInfoParam
WatermarkEstimatorParam

alias of _WatermarkEstimatorParam

BundleFinalizerParam

alias of _BundleFinalizerParam

KeyParam = KeyParam
BundleContextParam

alias of _BundleContextParam

SetupContextParam

alias of _SetupContextParam

StateParam

alias of _StateDoFnParam

TimerParam

alias of _TimerDoFnParam

DynamicTimerTagParam = DynamicTimerTagParam
DoFnProcessParams = [ElementParam, SideInputParam, TimestampParam, WindowParam, WindowedValueParam, WatermarkEstimatorParam, PaneInfoParam, BundleFinalizerParam, KeyParam, StateParam, TimerParam, BundleContextParam, SetupContextParam]
RestrictionParam

alias of _RestrictionDoFnParam

static from_callable(fn)[source]
static unbounded_per_element()[source]

A decorator on process fn specifying that the fn performs an unbounded amount of work per input element.

static yields_elements(fn)[source]

A decorator to apply to process_batch indicating it yields elements.

By default process_batch is assumed to both consume and produce “batches”, which are collections of multiple logical Beam elements. This decorator indicates that process_batch produces individual elements at a time. process_batch is always expected to consume batches.

static yields_batches(fn)[source]

A decorator to apply to process indicating it yields batches.

By default process is assumed to both consume and produce individual elements at a time. This decorator indicates that process produces “batches”, which are collections of multiple logical Beam elements.
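
For illustration, a minimal batched DoFn sketch under the default convention (process_batch both consumes and produces batches). It assumes NumPy arrays as the batch type; the class name is hypothetical:

from typing import Iterator

import numpy as np

import apache_beam as beam


class MultiplyByTwo(beam.DoFn):  # hypothetical example
  # By default, process_batch consumes a batch and produces a batch.
  def process_batch(self, batch: np.ndarray, *args,
                    **kwargs) -> Iterator[np.ndarray]:
    yield batch * 2

  # Declare the element-wise output type so Beam can relate batches
  # to logical elements.
  def infer_output_type(self, input_element_type):
    return input_element_type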

default_label()[source]
process(element, *args, **kwargs)[source]

Method to use for processing elements.

This is invoked by DoFnRunner for each element of an input PCollection.

The following parameters can be used as default values on process arguments to indicate that a DoFn accepts the corresponding parameters. For example, a DoFn might accept the element and its timestamp with the following signature:

def process(element=DoFn.ElementParam, timestamp=DoFn.TimestampParam):
  ...

The full set of parameters is:

  • DoFn.ElementParam: element to be processed, should not be mutated.

  • DoFn.SideInputParam: a side input that may be used when processing.

  • DoFn.TimestampParam: timestamp of the input element.

  • DoFn.WindowParam: Window the input element belongs to.

  • DoFn.TimerParam: a userstate.RuntimeTimer object defined by the spec of the parameter.

  • DoFn.StateParam: a userstate.RuntimeState object defined by the spec of the parameter.

  • DoFn.KeyParam: key associated with the element.

  • DoFn.RestrictionParam: an iobase.RestrictionTracker will be provided here to allow treatment as a Splittable DoFn. The restriction tracker will be derived from the restriction provider in the parameter.

  • DoFn.WatermarkEstimatorParam: a function that can be used to track the output watermark of Splittable DoFn implementations.

  • DoFn.BundleContextParam: allows a shared context manager to be used per bundle.

  • DoFn.SetupContextParam: allows a shared context manager to be used per DoFn.

Parameters:
  • element – The element to be processed

  • *args – side inputs

  • **kwargs – other keyword arguments.

Returns:

An Iterable of output elements or None.
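
For illustration, a small sketch of a DoFn using some of these parameters (the class name and input values are hypothetical):

import apache_beam as beam


class AnnotateWithWindowFn(beam.DoFn):  # hypothetical example
  def process(
      self,
      element,
      timestamp=beam.DoFn.TimestampParam,
      window=beam.DoFn.WindowParam):
    # The runner substitutes the actual timestamp and window at call time.
    yield element, timestamp, window


with beam.Pipeline() as p:
  annotated = (
      p
      | beam.Create(['a', 'b'])
      | beam.ParDo(AnnotateWithWindowFn()))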

process_batch(batch, *args, **kwargs)[source]
setup()[source]

Called to prepare an instance for processing bundles of elements.

This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in DoFn.teardown.

start_bundle()[source]

Called before a bundle of elements is processed on a worker.

Elements to be processed are split into bundles and distributed to workers. Before a worker calls process() on the first element of its bundle, it calls this method.

finish_bundle()[source]

Called after a bundle of elements is processed on a worker.

teardown()[source]

Called to clean up this instance before it is discarded.

A runner will do its best to call this method on any given instance to prevent leaks of transient resources, however, there may be situations where this is impossible (e.g. process crash, hardware failure, etc.) or unnecessary (e.g. the pipeline is shutting down and the process is about to be killed anyway, so all transient resources will be released automatically by the OS). In these cases, the call may not happen. It will also not be retried, because in such situations the DoFn instance no longer exists, so there’s no instance to retry it on.

Thus, all work that depends on input elements, and all externally important side effects, must be performed in DoFn.process or DoFn.finish_bundle.
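
A minimal sketch of these lifecycle hooks (the class is hypothetical; the "client" stands in for any transient resource such as a network connection):

import apache_beam as beam


class BufferingFn(beam.DoFn):  # hypothetical example
  def setup(self):
    # Acquire transient resources once per DoFn instance.
    self._client = object()  # e.g. open a connection here

  def start_bundle(self):
    # Per-bundle initialization.
    self._buffer = []

  def process(self, element):
    self._buffer.append(element)
    yield element

  def finish_bundle(self):
    # Flush per-bundle state here; externally important side effects belong
    # in process() or finish_bundle(), not in teardown().
    self._buffer = []

  def teardown(self):
    # Best-effort cleanup; may not run on crashes.
    self._client = None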

get_function_arguments(func)[source]
default_type_hints()[source]
infer_output_type(input_type)[source]
get_input_batch_type(input_element_type) TypeConstraint | type | None[source]

Determine the batch type expected as input to process_batch.

The default implementation of get_input_batch_type simply observes the input typehint for the first parameter of process_batch. A Batched DoFn may override this method if a dynamic approach is required.

Parameters:

input_element_type – The element type of the input PCollection this DoFn is being applied to.

Returns:

None if this DoFn cannot accept batches, else a Beam typehint or a native Python typehint.

get_output_batch_type(input_element_type) TypeConstraint | type | None[source]

Determine the batch type produced by this DoFn’s process_batch implementation and/or its process implementation with @yields_batches.

The default implementation of this method observes the return type annotations on process_batch and/or process. A Batched DoFn may override this method if a dynamic approach is required.

Parameters:

input_element_type – The element type of the input PCollection this DoFn is being applied to.

Returns:

None if this DoFn will never yield batches, else a Beam typehint or a native Python typehint.

to_runner_api_parameter(context)
class apache_beam.transforms.core.CombineFn(*unused_args, **unused_kwargs)[source]

Bases: WithTypeHints, HasDisplayData, RunnerApiFn

A function object used by a Combine transform with custom processing.

A CombineFn specifies how multiple values in all or part of a PCollection can be merged into a single value—essentially providing the same kind of information as the arguments to the Python “reduce” builtin (except for the input argument, which is an instance of CombineFnProcessContext). The combining process proceeds as follows:

  1. Input values are partitioned into one or more batches.

  2. For each batch, the setup method is invoked.

  3. For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of zero values.

  4. For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.

  5. The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value, once all of the accumulators have had all the input values in their batches added to them. This operation is invoked repeatedly, until there is only one accumulator value left.

  6. The extract_output operation is invoked on the final accumulator to get the output value.

  7. The teardown method is invoked.

Note: If this CombineFn is used with a transform that has defaults, apply will be called with an empty list at expansion time to get the default value.
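
For illustration, a minimal CombineFn computing a mean, exercising the lifecycle above (the class name is hypothetical):

import apache_beam as beam


class MeanFn(beam.CombineFn):  # hypothetical example
  def create_accumulator(self):
    # (running sum, element count)
    return 0.0, 0

  def add_input(self, accumulator, element):
    total, count = accumulator
    return total + element, count + 1

  def merge_accumulators(self, accumulators):
    totals, counts = zip(*accumulators)
    return sum(totals), sum(counts)

  def extract_output(self, accumulator):
    total, count = accumulator
    return total / count if count else float('NaN')


with beam.Pipeline() as p:
  mean = p | beam.Create([1, 2, 3, 4]) | beam.CombineGlobally(MeanFn())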

default_label()[source]
setup(*args, **kwargs)[source]

Called to prepare an instance for combining.

This method can be useful if there is some state that needs to be loaded before executing any of the other methods. The resources can then be disposed of in CombineFn.teardown.

If you are using Dataflow, you need to enable Dataflow Runner V2 before using this feature.

Parameters:
  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

create_accumulator(*args, **kwargs)[source]

Return a fresh, empty accumulator for the combine operation.

Parameters:
  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

add_input(mutable_accumulator, element, *args, **kwargs)[source]

Return result of folding element into accumulator.

CombineFn implementors must override add_input.

Parameters:
  • mutable_accumulator – the current accumulator, may be modified and returned for efficiency

  • element – the element to add, should not be mutated

  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

add_inputs(mutable_accumulator, elements, *args, **kwargs)[source]

Returns the result of folding each element in elements into accumulator.

This is provided in case the implementation affords more efficient bulk addition of elements. The default implementation simply loops over the inputs invoking add_input for each one.

Parameters:
  • mutable_accumulator – the current accumulator, may be modified and returned for efficiency

  • elements – the elements to add, should not be mutated

  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

merge_accumulators(accumulators, *args, **kwargs)[source]

Returns the result of merging several accumulators to a single accumulator value.

Parameters:
  • accumulators – the accumulators to merge. Only the first accumulator may be modified and returned for efficiency; the other accumulators should not be mutated, because they may be shared with other code and mutating them could lead to incorrect results or data corruption.

  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

compact(accumulator, *args, **kwargs)[source]

Optionally returns a more compact representation of the accumulator.

This is called before an accumulator is sent across the wire, and can be useful in cases where values are buffered or otherwise lazily kept unprocessed when added to the accumulator. Should return an equivalent, though possibly modified, accumulator.

By default returns the accumulator unmodified.

Parameters:
  • accumulator – the current accumulator

  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

extract_output(accumulator, *args, **kwargs)[source]

Return result of converting accumulator into the output value.

Parameters:
  • accumulator – the final accumulator value computed by this CombineFn for the entire input key or PCollection. Can be modified for efficiency.

  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

teardown(*args, **kwargs)[source]

Called to clean up an instance before it is discarded.

If you are using Dataflow, you need to enable Dataflow Runner V2 before using this feature.

Parameters:
  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

apply(elements, *args, **kwargs)[source]

Returns result of applying this CombineFn to the input values.

Parameters:
  • elements – the set of values to combine.

  • *args – Additional arguments and side inputs.

  • **kwargs – Additional arguments and side inputs.

for_input_type(input_type)[source]

Returns a specialized implementation of self, if it exists.

Otherwise, returns self.

Parameters:

input_type – the type of input elements.

static from_callable(fn)[source]
static maybe_from_callable(fn: CombineFn | Callable, has_side_inputs: bool = True) CombineFn[source]
get_accumulator_coder()[source]
to_runner_api_parameter(context)
class apache_beam.transforms.core.PartitionFn(*unused_args, **unused_kwargs)[source]

Bases: WithTypeHints

A function object used by a Partition transform.

A PartitionFn specifies how individual values in a PCollection will be placed into separate partitions, indexed by an integer.

default_label()[source]
partition_for(element: T, num_partitions: int, *args: Any, **kwargs: Any) int[source]

Specify which partition will receive this element.

Parameters:
  • element – An element of the input PCollection.

  • num_partitions – Number of partitions, i.e., output PCollections.

  • *args – optional parameters and side inputs.

  • **kwargs – optional parameters and side inputs.

Returns:

An integer in [0, num_partitions).
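
For illustration, a minimal PartitionFn sketch (the class name is hypothetical); a plain callable with the same signature may be used instead:

import apache_beam as beam


class ByRemainderFn(beam.transforms.core.PartitionFn):  # hypothetical example
  def partition_for(self, element, num_partitions, *args, **kwargs):
    # Route each element to a partition index in [0, num_partitions).
    return element % num_partitions


with beam.Pipeline() as p:
  evens, odds = (
      p
      | beam.Create([1, 2, 3, 4])
      | beam.Partition(ByRemainderFn(), 2))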

class apache_beam.transforms.core.ParDo(fn, *args, **kwargs)[source]

Bases: PTransformWithSideInputs

A ParDo transform.

Processes an input PCollection by applying a DoFn to each element and returning the accumulated results into an output PCollection. The type of the elements is not fixed as long as the DoFn can deal with it. In reality the type is constrained to some extent because the elements sometimes must be persisted to external storage. See the expand() method comments for a detailed description of all possible arguments.

Note that the DoFn must return an iterable for each element of the input PCollection. An easy way to do this is to use the yield keyword in the process method.

Parameters:
  • pcoll (PCollection) – a PCollection to be processed.

  • fn (typing.Union[DoFn, typing.Callable]) – a DoFn object to be applied to each element of pcoll argument, or a Callable.

  • *args – positional arguments passed to the DoFn object.

  • **kwargs – keyword arguments passed to the DoFn object.

Note that the positional and keyword arguments will be processed in order to detect PCollections that will be computed as side inputs to the transform. During pipeline execution whenever the DoFn object gets executed (its DoFn.process() method gets called) the PCollection arguments will be replaced by values from the PCollection in the exact positions where they appear in the argument lists.

with_exception_handling(main_tag='good', dead_letter_tag='bad', *, exc_class=<class 'Exception'>, partial=False, use_subprocess=False, threshold=1, threshold_windowing=None, timeout=None, error_handler=None, on_failure_callback: ~typing.Callable[[Exception, ~typing.Any], None] | None = None)[source]

Automatically provides a dead letter output for skipping bad records. This can allow a pipeline to continue successfully rather than fail or continuously throw errors on retry when bad elements are encountered.

This returns a tagged output with two PCollections, the first being the results of successfully processing the input PCollection, and the second being the set of bad records (those which threw exceptions during processing) along with information about the errors raised.

For example, one would write:

good, bad = Map(maybe_error_raising_function).with_exception_handling()

and good will be a PCollection of mapped records and bad will contain those that raised exceptions.

Parameters:
  • main_tag – tag to be used for the main (good) output of the DoFn, useful to avoid possible conflicts if this DoFn already produces multiple outputs. Optional, defaults to ‘good’.

  • dead_letter_tag – tag to be used for the bad records, useful to avoid possible conflicts if this DoFn already produces multiple outputs. Optional, defaults to ‘bad’.

  • exc_class – An exception class, or tuple of exception classes, to catch. Optional, defaults to ‘Exception’.

  • partial – Whether to emit outputs for an element as they’re produced (which could result in partial outputs for a ParDo or FlatMap that throws an error part way through execution) or buffer all outputs until successful processing of the entire element. Optional, defaults to False.

  • use_subprocess – Whether to execute the DoFn logic in a subprocess. This allows one to recover from errors that can crash the calling process (e.g. from an underlying C/C++ library causing a segfault), but is slower as elements and results must cross a process boundary. Note that this starts up a long-running process that is used to handle all the elements (until hard failure, which should be rare) rather than a new process per element, so the overhead should be minimal (and can be amortized if there’s any per-process or per-bundle initialization that needs to be done). Optional, defaults to False.

  • threshold – An upper bound on the ratio of records that can be bad before aborting the entire pipeline. Optional, defaults to 1.0 (meaning up to 100% of records can be bad and the pipeline will still succeed).

  • threshold_windowing – Event-time windowing to use for threshold. Optional, defaults to the windowing of the input.

  • timeout – If the element has not finished processing in timeout seconds, raise a TimeoutError. Defaults to None, meaning no time limit.

  • error_handler – An ErrorHandler that should be used to consume the bad records, rather than returning the good and bad records as a tuple.

  • on_failure_callback – If an element fails or times out, on_failure_callback will be invoked. It will receive the exception and the element being processed as args. In case of a timeout, the exception will be of type TimeoutError. Be careful with this callback: if you set a timeout, it will not apply to the callback, and if the callback fails it will not be retried.
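
A hedged usage sketch, following the pattern above (parse_record is a hypothetical callable that may raise):

import apache_beam as beam


def parse_record(line):  # hypothetical callable that may raise ValueError
  return int(line)


with beam.Pipeline() as p:
  good, bad = (
      p
      | beam.Create(['1', '2', 'oops'])
      | beam.Map(parse_record).with_exception_handling(
          exc_class=ValueError,  # only catch parse errors
          threshold=0.5))  # abort if more than half of the records are bad
  # `good` holds the parsed values; `bad` holds the failing inputs together
  # with information about the errors raised.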

with_error_handler(error_handler, **exception_handling_kwargs)[source]

An alias for with_exception_handling(error_handler=error_handler, …)

This is provided to fit the general ErrorHandler conventions.

default_type_hints()[source]
infer_output_type(input_type)[source]
infer_batch_converters(input_element_type)[source]
make_fn(fn, has_side_inputs)[source]
display_data()[source]
expand(pcoll)[source]
with_outputs(*tags, main=None, allow_unknown_tags=None)[source]

Returns a tagged tuple allowing access to the outputs of a ParDo.

The resulting object supports access to the PCollection associated with a tag (e.g. o.tag, o[tag]) and iterating over the available tags (e.g. for tag in o: ...).

Parameters:
  • *tags – if non-empty, list of valid tags. If a list of valid tags is given, it will be an error to use an undeclared tag later in the pipeline.

  • main – the tag to be used for the main output (which will not have a tag associated with it).

Returns:

An object of type DoOutputsTuple that bundles together all the outputs of a ParDo transform and allows accessing the individual PCollections for each output using an object.tag syntax.

Return type:

DoOutputsTuple

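For illustration, a hedged sketch of tagged outputs (the DoFn and tag names are hypothetical):

import apache_beam as beam


class SplitBySignFn(beam.DoFn):  # hypothetical example
  def process(self, element):
    if element < 0:
      yield beam.pvalue.TaggedOutput('negative', element)
    else:
      yield element  # goes to the main output


with beam.Pipeline() as p:
  results = (
      p
      | beam.Create([-1, 2, -3, 4])
      | beam.ParDo(SplitBySignFn()).with_outputs(
          'negative', main='non_negative'))
  non_negative = results.non_negative  # or results['non_negative']
  negative = results.negative
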
to_runner_api_parameter(context: PipelineContext, **extra_kwargs: Any) Tuple[str, message.Message][source]
static from_runner_api_parameter(unused_ptransform, pardo_payload, context)[source]
runner_api_requires_keyed_input()[source]
get_restriction_coder()[source]

Returns the restriction coder if the DoFn of this ParDo is an SDF.

Returns None otherwise.

apache_beam.transforms.core.FlatMap(fn=<function identity>, *args, **kwargs)[source]

FlatMap() is like ParDo except it takes a callable to specify the transformation.

The callable must return an iterable for each element of the input PCollection. The elements of these iterables will be flattened into the output PCollection. If no callable is given, then all elements of the input PCollection must already be iterables themselves and will be flattened into the output PCollection.

Parameters:
  • fn (callable) – a callable object.

  • *args – positional arguments passed to the transform callable.

  • **kwargs – keyword arguments passed to the transform callable.

Returns:

A PCollection containing the FlatMap() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
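
An illustrative usage sketch (the input values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  words = (
      p
      | beam.Create(['a b', 'c'])
      # The callable returns an iterable per element; the results are
      # flattened into the output PCollection.
      | beam.FlatMap(lambda line: line.split()))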

apache_beam.transforms.core.FlatMapTuple(fn, *args, **kwargs)[source]

FlatMapTuple() is like FlatMap() but expects tuple inputs and flattens them into multiple input arguments.

beam.FlatMapTuple(lambda a, b, …: …)

is equivalent to Python 2

beam.FlatMap(lambda (a, b, …), …: …)

In other words

beam.FlatMapTuple(fn)

is equivalent to

beam.FlatMap(lambda element, …: fn(*element, …))

This can be useful when processing a PCollection of tuples (e.g. key-value pairs).

Parameters:
  • fn (callable) – a callable object.

  • *args – positional arguments passed to the transform callable.

  • **kwargs – keyword arguments passed to the transform callable.

Returns:

A PCollection containing the FlatMapTuple() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
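
An illustrative usage sketch (the input values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  repeated = (
      p
      | beam.Create([('a', 2), ('b', 3)])
      # Each (key, count) tuple is unpacked into the lambda's arguments.
      | beam.FlatMapTuple(lambda key, count: [key] * count))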

apache_beam.transforms.core.Map(fn, *args, **kwargs)[source]

Map() is like FlatMap() except its callable returns only a single element.

Parameters:
  • fn (callable) – a callable object.

  • *args – positional arguments passed to the transform callable.

  • **kwargs – keyword arguments passed to the transform callable.

Returns:

A PCollection containing the Map() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
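
An illustrative usage sketch (the input values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  lengths = (
      p
      | beam.Create(['a', 'bb', 'ccc'])
      # The callable returns exactly one output element per input element.
      | beam.Map(len))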

apache_beam.transforms.core.MapTuple(fn, *args, **kwargs)[source]

MapTuple() is like Map() but expects tuple inputs and flattens them into multiple input arguments.

beam.MapTuple(lambda a, b, …: …)

In other words

beam.MapTuple(fn)

is equivalent to

beam.Map(lambda element, …: fn(*element, …))

This can be useful when processing a PCollection of tuples (e.g. key-value pairs).

Parameters:
  • fn (callable) – a callable object.

  • *args – positional arguments passed to the transform callable.

  • **kwargs – keyword arguments passed to the transform callable.

Returns:

A PCollection containing the MapTuple() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
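
An illustrative usage sketch (the input values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  labels = (
      p
      | beam.Create([('a', 1), ('b', 2)])
      # Each (key, value) tuple is unpacked into the lambda's arguments.
      | beam.MapTuple(lambda key, value: '%s:%d' % (key, value)))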

apache_beam.transforms.core.Filter(fn, *args, **kwargs)[source]

Filter() is a FlatMap() with its callable filtering out elements.

Filter accepts a function that keeps elements that return True, and filters out the remaining elements.

Parameters:
  • fn (Callable[..., bool]) – a callable object. First argument will be an element.

  • *args – positional arguments passed to the transform callable.

  • **kwargs – keyword arguments passed to the transform callable.

Returns:

A PCollection containing the Filter() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
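
An illustrative usage sketch (the input values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  evens = (
      p
      | beam.Create([1, 2, 3, 4])
      # Keeps only the elements for which the callable returns True.
      | beam.Filter(lambda x: x % 2 == 0))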

class apache_beam.transforms.core.CombineGlobally(fn, *args, **kwargs)[source]

Bases: PTransform

A CombineGlobally transform.

Reduces a PCollection to a single value by progressively applying a CombineFn to portions of the PCollection (and to intermediate values created thereby). See documentation in CombineFn for details on how CombineFns are applied.

Parameters:
  • pcoll (PCollection) – a PCollection to be reduced into a single value.

  • fn (callable) – a CombineFn object that will be called to progressively reduce the PCollection into single values, or a callable suitable for wrapping by CallableWrapperCombineFn.

  • *args – positional arguments passed to the CombineFn object.

  • **kwargs – keyword arguments passed to the CombineFn object.

Raises:

TypeError – If the output type of the input PCollection is not compatible with Iterable[A].

Returns:

A single-element PCollection containing the main output of the CombineGlobally transform.

Return type:

PCollection

Note that the positional and keyword arguments will be processed in order to detect PValues that will be computed as side inputs to the transform. During pipeline execution whenever the CombineFn object gets executed (i.e. any of the CombineFn methods get called), the PValue arguments will be replaced by their actual value in the exact position where they appear in the argument lists.
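
An illustrative usage sketch (the input values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  total = (
      p
      | beam.Create([1, 2, 3])
      # A callable such as sum is wrapped as a CombineFn; the result is a
      # single-element PCollection.
      | beam.CombineGlobally(sum))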

has_defaults = True
as_view = False
fanout: Optional[int] = None
display_data()[source]
default_label()[source]
with_fanout(fanout)[source]
with_defaults(has_defaults=True)[source]
without_defaults()[source]
as_singleton_view()[source]
expand(pcoll)[source]
static from_runner_api_parameter(unused_ptransform, combine_payload, context)[source]
class apache_beam.transforms.core.CombinePerKey(fn: WithTypeHints, *args: Any, **kwargs: Any)[source]

Bases: PTransformWithSideInputs

A per-key Combine transform.

Identifies sets of values associated with the same key in the input PCollection, then applies a CombineFn to condense those sets into single values. See documentation in CombineFn for details on how CombineFns are applied.

Parameters:
  • pcoll – input pcollection.

  • fn – instance of CombineFn to apply to all values under the same key in pcoll, or a callable whose signature is f(iterable, *args, **kwargs) (e.g., sum, max).

  • *args – arguments and side inputs, passed directly to the CombineFn.

  • **kwargs – arguments and side inputs, passed directly to the CombineFn.

Returns:

A PObject holding the result of the combine operation.
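
An illustrative usage sketch (the input values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  totals = (
      p
      | beam.Create([('a', 1), ('a', 2), ('b', 5)])
      # Values sharing a key are combined, yielding ('a', 3) and ('b', 5).
      | beam.CombinePerKey(sum))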

with_hot_key_fanout(fanout)[source]

A per-key combine operation like self but with two levels of aggregation.

If a given key is produced by too many upstream bundles, the final reduction can become a bottleneck despite partial combining being lifted pre-GroupByKey. In these cases it can be helpful to perform intermediate partial aggregations in parallel and then re-group to perform a final (per-key) combine. This is also useful for high-volume keys in streaming where combiners are not generally lifted for latency reasons.

Note that a fanout greater than 1 requires the data to be sent through two GroupByKeys, and a high fanout can also result in more shuffle data due to less per-bundle combining. Setting the fanout for a key at 1 or less places values on the “cold key” path that skip the intermediate level of aggregation.

Parameters:

fanout – either None, for no fanout, an int, for a constant-degree fanout, or a callable mapping keys to a key-specific degree of fanout.

Returns:

A per-key combining PTransform with the specified fanout.
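
A hedged usage sketch showing both forms of fanout (the values are illustrative):

import apache_beam as beam

with beam.Pipeline() as p:
  counts = p | beam.Create([('a', 1), ('a', 2), ('b', 5)])
  # A constant fanout applied to every key...
  totals = counts | 'Sum' >> beam.CombinePerKey(sum).with_hot_key_fanout(10)
  # ...or a callable mapping each key to its own degree of fanout.
  totals_hot = (
      counts
      | 'SumHot' >> beam.CombinePerKey(sum).with_hot_key_fanout(
          lambda key: 100 if key == 'a' else 1))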

display_data()[source]
make_fn(fn, has_side_inputs)[source]
default_label()[source]
expand(pcoll)[source]
default_type_hints()[source]
to_runner_api_parameter(context: PipelineContext) Tuple[str, beam_runner_api_pb2.CombinePayload][source]
static from_runner_api_parameter(unused_ptransform, combine_payload, context)[source]
runner_api_requires_keyed_input()[source]
class apache_beam.transforms.core.CombineValues(fn: WithTypeHints, *args: Any, **kwargs: Any)[source]

Bases: PTransformWithSideInputs

make_fn(fn, has_side_inputs)[source]
expand(pcoll)[source]
to_runner_api_parameter(context)[source]
static from_runner_api_parameter(unused_ptransform, combine_payload, context)[source]
class apache_beam.transforms.core.GroupBy(*fields, **kwargs)[source]

Bases: PTransform

Groups a PCollection by one or more expressions, used to derive the key.

GroupBy(expr) is roughly equivalent to

beam.Map(lambda v: (expr(v), v)) | beam.GroupByKey()

but provides several conveniences, e.g.

  • Several arguments may be provided, as positional or keyword arguments, resulting in a tuple-like key. For example GroupBy(a=expr1, b=expr2) groups by a key with attributes a and b computed by applying expr1 and expr2 to each element.

  • Strings can be used as a shorthand for accessing an attribute, e.g. GroupBy('some_field') is equivalent to GroupBy(lambda v: getattr(v, 'some_field')).

The GroupBy operation can be made into an aggregating operation by invoking its aggregate_field method.
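
For illustration, a minimal GroupBy sketch (the expression and values are arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(['banana', 'blueberry', 'cherry'])
      # Groups by the computed key, yielding (key, grouped elements) pairs,
      # e.g. ('b', ['banana', 'blueberry']) and ('c', ['cherry']).
      | beam.GroupBy(lambda name: name[0]))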

aggregate_field(field, combine_fn, dest)[source]

Returns a grouping operation that also aggregates grouped values.

Parameters:
  • field – indicates the field to be aggregated

  • combine_fn – indicates the aggregation function to be used

  • dest – indicates the name that will be used for the aggregate in the output

May be called repeatedly to aggregate multiple fields, e.g.

GroupBy('key')
    .aggregate_field('some_attr', sum, 'sum_attr')
    .aggregate_field(lambda v: ..., MeanCombineFn, 'mean')

force_tuple_keys(value=True)[source]

Forces the keys to always be tuple-like, even if there is only a single expression.

default_label()[source]
expand(pcoll)[source]
class apache_beam.transforms.core.GroupByKey(label: str | None = None)[source]

Bases: PTransform

A group by key transform.

Processes an input PCollection consisting of key/value pairs represented as a tuple pair. The result is a PCollection where values having a common key are grouped together. For example (a, 1), (b, 2), (a, 3) will result in (a, [1, 3]), (b, [2]).

The implementation here is used only when run on the local direct runner.
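
An illustrative usage sketch matching the example above:

import apache_beam as beam

with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create([('a', 1), ('b', 2), ('a', 3)])
      # Yields ('a', [1, 3]) and ('b', [2]); the grouped values are an
      # iterable whose ordering is not guaranteed.
      | beam.GroupByKey())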

class ReifyWindows(*unused_args, **unused_kwargs)[source]

Bases: DoFn

process(element, window=WindowParam, timestamp=TimestampParam)[source]
infer_output_type(input_type)[source]
expand(pcoll)[source]
infer_output_type(input_type)[source]
to_runner_api_parameter(unused_context: PipelineContext) Tuple[str, None][source]
static from_runner_api_parameter(unused_ptransform, unused_payload, unused_context)[source]
runner_api_requires_keyed_input()[source]
class apache_beam.transforms.core.Select(*args, **kwargs)[source]

Bases: PTransform

Converts the elements of a PCollection into a schema’d PCollection of Rows.

Select(…) is roughly equivalent to Map(lambda x: Row(…)) where each argument (which may be a string or callable) of ToRow is applied to x. For example,

pcoll | beam.Select('a', b=lambda x: foo(x))

is the same as

pcoll | beam.Map(lambda x: beam.Row(a=x.a, b=foo(x)))

with_exception_handling(**kwargs)[source]
default_label()[source]
expand(pcoll)[source]
infer_output_type(input_type)[source]
class apache_beam.transforms.core.Partition(fn: WithTypeHints, *args: Any, **kwargs: Any)[source]

Bases: PTransformWithSideInputs

Split a PCollection into several partitions.

Uses the specified PartitionFn to separate an input PCollection into the specified number of sub-PCollections.

When apply()d, a Partition() PTransform requires the following:

Parameters:
  • partitionfn – a PartitionFn, or a callable with the signature described in CallableWrapperPartitionFn.

  • n – number of output partitions.

The result of this PTransform is a simple list of the output PCollections representing each of n partitions, in order.
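
An illustrative sketch using a plain callable as the partition function (the threshold is arbitrary):

import apache_beam as beam

with beam.Pipeline() as p:
  small, large = (
      p
      | beam.Create([1, 5, 9, 13])
      # The callable receives (element, num_partitions) and returns the
      # index of the partition that should receive the element.
      | beam.Partition(lambda n, num_partitions: 0 if n < 10 else 1, 2))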

class ApplyPartitionFnFn(*unused_args, **unused_kwargs)[source]

Bases: DoFn

A DoFn that applies a PartitionFn.

process(element, partitionfn, n, *args, **kwargs)[source]
make_fn(fn, has_side_inputs)[source]
expand(pcoll)[source]
class apache_beam.transforms.core.Windowing(windowfn, triggerfn=None, accumulation_mode=None, timestamp_combiner=None, allowed_lateness=0, environment_id=None)[source]

Bases: object

Class representing the window strategy.

Parameters:
  • windowfn – Window assign function.

  • triggerfn – Trigger function.

  • accumulation_mode – a AccumulationMode, controls what to do with data when a trigger fires multiple times.

  • timestamp_combiner – a TimestampCombiner, determines how output timestamps of grouping operations are assigned.

  • allowed_lateness – Maximum delay in seconds after end of window allowed for any late data to be processed without being discarded directly.

  • environment_id – Environment where the current window_fn should be applied in.

is_default()[source]
to_runner_api(context: PipelineContext) beam_runner_api_pb2.WindowingStrategy[source]
static from_runner_api(proto, context)[source]
class apache_beam.transforms.core.WindowInto(windowfn, trigger=None, accumulation_mode=None, timestamp_combiner=None, allowed_lateness=0)[source]

Bases: ParDo

A window transform assigning windows to each element of a PCollection.

Transforms an input PCollection by applying a windowing function to each element. Each transformed element in the result will be a WindowedValue element with the same input value and timestamp, with its new set of windows determined by the windowing function.

Initializes a WindowInto transform.

Parameters:
  • windowfn (Windowing, WindowFn) – Function to be used for windowing.

  • trigger – (optional) Trigger used for windowing, or None for default.

  • accumulation_mode – (optional) Accumulation mode used for windowing, required for non-trivial triggers.

  • timestamp_combiner – (optional) Timestamp combiner used for windowing, or None for default.
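
An illustrative usage sketch (the window size is arbitrary):

import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
  windowed = (
      p
      | beam.Create([('a', 1)])
      # Assigns each element to a 60-second fixed window based on its timestamp.
      | beam.WindowInto(window.FixedWindows(60)))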

class WindowIntoFn(windowing: Windowing)[source]

Bases: DoFn

A DoFn that applies a WindowInto operation.

process(element, timestamp=TimestampParam, window=WindowParam)[source]
infer_output_type(input_type)[source]
get_windowing(unused_inputs: Any) Windowing[source]
infer_output_type(input_type)[source]
expand(pcoll)[source]
to_runner_api_parameter(context: PipelineContext, **extra_kwargs: Any) Tuple[str, message.Message][source]
static from_runner_api_parameter(unused_ptransform, proto, context)[source]
class apache_beam.transforms.core.Flatten(**kwargs)[source]

Bases: PTransform

Merges several PCollections into a single PCollection.

Copies all elements in 0 or more PCollections into a single output PCollection. If there are no input PCollections, the resulting PCollection will be empty (but see also kwargs below).

Parameters:

**kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily Flatten can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information and should be considered mandatory.
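
An illustrative usage sketch:

import apache_beam as beam

with beam.Pipeline() as p:
  first = p | 'First' >> beam.Create([1, 2])
  second = p | 'Second' >> beam.Create([3])
  # Merges the inputs into one PCollection; pass pipeline=p to Flatten if
  # the input tuple might be empty.
  merged = (first, second) | beam.Flatten()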

expand(pcolls)[source]
infer_output_type(input_type)[source]
to_runner_api_parameter(context: PipelineContext) Tuple[str, None][source]
static from_runner_api_parameter(unused_ptransform, unused_parameter, unused_context)[source]
class apache_beam.transforms.core.Create(values, reshuffle=True)[source]

Bases: PTransform

A transform that creates a PCollection from an iterable.

Initializes a Create transform.

Parameters:

values – An object of values for the PCollection
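
An illustrative usage sketch:

import apache_beam as beam

with beam.Pipeline() as p:
  # Materializes the given iterable as a PCollection of three elements.
  numbers = p | beam.Create([1, 2, 3])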

to_runner_api_parameter(context: PipelineContext) Tuple[str, bytes][source]
infer_output_type(unused_input_type)[source]
get_output_type()[source]
expand(pbegin)[source]
as_read()[source]
get_windowing(unused_inputs: Any) Windowing[source]
class apache_beam.transforms.core.Impulse(label: str | None = None)[source]

Bases: PTransform

Impulse primitive.

expand(pbegin)[source]
get_windowing(inputs: Any) Windowing[source]
infer_output_type(unused_input_type)[source]
to_runner_api_parameter(unused_context: PipelineContext) Tuple[str, None][source]
static from_runner_api_parameter(unused_ptransform, unused_parameter, unused_context)[source]
class apache_beam.transforms.core.RestrictionProvider[source]

Bases: object

Provides methods for generating and manipulating restrictions.

This class should be implemented to support Splittable DoFn in Python SDK. See https://s.apache.org/splittable-do-fn for more details about Splittable DoFn.

To denote a DoFn class as a Splittable DoFn, the DoFn.process() method of that class should have exactly one parameter whose default value is an instance of RestrictionParam. This RestrictionParam can either be constructed with an explicit RestrictionProvider, or, if no RestrictionProvider is provided, the DoFn itself must be a RestrictionProvider.

The provided RestrictionProvider instance must provide suitable overrides for the following methods: create_tracker(), initial_restriction() and restriction_size().

Optionally, RestrictionProvider may override the default implementations of the following methods: restriction_coder(), split(), split_and_size() and truncate().

Pausing and resuming processing of an element

As the last element produced by the iterator returned by the DoFn.process() method, a Splittable DoFn may return an object of type ProcessContinuation.

If restriction_tracker.defer_remainder is called in DoFn.process(), it indicates that the runner should later re-invoke the DoFn.process() method to resume processing the current element, and specifies the manner in which the re-invocation should be performed.

Updating output watermark

The DoFn.process() method of a Splittable DoFn could contain a parameter with default value DoFn.WatermarkReporterParam. If specified, this asks the runner to provide a function that can be used to give the runner a (best-effort) lower bound about the timestamps of future output associated with the current element processed by the DoFn. If the DoFn has multiple outputs, the watermark applies to all of them. The provided function must be invoked with a single parameter of type Timestamp, or an integer that gives the watermark in number of seconds.
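
For illustration, a minimal Splittable DoFn sketch; the class names are hypothetical, and it assumes the OffsetRange and OffsetRestrictionTracker helpers from apache_beam.io.restriction_trackers:

import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange
from apache_beam.io.restriction_trackers import OffsetRestrictionTracker


class FixedRangeProvider(beam.transforms.core.RestrictionProvider):
  # Hypothetical provider: every element gets the offset range [0, 100).
  def initial_restriction(self, element):
    return OffsetRange(0, 100)

  def create_tracker(self, restriction):
    return OffsetRestrictionTracker(restriction)

  def restriction_size(self, element, restriction):
    return restriction.size()


class EmitOffsetsFn(beam.DoFn):  # hypothetical example
  def process(
      self,
      element,
      tracker=beam.DoFn.RestrictionParam(FixedRangeProvider())):
    restriction = tracker.current_restriction()
    for position in range(restriction.start, restriction.stop):
      if not tracker.try_claim(position):
        # The runner split off the remainder of the restriction; stop here.
        return
      yield element, position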

create_tracker(restriction) iobase.RestrictionTracker[source]

Produces a new RestrictionTracker for the given restriction.

This API is required to be implemented.

Parameters:

restriction – an object that defines a restriction as identified by a Splittable DoFn that utilizes the current RestrictionProvider. For example, a tuple that gives a range of positions for a Splittable DoFn that reads files based on byte positions.

Returns: an object of type RestrictionTracker.

initial_restriction(element)[source]

Produces an initial restriction for the given element.

This API is required to be implemented.

split(element, restriction)[source]

Splits the given element and restriction initially.

This method enables runners to perform bulk splitting initially allowing for a rapid increase in parallelism. Note that initial split is a different concept from the split during element processing time. Please refer to iobase.RestrictionTracker.try_split for details about splitting when the current element and restriction are actively being processed.

Returns an iterator of restrictions. The total set of elements produced by reading the input element for each of the returned restrictions should be the same as the total set of elements produced by reading the input element for the input restriction.

This API is optional if split_and_size has been implemented.

If this method is not overridden, no initial splitting happens on each restriction.

restriction_coder()[source]

Returns a Coder for restrictions.

The returned Coder will be used for the restrictions produced by the current RestrictionProvider.

Returns:

an object of type Coder.

restriction_size(element, restriction)[source]

Returns the size of a restriction with respect to the given element.

By default, asks a newly-created restriction tracker for the default size of the restriction.

The return value must be non-negative.

Must be thread safe. Will be invoked concurrently during bundle processing due to runner initiated splitting and progress estimation.

This API is required to be implemented.

split_and_size(element, restriction)[source]

Like split, but also does sizing, returning (restriction, size) pairs.

For each pair, size must be non-negative.

This API is optional if split and restriction_size have been implemented.

truncate(element, restriction)[source]

Truncates the provided restriction into a restriction representing a finite amount of work when the pipeline is draining (see https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8/edit# for additional details about drain). By default, if the restriction is bounded then the restriction will be returned; otherwise None will be returned.

This API is optional and should only be implemented if more granularity is required.

Return a truncated finite restriction if further processing is required; otherwise return None to indicate that no further processing of this restriction is required.

The default behavior when a pipeline is being drained is that bounded restrictions process entirely while unbounded restrictions process till a checkpoint is possible.

class apache_beam.transforms.core.WatermarkEstimatorProvider[source]

Bases: object

Provides methods for generating WatermarkEstimator.

This class should be implemented if you want to provide output_watermark information within an SDF.

To give SDF.process() access to a WatermarkEstimator, the SDF author should declare an argument whose default value is a DoFn.WatermarkEstimatorParam instance. This DoFn.WatermarkEstimatorParam can either be constructed with an explicit WatermarkEstimatorProvider, or, if no WatermarkEstimatorProvider is provided, the DoFn itself must be a WatermarkEstimatorProvider.

initial_estimator_state(element, restriction)[source]

Returns the initial state of the WatermarkEstimator for the given element and restriction. This function is called by the system.

create_watermark_estimator(estimator_state)[source]

Create a new WatermarkEstimator based on the state. The state is typically useful when resuming processing an element.

estimator_state_coder()[source]