apache_beam.dataframe.transforms module

class apache_beam.dataframe.transforms.DataframeTransform(func, proxy=None, yield_elements='schemas', include_indexes=False)[source]

Bases: PTransform

A PTransform for applying function that takes and returns dataframes to one or more PCollections.

DataframeTransform will accept a PCollection with a schema and batch it into DataFrame instances if necessary:

(pcoll | beam.Select(key=..., foo=..., bar=...)
       | DataframeTransform(lambda df: df.group_by('key').sum()))

It is also possible to process a PCollection of DataFrame instances directly, in this case a “proxy” must be provided. For example, if pcoll is a PCollection of DataFrames, one could write:

pcoll | DataframeTransform(lambda df: df.group_by('key').sum(), proxy=...)

To pass multiple PCollections, pass a tuple of PCollections wich will be passed to the callable as positional arguments, or a dictionary of PCollections, in which case they will be passed as keyword arguments.

Parameters:
  • yield_elements – (optional, default: “schemas”) If set to "pandas", return PCollection(s) containing the raw Pandas objects (DataFrame or Series as appropriate). If set to "schemas", return an element-wise PCollection, where DataFrame and Series instances are expanded to one element per row. DataFrames are converted to schema-aware PCollections, where column values can be accessed by attribute.

  • include_indexes – (optional, default: False) When yield_elements="schemas", if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.

  • proxy – (optional) An empty DataFrame or Series instance with the same dtype and name as the elements of the input PCollection. Required when input PCollection DataFrame or Series elements. Ignored when input PCollection has a schema.

expand(input_pcolls)[source]