apache_beam.dataframe.transforms module
- class apache_beam.dataframe.transforms.DataframeTransform(func, proxy=None, yield_elements='schemas', include_indexes=False)
  Bases: PTransform
A PTransform for applying a function that takes and returns dataframes to one or more PCollections.

DataframeTransform will accept a PCollection with a schema and batch it into DataFrame instances if necessary:

    (pcoll | beam.Select(key=..., foo=..., bar=...)
           | DataframeTransform(lambda df: df.groupby('key').sum()))
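A minimal, self-contained sketch of this schema-aware path (the field names and values below are invented for illustration): building the input from beam.Row elements gives the PCollection a schema, so DataframeTransform can batch it into DataFrames before applying the callable.

    # Invented data: a schema-aware PCollection is batched into DataFrames,
    # the callable runs on a deferred DataFrame, and the default
    # yield_elements="schemas" expands the result back to one element per row.
    import apache_beam as beam
    from apache_beam.dataframe.transforms import DataframeTransform

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([
                beam.Row(key='a', value=1),
                beam.Row(key='b', value=2),
                beam.Row(key='a', value=3),
            ])
            | DataframeTransform(lambda df: df.groupby('key').sum())
            # 'key' becomes the index of the grouped result and is dropped by
            # the default include_indexes=False, so output rows carry only 'value'.
            | beam.Map(print))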
It is also possible to process a PCollection of DataFrame instances directly; in this case a "proxy" must be provided. For example, if pcoll is a PCollection of DataFrames, one could write:

    pcoll | DataframeTransform(lambda df: df.groupby('key').sum(), proxy=...)
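A hedged sketch of this proxy path, assuming the input elements are already pandas DataFrames with a string 'key' column and an integer 'value' column (both invented for illustration); the proxy is an empty DataFrame with matching dtypes:

    # Invented column names/dtypes: the input PCollection already contains
    # pandas DataFrames, so a proxy (an empty DataFrame with the same dtypes)
    # is required; yield_elements='pandas' keeps the raw pandas output.
    import apache_beam as beam
    import pandas as pd
    from apache_beam.dataframe.transforms import DataframeTransform

    proxy = pd.DataFrame({'key': pd.Series(dtype=str),
                          'value': pd.Series(dtype=int)})

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([
                pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]}),
                pd.DataFrame({'key': ['a'], 'value': [3]}),
            ])
            | DataframeTransform(
                lambda df: df.groupby('key').sum(),
                proxy=proxy,
                yield_elements='pandas')  # emit pandas objects, not row elements
            | beam.Map(print))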
To pass multiple PCollections, pass a tuple of PCollections, which will be passed to the callable as positional arguments, or a dictionary of PCollections, in which case they will be passed as keyword arguments.
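For the multi-input case, a sketch assuming two schema-aware PCollections that share a 'user' column (all field names and values invented); passing them as a dict makes the callable receive them as keyword arguments named after the dict keys:

    # Invented fields: a dict of PCollections is passed to the callable as
    # keyword arguments, each bound to a deferred DataFrame; the two inputs
    # are joined on the shared 'user' column.
    import apache_beam as beam
    from apache_beam.dataframe.transforms import DataframeTransform

    with beam.Pipeline() as p:
        orders = p | 'Orders' >> beam.Create([
            beam.Row(user='a', amount=10),
            beam.Row(user='b', amount=5)])
        users = p | 'Users' >> beam.Create([
            beam.Row(user='a', region='us'),
            beam.Row(user='b', region='eu')])

        _ = (
            {'orders': orders, 'users': users}
            | DataframeTransform(
                lambda orders, users: orders.merge(users, on='user'))
            | beam.Map(print))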
- Parameters:
  - yield_elements – (optional, default: "schemas") If set to "pandas", return PCollection(s) containing the raw Pandas objects (DataFrame or Series as appropriate). If set to "schemas", return an element-wise PCollection, where DataFrame and Series instances are expanded to one element per row. DataFrames are converted to schema-aware PCollections, where column values can be accessed by attribute.
  - include_indexes – (optional, default: False) When yield_elements="schemas", if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
  - proxy – (optional) An empty DataFrame or Series instance with the same dtype and name as the elements of the input PCollection. Required when the input PCollection contains DataFrame or Series elements. Ignored when the input PCollection has a schema.
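As a sketch of the include_indexes behaviour (field names invented): grouping moves the 'key' column into the index of the result, and include_indexes=True carries that named, unique index level back into the output schema instead of dropping it:

    # Invented fields: with the default yield_elements="schemas", the grouped
    # result's index level 'key' is carried into the output schema because
    # include_indexes=True and the level has a unique, non-None name.
    import apache_beam as beam
    from apache_beam.dataframe.transforms import DataframeTransform

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([
                beam.Row(key='a', value=1),
                beam.Row(key='a', value=2),
                beam.Row(key='b', value=5),
            ])
            | DataframeTransform(
                lambda df: df.groupby('key').sum(),
                include_indexes=True)  # output rows expose both 'key' and 'value'
            | beam.Map(print))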