apache_beam.io.avroio module
PTransforms
for reading from and writing to Avro files.
Provides two read PTransform``s, ``ReadFromAvro
and ReadAllFromAvro
,
that produces a PCollection
of records.
Each record of this PCollection
will contain a single record read from
an Avro file. Records that are of simple types will be mapped into
corresponding Python types. Records that are of Avro type ‘RECORD’ will be
mapped to Python dictionaries that comply with the schema contained in the
Avro file that contains those records. In this case, keys of each dictionary
will contain the corresponding field names and will be of type string
while the values of the dictionary will be of the type defined in the
corresponding Avro schema.
For example, if schema of the Avro file is the following. {“namespace”: “example.avro”,”type”: “record”,”name”: “User”,”fields”: [{“name”: “name”, “type”: “string”}, {“name”: “favorite_number”, “type”: [“int”, “null”]}, {“name”: “favorite_color”, “type”: [“string”, “null”]}]}
Then records generated by read transforms will be dictionaries of the following form. {‘name’: ‘Alyssa’, ‘favorite_number’: 256, ‘favorite_color’: None}).
Additionally, this module provides a write PTransform
WriteToAvro
that can be used to write a given PCollection
of Python objects to an
Avro file.
- class apache_beam.io.avroio.ReadFromAvro(file_pattern=None, min_bundle_size=0, validate=True, use_fastavro=True, as_rows=False)[source]
Bases:
PTransform
A PTransform for reading records from avro files.
Each record of the resulting PCollection will contain a single record read from a source. Records that are of simple types will be mapped to beam Rows with a single record field containing the records value. Records that are of Avro type
RECORD
will be mapped to Beam rows that comply with the schema contained in the Avro file that contains those records.Initializes
ReadFromAvro
.Uses source
_AvroSource
to read a set of Avro files defined by a given file pattern.If
/mypath/myavrofiles*
is a file-pattern that points to a set of Avro files, aPCollection
for the records in these Avro files can be created in the following manner.with beam.Pipeline() as p: records = p | 'Read' >> beam.io.ReadFromAvro('/mypath/myavrofiles*')
Each record of this
PCollection
will contain a single record read from a source. Records that are of simple types will be mapped into corresponding Python types. Records that are of Avro typeRECORD
will be mapped to Python dictionaries that comply with the schema contained in the Avro file that contains those records. In this case, keys of each dictionary will contain the corresponding field names and will be of typestr
while the values of the dictionary will be of the type defined in the corresponding Avro schema.For example, if schema of the Avro file is the following.
{ "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
Then records generated by
_AvroSource
will be dictionaries of the following form.{'name': 'Alyssa', 'favorite_number': 256, 'favorite_color': None}).
- Parameters:
file_pattern (str) – the file glob to read
min_bundle_size (int) – the minimum size in bytes, to be considered when splitting the input into bundles.
validate (bool) – flag to verify that the files exist during the pipeline creation time.
use_fastavro (bool) – This flag is left for API backwards compatibility and no longer has an effect. Do not use.
as_rows (bool) – Whether to return a schema’d PCollection of Beam rows.
- class apache_beam.io.avroio.ReadAllFromAvro(min_bundle_size=0, desired_bundle_size=67108864, use_fastavro=True, with_filename=False, label='ReadAllFiles')[source]
Bases:
PTransform
A
PTransform
for readingPCollection
of Avro files.Uses source ‘_AvroSource’ to read a
PCollection
of Avro files or file patterns and produce aPCollection
of Avro records.This implementation is only tested with batch pipeline. In streaming, reading may happen with delay due to the limitation in ReShuffle involved.
Initializes
ReadAllFromAvro
.- Parameters:
min_bundle_size – the minimum size in bytes, to be considered when splitting the input into bundles.
desired_bundle_size – the desired size in bytes, to be considered when splitting the input into bundles.
use_fastavro (bool) – This flag is left for API backwards compatibility and no longer has an effect. Do not use.
with_filename – If True, returns a Key Value with the key being the file name and the value being the actual data. If False, it only returns the data.
- DEFAULT_DESIRED_BUNDLE_SIZE = 67108864
- class apache_beam.io.avroio.ReadAllFromAvroContinuously(file_pattern, label='ReadAllFilesContinuously', **kwargs)[source]
Bases:
ReadAllFromAvro
A
PTransform
for reading avro files in given file patterns. This PTransform acts as a Source and produces continuously aPCollection
of Avro records.For more details, see
ReadAllFromAvro
for avro parsing settings; seeapache_beam.io.fileio.MatchContinuously
for watching settings.ReadAllFromAvroContinuously is experimental. No backwards-compatibility guarantees. Due to the limitation on Reshuffle, current implementation does not scale.
Initialize the
ReadAllFromAvroContinuously
transform.Accepts args for constructor args of both
ReadAllFromAvro
andMatchContinuously
.
- class apache_beam.io.avroio.WriteToAvro(file_path_prefix, schema=None, codec='deflate', file_name_suffix='', num_shards=0, shard_name_template=None, mime_type='application/x-avro', use_fastavro=True)[source]
Bases:
PTransform
A
PTransform
for writing avro files.If the input has a schema, a corresponding avro schema will be automatically generated and used to write the output records.
Initialize a WriteToAvro transform.
- Parameters:
file_path_prefix – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix. In most cases, only this argument is specified and num_shards, shard_name_template, and file_name_suffix use default values.
schema – The schema to use (dict).
codec – The codec to use for block-level compression. Any string supported by the Avro specification is accepted (for example ‘null’).
file_name_suffix – Suffix for the files written.
num_shards – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards. Constraining the number of shards is likely to reduce the performance of a pipeline. Setting this value is not recommended unless you require a specific number of output files.
shard_name_template – A template string containing placeholders for the shard number and shard count. When constructing a filename for a particular shard number, the upper-case letters ‘S’ and ‘N’ are replaced with the 0-padded shard number and shard count respectively. This argument can be ‘’ in which case it behaves as if num_shards was set to 1 and only one file will be generated. The default pattern used is ‘-SSSSS-of-NNNNN’ if None is passed as the shard_name_template.
mime_type – The MIME type to use for the produced files, if the filesystem supports specifying MIME types.
use_fastavro (bool) – This flag is left for API backwards compatibility and no longer has an effect. Do not use.
- Returns:
A WriteToAvro transform usable for writing.