apache_beam.io.filebasedsource module
A framework for developing sources for new file types.
To create a source for a new file type a sub-class of FileBasedSource
should be created. Sub-classes of FileBasedSource
must implement the
method FileBasedSource.read_records()
. Please read the documentation of
that method for more details.
For an example implementation of FileBasedSource
see
_AvroSource
.
- class apache_beam.io.filebasedsource.FileBasedSource(file_pattern, min_bundle_size=0, compression_type='auto', splittable=True, validate=True)[source]
Bases:
BoundedSource
A
BoundedSource
for reading a file glob of a given type.Initializes
FileBasedSource
.- Parameters:
file_pattern (str) – the file glob to read a string or a
ValueProvider
(placeholder to inject a runtime value).min_bundle_size (int) – minimum size of bundles that should be generated when performing initial splitting on this source.
compression_type (str) – Used to handle compressed output files. Typical value is
CompressionTypes.AUTO
, in which case the final file path’s extension will be used to detect the compression.splittable (bool) – whether
FileBasedSource
should try to logically split a single file into data ranges so that different parts of the same file can be read in parallel. If set toFalse
,FileBasedSource
will prevent both initial and dynamic splitting of sources for single files. File patterns that represent multiple files may still get split into sources for individual files. Even if set toTrue
by the user,FileBasedSource
may choose to not split the file, for example, for compressed files where currently it is not possible to efficiently read a data range without decompressing the whole file.validate (bool) – Boolean flag to verify that the files exist during the pipeline creation time.
- Raises:
TypeError – when compression_type is not valid or if file_pattern is not a
str
or aValueProvider
.ValueError – when compression and splittable files are specified.
IOError – when the file pattern specified yields an empty result.
- MIN_NUMBER_OF_FILES_TO_STAT = 100
- MIN_FRACTION_OF_FILES_TO_STAT = 0.01
- read_records(file_name, offset_range_tracker)[source]
Returns a generator of records created by reading file ‘file_name’.
- Parameters:
file_name – a
string
that gives the name of the file to be read. MethodFileBasedSource.open_file()
must be used to open the file and create a seekable file object.offset_range_tracker – a object of type
OffsetRangeTracker
. This defines the byte range of the file that should be read. See documentation iniobase.BoundedSource.read()
for more information on reading records while complying to the range defined by a givenRangeTracker
.
- Returns:
an iterator that gives the records read from the given file.
- property splittable