apache_beam.io.hadoopfilesystem module
FileSystem
implementation for accessing
Hadoop Distributed File System files.
- class apache_beam.io.hadoopfilesystem.HadoopFileSystem(pipeline_options)[source]
Bases:
FileSystem
FileSystem
implementation that supports HDFS.URL arguments to methods expect strings starting with
hdfs://
.Initializes a connection to HDFS.
Connection configuration is done by passing pipeline options. See
HadoopFileSystemOptions
.- join(base_url, *paths)[source]
Join two or more pathname components.
- Parameters:
base_url – string path of the first component of the path. Must start with hdfs://.
paths – path components to be added
- Returns:
Full url after combining all the passed components.
- create(url, mime_type='application/octet-stream', compression_type='auto') BinaryIO [source]
- Returns:
A Python File-like object.
- open(url, mime_type='application/octet-stream', compression_type='auto') BinaryIO [source]
- Returns:
A Python File-like object.
- copy(source_file_names, destination_file_names)[source]
It is an error if any file to copy already exists at the destination.
Raises
BeamIOError
if any error occurred.- Parameters:
source_file_names – iterable of URLs.
destination_file_names – iterable of URLs.
- exists(url: str) bool [source]
Checks existence of url in HDFS.
- Parameters:
url – String in the form hdfs://…
- Returns:
True if url exists as a file or directory in HDFS.
- size(url)[source]
Fetches file size for a URL.
- Returns:
int size of path according to the FileSystem.
- Raises:
BeamIOError – if url doesn’t exist.
- last_updated(url)[source]
Fetches last updated time for a URL.
- Parameters:
url – string url of file.
Returns: float UNIX Epoch time
- Raises:
BeamIOError – if path doesn’t exist.
- checksum(url)[source]
Fetches a checksum description for a URL.
- Returns:
String describing the checksum.
- Raises:
BeamIOError – if url doesn’t exist.