apache_beam.io.aws.s3io module

AWS S3 client

apache_beam.io.aws.s3io.parse_s3_path(s3_path, object_optional=False)[source]

Return the bucket and object names of the given s3:// path.
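The parsing contract can be sketched with a minimal re-implementation. The helper below is illustrative, not the actual Beam source; it assumes the documented behavior of raising `ValueError` when the object part is missing and `object_optional` is False.

```python
import re

# Minimal sketch of parse_s3_path's contract (illustrative, not the
# actual Beam implementation).
def parse_s3_path(s3_path, object_optional=False):
    """Return (bucket, object) for an s3://<bucket>/<object> path."""
    match = re.match(r'^s3://([^/]+)/(.*)$', s3_path)
    if match is None or (match.group(2) == '' and not object_optional):
        raise ValueError('S3 path must be in the form s3://<bucket>/<object>.')
    return match.group(1), match.group(2)

bucket, obj = parse_s3_path('s3://my-bucket/path/to/file.txt')
# bucket == 'my-bucket', obj == 'path/to/file.txt'
```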

class apache_beam.io.aws.s3io.S3IO(client=None, options=None)[source]

Bases: object

S3 I/O client.

open(filename, mode='r', read_buffer_size=16777216, mime_type='application/octet-stream')[source]

Open an S3 file path for reading or writing.

Parameters:
  • filename (str) – S3 file path in the form s3://<bucket>/<object>.

  • mode (str) – 'r' for reading or 'w' for writing.

  • read_buffer_size (int) – Buffer size to use during read operations.

  • mime_type (str) – Mime type to set for write operations.

Returns:

S3 file object.

Raises:

ValueError – Invalid open file mode.

list_prefix(path, with_metadata=False)[source]

Lists files matching the prefix.

list_prefix has been deprecated. Use list_files instead, which returns a generator of file information instead of a dict.

Parameters:
  • path – S3 file path pattern in the form s3://<bucket>/[name].

  • with_metadata – Experimental. Specifies whether to return file metadata.

Returns:

If with_metadata is False: dict of file name -> size.

If with_metadata is True: dict of file name -> tuple(size, timestamp).

list_files(path, with_metadata=False)[source]

Lists files matching the prefix.

Parameters:
  • path – S3 file path pattern in the form s3://<bucket>/[name].

  • with_metadata – Experimental. Specifies whether to return file metadata.

Returns:

If with_metadata is False: generator of tuple(file name, size).

If with_metadata is True: generator of tuple(file name, tuple(size, timestamp)).
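Since list_files yields results lazily, the dict that the deprecated list_prefix returned can be recovered by materializing the generator. The stub generator below stands in for a real S3IO instance; its file names and sizes are hypothetical sample data.

```python
# Stub standing in for S3IO.list_files: yields (name, size) tuples, or
# (name, (size, timestamp)) tuples when with_metadata is True.
def fake_list_files(path, with_metadata=False):
    entries = [('s3://bucket/a.txt', 10), ('s3://bucket/b.txt', 20)]
    for name, size in entries:
        yield (name, (size, 1700000000.0)) if with_metadata else (name, size)

# Recover the deprecated list_prefix shape (dict of name -> size).
sizes = dict(fake_list_files('s3://bucket/'))
# {'s3://bucket/a.txt': 10, 's3://bucket/b.txt': 20}
```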

checksum(path)[source]

Looks up the checksum of an S3 object.

Parameters:

path – S3 file path in the form s3://<bucket>/<name>.

copy(src, dest)[source]

Copies a single S3 file object from src to dest.

Parameters:
  • src – S3 file path in the form s3://<bucket>/<name>.

  • dest – S3 file path in the form s3://<bucket>/<name>.

Raises:

TimeoutError – on timeout.

copy_paths(src_dest_pairs)[source]

Copies the given S3 objects from src to dest. This can handle directory or file paths.

Parameters:

src_dest_pairs – list of (src, dest) tuples of s3://<bucket>/<name> file paths to copy from src to dest

Returns:

List of tuples of (src, dest, exception) in the same order as the src_dest_pairs argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
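Because copy_paths reports per-pair outcomes instead of raising on the first failure, callers typically split the result list afterwards. A sketch of that pattern, using a hypothetical sample result list in the documented (src, dest, exception) shape:

```python
# Hypothetical copy_paths result: one success, one failure.
results = [
    ('s3://b/src1', 's3://b/dst1', None),
    ('s3://b/src2', 's3://b/dst2', KeyError('src2')),
]

# Partition outcomes: exception is None on success, the raised
# exception otherwise.
succeeded = [(src, dest) for src, dest, exc in results if exc is None]
failures = [(src, dest, exc) for src, dest, exc in results if exc is not None]
```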

copy_tree(src, dest)[source]

Copies the given S3 directory and its contents recursively from src to dest.

Parameters:
  • src – S3 directory path in the form s3://<bucket>/<name>/.

  • dest – S3 directory path in the form s3://<bucket>/<name>/.

Returns:

List of tuples of (src, dest, exception) where exception is None if the operation succeeded or the relevant exception if the operation failed.

delete(path)[source]

Deletes a single S3 file object.

Parameters:

path – S3 file path in the form s3://<bucket>/<name>.

delete_paths(paths)[source]

Deletes the given S3 objects. This can handle directory or file paths.

Parameters:

paths – list of S3 paths in the form s3://<bucket>/<name> or s3://<bucket>/<name>/ to delete.

Returns:

List of tuples of (path, exception) in the same order as the paths argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.

delete_files(paths, max_batch_size=1000)[source]

Deletes the given S3 file objects.

Parameters:
  • paths – List of S3 file paths in the form s3://<bucket>/<name>

  • max_batch_size – Largest number of keys to send to the client to be deleted simultaneously.

Returns:

List of tuples of (path, exception) in the same order as the paths argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
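The batching that max_batch_size implies can be sketched as plain list chunking; the default of 1000 matches the key limit of S3's DeleteObjects API. The helper below is illustrative, not the actual Beam implementation.

```python
# Sketch of delete_files batching: keys are sent to the client in
# chunks of at most max_batch_size (illustrative helper).
def chunks(paths, max_batch_size=1000):
    for i in range(0, len(paths), max_batch_size):
        yield paths[i:i + max_batch_size]

batches = list(chunks(['s3://bucket/key%d' % i for i in range(2500)], 1000))
# 3 batches of sizes 1000, 1000, 500
```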

delete_tree(root)[source]

Deletes all objects under the given S3 directory.

Parameters:

root – S3 root path in the form s3://<bucket>/<name>/ (ending with a “/”)

Returns:

List of tuples of (path, exception), where each path is an object under the given root. exception is None if the operation succeeded or the relevant exception if the operation failed.

size(path)[source]

Returns the size of a single S3 object.

This method does not perform glob expansion. Hence the given path must be for a single S3 object.

Returns:

Size of the S3 object in bytes.

rename(src, dest)[source]

Renames the given S3 object from src to dest.

Parameters:
  • src – S3 file path in the form s3://<bucket>/<name>.

  • dest – S3 file path in the form s3://<bucket>/<name>.

last_updated(path)[source]

Returns the last updated epoch time of a single S3 object.

This method does not perform glob expansion. Hence the given path must be for a single S3 object.

Returns:

Last updated time of the S3 object in seconds since the epoch.
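Since the return value is epoch seconds, converting it to a readable timestamp is a one-liner with the standard library. A small sketch (the conversion helper is illustrative, not part of this module):

```python
from datetime import datetime, timezone

# Convert last_updated's epoch-seconds return value to an aware UTC
# datetime (illustrative helper, not part of s3io).
def to_utc(epoch_seconds):
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

stamp = to_utc(0).isoformat()
# '1970-01-01T00:00:00+00:00'
```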

exists(path)[source]

Returns whether the given S3 object exists.

Parameters:

path – S3 file path in the form s3://<bucket>/<name>.

rename_files(src_dest_pairs)[source]

Renames the given S3 objects from src to dest.

Parameters:

src_dest_pairs – list of (src, dest) tuples of s3://<bucket>/<name> file paths to rename from src to dest

Returns:

List of tuples of (src, dest, exception) in the same order as the src_dest_pairs argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.

class apache_beam.io.aws.s3io.S3Downloader(client, path, buffer_size)[source]

Bases: Downloader

property size
get_range(start, end)[source]
class apache_beam.io.aws.s3io.S3Uploader(client, path, mime_type='application/octet-stream')[source]

Bases: Uploader

put(data)[source]
finish()[source]