apache_beam.io.gcp.gcsio module

Google Cloud Storage client.

This library evolved from the Google App Engine GCS client available at https://github.com/GoogleCloudPlatform/appengine-gcs-client.

Updates to the I/O connector code

For any significant updates to this I/O connector, please consider involving corresponding code reviewers mentioned in https://github.com/apache/beam/blob/master/sdks/python/OWNERS

class apache_beam.io.gcp.gcsio.GcsIO(storage_client: Client | None = None, pipeline_options: dict | PipelineOptions | None = None)[source]

Bases: object

Google Cloud Storage I/O client.

get_project_number(bucket)[source]

get_bucket(bucket_name, **kwargs)[source]

Returns an object bucket from its name, or None if it does not exist.

create_bucket(bucket_name, project, kms_key=None, location=None, soft_delete_retention_duration_seconds=0)[source]

Create and return a GCS bucket in a specific project.

open(filename, mode='r', read_buffer_size=16777216, mime_type='application/octet-stream')[source]

Open a GCS file path for reading or writing.

Parameters:
  • filename (str) – GCS file path in the form gs://<bucket>/<object>.

  • mode (str) – 'r' for reading or 'w' for writing.

  • read_buffer_size (int) – Buffer size to use during read operations.

  • mime_type (str) – Mime type to set for write operations.

Returns:

GCS file object.

Raises:

ValueError – Invalid open file mode.
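As a minimal sketch of the read and write modes described above, the helper below writes a payload and reads it back. The function name write_then_read is hypothetical, gcs is assumed to be a GcsIO instance (or any object with a compatible open() method), and the returned file objects are assumed to work as context managers.

```python
def write_then_read(gcs, path, payload):
    """Write bytes to a GCS path, then read them back.

    Assumptions: `gcs` behaves like GcsIO, `path` has the form
    gs://<bucket>/<object>, and the file objects returned by open()
    support the `with` statement.
    """
    # Mode 'w' returns a writable file object; leaving the block closes it.
    with gcs.open(path, 'w', mime_type='text/plain') as writer:
        writer.write(payload)
    # Mode 'r' returns a readable file object for the same path.
    with gcs.open(path, 'r') as reader:
        return reader.read()
```

Any mode other than the documented ones is expected to raise ValueError, so callers may want to validate the mode before calling open().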

delete(path)[source]

Deletes the object at the given GCS path.

Parameters:

path – GCS file path pattern in the form gs://<bucket>/<name>.

delete_batch(paths)[source]

Deletes the objects at the given GCS paths. Warning: any exception during batch delete will NOT be retried.

Parameters:

paths – List of GCS file path patterns, or dict with GCS file path patterns as keys. Each pattern has the form gs://<bucket>/<name>. The collection must not contain more than MAX_BATCH_OPERATION_SIZE entries.

Returns: List of tuples of (path, exception) in the same order as the paths argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
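Because a single call is limited in size and failures are reported rather than raised, a caller may want to split a long list into chunks and collect the failed entries. The sketch below illustrates one way to do that; delete_in_chunks and chunk_size are hypothetical names, and the caller is assumed to pass a chunk size no larger than MAX_BATCH_OPERATION_SIZE.

```python
def delete_in_chunks(gcs, paths, chunk_size):
    """Delete paths in batches and return the (path, exception) failures.

    Assumptions: `gcs` behaves like GcsIO, and `chunk_size` does not
    exceed the module's MAX_BATCH_OPERATION_SIZE.
    """
    paths = list(paths)
    failures = []
    for start in range(0, len(paths), chunk_size):
        chunk = paths[start:start + chunk_size]
        # Each result pairs a path with None (success) or an exception.
        for path, exception in gcs.delete_batch(chunk):
            if exception is not None:
                failures.append((path, exception))
    return failures
```

Returning the failures instead of raising keeps the decision about retrying or aborting with the caller.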

copy(src, dest)[source]

Copies the given GCS object from src to dest.

Parameters:
  • src – GCS file path pattern in the form gs://<bucket>/<name>.

  • dest – GCS file path pattern in the form gs://<bucket>/<name>.

Raises:

Any exceptions during copying

copy_batch(src_dest_pairs)[source]

Copies the given GCS objects from src to dest. Warning: any exception during batch copy will NOT be retried.

Parameters:

src_dest_pairs – List of (src, dest) tuples of gs://<bucket>/<name> file paths to copy from src to dest. The list must not contain more than MAX_BATCH_OPERATION_SIZE entries.

Returns: List of tuples of (src, dest, exception) in the same order as the src_dest_pairs argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
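Since exceptions during a batch copy are not retried, a caller that needs retries has to implement them. The following is a minimal sketch of one such wrapper, not part of the module: copy_batch_with_retry, max_attempts, and delay_seconds are hypothetical names, and the input is assumed to already fit within MAX_BATCH_OPERATION_SIZE.

```python
import time


def copy_batch_with_retry(gcs, src_dest_pairs, max_attempts=3,
                          delay_seconds=0.0):
    """Retry only the pairs that failed; return the remaining failures.

    Assumptions: `gcs` behaves like GcsIO and `src_dest_pairs` is small
    enough for a single copy_batch call.
    """
    pending = list(src_dest_pairs)
    failed = []
    for attempt in range(max_attempts):
        if not pending:
            break
        results = gcs.copy_batch(pending)
        # Keep the (src, dest, exception) triples that did not succeed.
        failed = [(s, d, e) for s, d, e in results if e is not None]
        pending = [(s, d) for s, d, _ in failed]
        if pending and attempt + 1 < max_attempts:
            time.sleep(delay_seconds)
    return failed
```

An empty return value means every pair was eventually copied; otherwise the list holds the last exception seen for each pair that still failed.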

copytree(src, dest)[source]

Copies the given GCS “directory” recursively from src to dest.

Parameters:
  • src – GCS file path pattern in the form gs://<bucket>/<name>/.

  • dest – GCS file path pattern in the form gs://<bucket>/<name>/.
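One way to picture a recursive copy of a GCS "directory" is as a mapping from every object under the src prefix to the same relative name under dest. The sketch below computes that mapping only; plan_copytree is a hypothetical helper, and the trailing-slash check is an assumption based on the path forms given above, not a statement of how the method is implemented.

```python
def plan_copytree(object_paths, src, dest):
    """Map each object under `src` to its counterpart under `dest`.

    Assumptions: `src` and `dest` end with '/', and `object_paths` is a
    listing of full gs:// object paths (for example from a prefix listing).
    """
    if not (src.endswith('/') and dest.endswith('/')):
        raise ValueError('src and dest must both end with "/"')
    pairs = []
    for path in object_paths:
        if path.startswith(src):
            # Keep the part of the name that follows the src prefix.
            relative = path[len(src):]
            pairs.append((path, dest + relative))
    return pairs
```

The resulting pairs have the (src, dest) shape that a per-object or batch copy would accept.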

rename(src, dest)[source]

Renames the given GCS object from src to dest.

Parameters:
  • src – GCS file path pattern in the form gs://<bucket>/<name>.

  • dest – GCS file path pattern in the form gs://<bucket>/<name>.

exists(path)[source]

Returns whether the given GCS object exists.

Parameters:

path – GCS file path pattern in the form gs://<bucket>/<name>.

checksum(path)[source]

Looks up the checksum of a GCS object.

Parameters:

path – GCS file path pattern in the form gs://<bucket>/<name>.

size(path)[source]

Returns the size of a single GCS object.

This method does not perform glob expansion. Hence the given path must be for a single GCS object.

Returns: size of the GCS object in bytes.

kms_key(path)[source]

Returns the KMS key of a single GCS object.

This method does not perform glob expansion. Hence the given path must be for a single GCS object.

Returns: KMS key name of the GCS object as a string, or None if it doesn’t have one.

last_updated(path)[source]

Returns the last updated epoch time of a single GCS object.

This method does not perform glob expansion. Hence the given path must be for a single GCS object.

Returns: last updated time of the GCS object in seconds.

list_prefix(path, with_metadata=False)[source]

Lists files matching the prefix.

list_prefix has been deprecated. Use list_files instead, which returns a generator of file information instead of a dict.

Parameters:
  • path – GCS file path pattern in the form gs://<bucket>/[name].

  • with_metadata – Experimental. Specifies whether to return file metadata.

Returns:

If with_metadata is False: dict of file name -> size; if with_metadata is True: dict of file name -> tuple(size, timestamp).

list_files(path, with_metadata=False)[source]

Lists files matching the prefix.

Parameters:
  • path – GCS file path pattern in the form gs://<bucket>/[name].

  • with_metadata – Experimental. Specifies whether to return file metadata.

Returns:

If with_metadata is False: generator of tuple(file name, size); if with_metadata is True: generator of tuple(file name, tuple(size, timestamp)).
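Because list_files returns a generator rather than a dict, results can be consumed one at a time without holding the whole listing in memory. A minimal sketch, assuming with_metadata is left at False so each item is a (file name, size) tuple; summarize_prefix is a hypothetical helper name.

```python
def summarize_prefix(gcs, prefix):
    """Return (object count, total bytes) for everything under `prefix`.

    Assumptions: `gcs` behaves like GcsIO and list_files is called with
    with_metadata=False, so each item is a (file name, size) tuple.
    """
    count = 0
    total_bytes = 0
    # Iterate lazily; the listing is never materialized as a list or dict.
    for _name, size in gcs.list_files(prefix):
        count += 1
        total_bytes += size
    return count, total_bytes
```

With with_metadata=True the second element of each tuple would itself be a (size, timestamp) tuple, so the unpacking in the loop would need to change accordingly.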

is_soft_delete_enabled(gcs_path)[source]

apache_beam.io.gcp.gcsio.create_storage_client(pipeline_options, use_credentials=True)[source]

Create a GCS client for Beam via GCS Client Library.

Returns:

A google.cloud.storage.client.Client instance.