apache_beam.ml.transforms.base module

class apache_beam.ml.transforms.base.MLTransform(*, write_artifact_location: str | None = None, read_artifact_location: str | None = None, transforms: list[MLTransformProvider] | None = None)[source]

Bases: PTransform[PCollection[ExampleT], PCollection[MLTransformOutputT] | tuple[PCollection[MLTransformOutputT], PCollection[Row]]], Generic[ExampleT, MLTransformOutputT]

MLTransform is a Beam PTransform that can be used to apply transformations to the data. MLTransform is used to wrap the data processing transforms provided by Apache Beam. MLTransform works in two modes: write and read. In the write mode, MLTransform will apply the transforms to the data and store the artifacts in the write_artifact_location. In the read mode, MLTransform will read the artifacts from the read_artifact_location and apply the transforms to the data. The artifact location should be a valid storage path where the artifacts can be written to or read from.

Note that when consuming artifacts, it is not necessary to pass the transforms since they are inherently stored within the artifacts themselves.

Parameters:
  • write_artifact_location – A storage location for artifacts resulting from MLTransform. These artifacts include transformations applied to the dataset and generated values like min, max from ScaleTo01, and mean, var from ScaleToZScore. Artifacts are produced and written to this location when using write_artifact_mode. Later MLTransforms can reuse produced artifacts by setting read_artifact_mode instead of write_artifact_mode. The value assigned to write_artifact_location should be a valid storage directory that the artifacts from this transform can be written to. If no directory exists at this location, one will be created. This will overwrite any artifacts already in this location, so distinct locations should be used for each instance of MLTransform. Only one of write_artifact_location and read_artifact_location should be specified.

  • read_artifact_location – A storage location to read artifacts resulting froma previous MLTransform. These artifacts include transformations applied to the dataset and generated values like min, max from ScaleTo01, and mean, var from ScaleToZScore. Note that when consuming artifacts, it is not necessary to pass the transforms since they are inherently stored within the artifacts themselves. The value assigned to read_artifact_location should be a valid storage path where the artifacts can be read from. Only one of write_artifact_location and read_artifact_location should be specified.

  • transforms – A list of transforms to apply to the data. All the transforms are applied in the order they are specified. The input of the i-th transform is the output of the (i-1)-th transform. Multi-input transforms are not supported yet.

expand(pcoll: PCollection[ExampleT]) PCollection[MLTransformOutputT] | tuple[PCollection[MLTransformOutputT], PCollection[Row]][source]

This is the entrypoint for the MLTransform. This method will invoke the process_data() method of the ProcessHandler instance to process the incoming data.

process_data takes in a PCollection and applies the PTransforms necessary to process the data and returns a PCollection of transformed data. :param pcoll: A PCollection of ExampleT type.

Returns:

A PCollection of MLTransformOutputT type

with_transform(transform: MLTransformProvider)[source]

Add a transform to the MLTransform pipeline. :param transform: A BaseOperation instance.

Returns:

A MLTransform instance.

with_exception_handling(*, exc_class=<class 'Exception'>, use_subprocess=False, threshold=1)[source]
class apache_beam.ml.transforms.base.ProcessHandler(label: str | None = None)[source]

Bases: PTransform[PCollection[ExampleT], PCollection[MLTransformOutputT] | tuple[PCollection[MLTransformOutputT], PCollection[Row]]], ABC

Only for internal use. No backwards compatibility guarantees.

abstract append_transform(transform: BaseOperation)[source]

Append transforms to the ProcessHandler.

class apache_beam.ml.transforms.base.MLTransformProvider[source]

Bases: object

Data processing transforms that are intended to be used with MLTransform should subclass MLTransformProvider and implement get_ptransform_for_processing().

get_ptransform_for_processing() method should return a PTransform that can be used to process the data.

abstract get_ptransform_for_processing(**kwargs) PTransform[source]

Returns a PTransform that can be used to process the data.

get_counter()[source]

Returns the counter name for the data processing transform.

class apache_beam.ml.transforms.base.BaseOperation(columns: list[str])[source]

Bases: Generic[OperationInputT, OperationOutputT], MLTransformProvider, ABC

Base Opertation class data processing transformations. :param columns: List of column names to apply the transformation.

abstract apply_transform(data: OperationInputT, output_column_name: str) dict[str, OperationOutputT][source]

Define any processing logic in the apply_transform() method. processing logics are applied on inputs and returns a transformed output. :param inputs: input data.

class apache_beam.ml.transforms.base.EmbeddingsManager(columns: list[str], *, load_model_args: dict[str, Any] | None = None, min_batch_size: int | None = None, max_batch_size: int | None = None, large_model: bool = False, **kwargs)[source]

Bases: MLTransformProvider

abstract get_model_handler() ModelHandler[source]

Return framework specific model handler.

get_columns_to_apply()[source]