apache_beam.ml.gcp.naturallanguageml module

class apache_beam.ml.gcp.naturallanguageml.Document(content: str, type: str | Type = 'PLAIN_TEXT', language_hint: str | None = None, encoding: str | None = 'UTF8', from_gcs: bool = False)[source]

Bases: object

Represents the input to the AnnotateText transform.

Parameters:

content (str) – The content of the input or the Google Cloud Storage URI where the file is stored.
type (Union[str, google.cloud.language_v1.Document.Type]) – Text type. Possible values are HTML, PLAIN_TEXT. The default value is PLAIN_TEXT.
language_hint (Optional[str]) – The language of the text. If not specified, the language is detected automatically. Values should conform to the ISO 639-1 standard.
encoding (Optional[str]) – Text encoding. Possible values are: NONE, UTF8, UTF16, UTF32. The default value is UTF8.
from_gcs (bool) – Whether the content should be interpreted as a Google Cloud Storage URI. The default value is False.

static to_dict(document: Document) → Mapping[str, str | None][source]
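The parameters above can be sketched as follows. This is only an illustration: document_kwargs and make_document are hypothetical helper names, not part of the module, and the gs:// prefix check for setting from_gcs is an assumption made for the example. The Beam import is deferred into make_document so the keyword-building logic can be tried without the Beam GCP extras installed.

```python
from typing import Any, Dict, Optional


def document_kwargs(content: str,
                    language_hint: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for Document (hypothetical helper)."""
    return {
        "content": content,
        "type": "PLAIN_TEXT",
        "language_hint": language_hint,
        "encoding": "UTF8",
        # Assumption for this sketch: content starting with gs:// is a
        # Google Cloud Storage URI rather than literal text.
        "from_gcs": content.startswith("gs://"),
    }


def make_document(content: str, language_hint: Optional[str] = None):
    """Create a Document; requires apache-beam with the GCP extras."""
    # Imported lazily so the helper above works without Beam installed.
    from apache_beam.ml.gcp.naturallanguageml import Document

    return Document(**document_kwargs(content, language_hint))
```

Calling make_document("gs://my-bucket/review.txt") would then produce a Document with from_gcs=True, while make_document("Great product", language_hint="en") would carry the text inline. The bucket name here is a placeholder.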

apache_beam.ml.gcp.naturallanguageml.AnnotateText(features: Mapping[str, bool] | Features, timeout: float | None = None, metadata: Sequence[Tuple[str, str]] | None = None)[source]

A PTransform for annotating text using the Google Cloud Natural Language API: https://cloud.google.com/natural-language/docs.

Parameters:

pcoll (PCollection) – An input PCollection of Document objects.
features (Union[Mapping[str, bool], types.AnnotateTextRequest.Features]) – A dictionary of natural language operations to be performed on the given text, in the following format: {'extract_syntax': True, 'extract_entities': True}
timeout (Optional[float]) – The amount of time, in seconds, to wait for the request to complete. The timeout applies to each individual retry attempt.
metadata (Optional[Sequence[Tuple[str, str]]]) – Additional metadata that is provided to the method.
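A minimal sketch of how these parameters might fit together in a pipeline is shown below. The helper build_features, the KNOWN_FEATURES tuple, and the run function are hypothetical names introduced for illustration; the feature names are assumed from the Features message of the Natural Language API v1. Running the pipeline itself requires apache-beam with the GCP extras and valid Google Cloud credentials.

```python
from typing import Dict, Iterable

# Assumption: feature names taken from the v1 AnnotateTextRequest.Features
# message of the Cloud Natural Language API.
KNOWN_FEATURES = (
    "extract_syntax",
    "extract_entities",
    "extract_document_sentiment",
    "extract_entity_sentiment",
    "classify_text",
)


def build_features(*enabled: str) -> Dict[str, bool]:
    """Return a features mapping, rejecting misspelled names early."""
    unknown = sorted(set(enabled) - set(KNOWN_FEATURES))
    if unknown:
        raise ValueError(f"Unknown feature(s): {unknown}")
    return {name: True for name in enabled}


def run(texts: Iterable[str], timeout: float = 30.0) -> None:
    """Annotate each text and print the API response (hypothetical driver)."""
    # Imported lazily so build_features can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.ml.gcp.naturallanguageml import AnnotateText, Document

    features = build_features("extract_syntax", "extract_entities")
    with beam.Pipeline() as pipeline:
        _ = (
            pipeline
            | "CreateTexts" >> beam.Create(list(texts))
            | "ToDocument" >> beam.Map(lambda text: Document(text))
            | "Annotate" >> AnnotateText(features, timeout=timeout)
            | "Print" >> beam.Map(print)
        )
```

Validating the keys before building the pipeline is a design choice for this sketch: a misspelled feature name then fails locally instead of surfacing as an API error at run time.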