apache_beam.transforms.enrichment_handlers.bigquery module
- class apache_beam.transforms.enrichment_handlers.bigquery.BigQueryEnrichmentHandler(project: str, *, table_name: str = '', row_restriction_template: str = '', fields: List[str] | None = None, column_names: List[str] | None = None, condition_value_fn: Callable[[Row], List[Any]] | None = None, query_fn: Callable[[Row], str] | None = None, min_batch_size: int = 1, max_batch_size: int = 10000, **kwargs)
Bases: EnrichmentSourceHandler[Row | List[Row], Row | List[Row]]
Enrichment handler for Google Cloud BigQuery.
Use this handler with the apache_beam.transforms.enrichment.Enrichment transform.
- To use this handler, provide one of the following parameter combinations:
table_name, row_restriction_template, fields
table_name, row_restriction_template, condition_value_fn
query_fn
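The three combinations above might be configured as in the following sketch. The `id` field, the placeholder table name, and the helper names `condition_values`, `build_query`, and `make_handlers` are assumptions made for illustration. The Beam import is deferred into `make_handlers` so that the two helper functions can be tried without Beam installed.

```python
from types import SimpleNamespace
from typing import Any, List

# Placeholder table name; replace with a real project.dataset.table.
TABLE = "project.dataset.table"


def condition_values(row) -> List[Any]:
    """Values that fill the {} placeholders, in order (second combination)."""
    return [row.id]


def build_query(row) -> str:
    """Complete SQL string for one input row (third combination)."""
    # int() rejects non-numeric ids before the value is interpolated.
    return f"SELECT * FROM `{TABLE}` WHERE id = {int(row.id)}"


def make_handlers(project: str):
    """Build one handler per documented parameter combination."""
    from apache_beam.transforms.enrichment_handlers.bigquery import (
        BigQueryEnrichmentHandler,
    )

    by_fields = BigQueryEnrichmentHandler(
        project=project,
        table_name=TABLE,
        row_restriction_template="id = {}",
        fields=["id"],
    )
    by_values = BigQueryEnrichmentHandler(
        project=project,
        table_name=TABLE,
        row_restriction_template="id = {}",
        condition_value_fn=condition_values,
    )
    by_query = BigQueryEnrichmentHandler(project=project, query_fn=build_query)
    return by_fields, by_values, by_query


# Stand-in for a beam.Row, used only to exercise the helpers without Beam.
example_row = SimpleNamespace(id=7)
print(condition_values(example_row))
print(build_query(example_row))
```

Calling `make_handlers("my-project")` requires Apache Beam with the GCP extras installed; the handlers would then be passed to the Enrichment transform.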
By default, the handler pulls all columns from the BigQuery table. To override this, use the column_names parameter to specify a list of column names to fetch.
This handler pulls data from BigQuery per element by default. To change this behavior, set the min_batch_size and max_batch_size parameters. These minimum and maximum batch sizes are passed to the apache_beam.transforms.util.BatchElements transform.
NOTE: Elements cannot be batched when using the query_fn parameter.
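To make the effect of batching concrete, the sketch below shows one way a batch of input rows could be answered by a single query: each row's filled-in restriction becomes one branch of an OR. This is an illustration under stated assumptions, not necessarily how the handler builds its query; `build_batched_query` and the sample rows are hypothetical names.

```python
from types import SimpleNamespace
from typing import List, Optional


def build_batched_query(
    table_name: str,
    row_restriction_template: str,
    fields: List[str],
    rows: list,
    column_names: Optional[List[str]] = None,
) -> str:
    """Combine the per-row restrictions of one batch into a single query."""
    # Select every column unless specific column names were requested.
    columns = ", ".join(column_names) if column_names else "*"
    clauses = []
    for row in rows:
        # Read the placeholder values from the row, in the order of `fields`.
        values = [getattr(row, name) for name in fields]
        clauses.append("(" + row_restriction_template.format(*values) + ")")
    return f"SELECT {columns} FROM {table_name} WHERE " + " OR ".join(clauses)


# Stand-ins for beam.Row elements that were grouped into one batch.
batch = [SimpleNamespace(id=1), SimpleNamespace(id=2)]
query = build_batched_query("project.dataset.table", "id = {}", ["id"], batch)
print(query)
```

A query_fn returns one complete SQL string per element, so there is no template to merge across rows, which is consistent with the restriction that batching is unavailable in that mode.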
- Example Usage:
- handler = BigQueryEnrichmentHandler(project=project_name,
row_restriction_template="id='{}'", table_name='project.dataset.table', fields=fields, min_batch_size=2, max_batch_size=100)
- Parameters:
project – Google Cloud project ID for the BigQuery table.
table_name (str) – Fully qualified BigQuery table name in the format project.dataset.table.
row_restriction_template (str) – A template string for the WHERE clause in the BigQuery query with placeholders ({}) to dynamically filter rows based on input data.
fields (Optional[List[str]]) – List of field names present in the input beam.Row. These are used to construct the WHERE clause (if condition_value_fn is not provided).
column_names (Optional[List[str]]) – Names of columns to select from the BigQuery table. If not provided, all columns (*) are selected.
condition_value_fn (Optional[Callable[[beam.Row], List[Any]]]) – A function that takes a beam.Row and returns a list of values to fill the {} placeholders in the WHERE clause of the query.
query_fn (Optional[Callable[[beam.Row], str]]) – A function that takes a beam.Row and returns a complete BigQuery SQL query string.
min_batch_size (int) – Minimum number of rows to batch together when querying BigQuery. Defaults to 1 if query_fn is not specified.
max_batch_size (int) – Maximum number of rows to batch together. Defaults to 10,000 if query_fn is not specified.
**kwargs – Additional keyword arguments to pass to bigquery.Client.
Note
min_batch_size and max_batch_size cannot be defined if query_fn is provided.
Either fields or condition_value_fn must be provided for query construction if query_fn is not provided.
Ensure appropriate permissions are granted for BigQuery access.
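The constraints in this note can be checked before a handler is constructed. The following is a minimal sketch of such a pre-flight check, written only from the rules documented here; `check_handler_args` and its `batch_sizes_given` flag are hypothetical and are not part of the Beam API.

```python
from typing import Any, Callable, List, Optional


def check_handler_args(
    *,
    table_name: str = "",
    row_restriction_template: str = "",
    fields: Optional[List[str]] = None,
    condition_value_fn: Optional[Callable[[Any], List[Any]]] = None,
    query_fn: Optional[Callable[[Any], str]] = None,
    batch_sizes_given: bool = False,
) -> str:
    """Return the name of the mode that applies, or raise ValueError."""
    if query_fn is not None:
        # Batch sizes cannot be defined together with query_fn.
        if batch_sizes_given:
            raise ValueError(
                "min_batch_size and max_batch_size cannot be set with query_fn")
        return "query_fn"
    # Without query_fn, the query is built from a table and a template.
    if not (table_name and row_restriction_template):
        raise ValueError(
            "table_name and row_restriction_template are required "
            "when query_fn is not provided")
    # fields is only used when condition_value_fn is not provided.
    if condition_value_fn is not None:
        return "condition_value_fn"
    if fields:
        return "fields"
    raise ValueError("provide either fields or condition_value_fn")


mode = check_handler_args(
    table_name="project.dataset.table",
    row_restriction_template="id = {}",
    fields=["id"],
)
print(mode)
```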