apache_beam.ml.rag.enrichment.bigquery_vector_search module

class apache_beam.ml.rag.enrichment.bigquery_vector_search.BigQueryVectorSearchParameters(project: str, table_name: str, embedding_column: str, columns: list[str], neighbor_count: int, metadata_restriction_template: str | Callable[[EmbeddableItem], str] | None = None, distance_type: str | None = None, options: dict[str, Any] | None = None, include_distance: bool = False)[source]

Bases: object

Parameters for configuring BigQuery vector similarity search.

This class is used by BigQueryVectorSearchEnrichmentHandler to perform vector similarity search using BigQuery’s VECTOR_SEARCH function. It processes EmbeddableItem objects that contain an Embedding and returns similar vectors from a BigQuery table.

BigQueryVectorSearchEnrichmentHandler is used with the Enrichment transform to enrich EmbeddableItems with similar content from a vector database. For example:

>>> # Create search parameters (project is a required argument)
>>> params = BigQueryVectorSearchParameters(
...     project='my-project',
...     table_name='project.dataset.embeddings',
...     embedding_column='embedding',
...     columns=['content'],
...     neighbor_count=5
... )
>>> # Use in pipeline
>>> enriched = (
...     embeddable_items
...     | "Generate Embeddings" >> MLTransform(...)
...     | "Find Similar" >> Enrichment(
...         BigQueryVectorSearchEnrichmentHandler(
...             project='my-project',
...             vector_search_parameters=params
...         )
...     )
... )

BigQueryVectorSearchParameters encapsulates the configuration needed to perform vector similarity search using BigQuery’s VECTOR_SEARCH function. It handles formatting the query with proper embedding vectors and metadata restrictions.

Example with flattened metadata column:

Table schema:

embedding: ARRAY<FLOAT64>  # Vector embedding
content: STRING           # Document content
language: STRING          # Direct metadata column

Code:

>>> params = BigQueryVectorSearchParameters(
...     project='my-project',
...     table_name='project.dataset.embeddings',
...     embedding_column='embedding',
...     columns=['content', 'language'],
...     neighbor_count=5,
...     # For column 'language', value comes from
...     # embeddable_item.metadata['language']
...     metadata_restriction_template="language = '{language}'"
... )
>>> # When processing an embeddable_item with
>>> # metadata={'language': 'en'}, generates: WHERE language = 'en'
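
The placeholder substitution described in the comments above can be pictured with ordinary Python string formatting. This is a minimal sketch, assuming the template is filled from the item's metadata dict in the manner of `str.format`; `render_restriction` is a hypothetical helper for illustration and is not part of apache_beam, whose handler may implement the substitution differently.

```python
def render_restriction(template: str, metadata: dict) -> str:
    """Fill {placeholders} in a restriction template from a metadata dict.

    Hypothetical helper for illustration; not part of apache_beam.
    """
    # Keys in metadata that the template does not mention are ignored.
    return template.format(**metadata)


clause = render_restriction("language = '{language}'", {"language": "en"})
print("WHERE " + clause)  # WHERE language = 'en'
```

In this sketch a metadata dict that lacks the referenced key raises KeyError; how the real handler treats a missing key is not stated in this reference.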

Example with nested repeated metadata:

Table schema:

embedding: ARRAY<FLOAT64>  # Vector embedding
content: STRING           # Document content
metadata: ARRAY<STRUCT<    # Nested repeated metadata
  key: STRING,
  value: STRING
>>

Code:

>>> params = BigQueryVectorSearchParameters(
...     project='my-project',
...     table_name='project.dataset.embeddings',
...     embedding_column='embedding',
...     columns=['content', 'metadata'],
...     neighbor_count=5,
...     # check_metadata(field_name, key_to_search,
...     # value_from_embeddable_item)
...     metadata_restriction_template=(
...         "check_metadata(metadata, 'language', '{language}')"
...     )
... )
>>> # When processing an embeddable_item with
>>> # metadata={'language': 'en'},
>>> # generates: WHERE check_metadata(metadata, 'language', 'en')
>>> # Searches for {key: 'language', value: 'en'} in metadata array
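
The comments above say that check_metadata searches the metadata array for a matching key/value pair. A pure-Python analogue of that predicate is sketched below. It models the described semantics only, on hypothetical in-memory rows; the real check is evaluated inside BigQuery, and this is not how the function is defined there.

```python
def check_metadata(metadata: list, search_key: str, search_value: str) -> bool:
    """Return True if any {key, value} entry in the array matches.

    Python stand-in for the SQL predicate; illustration only.
    """
    return any(
        entry["key"] == search_key and entry["value"] == search_value
        for entry in metadata
    )


# Hypothetical rows shaped like the nested repeated schema above.
rows = [
    {"content": "hello", "metadata": [{"key": "language", "value": "en"}]},
    {"content": "bonjour", "metadata": [{"key": "language", "value": "fr"}]},
]

matches = [
    row["content"]
    for row in rows
    if check_metadata(row["metadata"], "language", "en")
]
print(matches)  # ['hello']
```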

Parameters:

project – GCP project ID containing the BigQuery dataset
table_name – Fully qualified BigQuery table name containing vectors.
embedding_column – Column name containing the embedding vectors.
columns – List of columns to retrieve from matched vectors.
neighbor_count – Number of similar vectors to return (top-k).
metadata_restriction_template –
Template string or callable for filtering vectors. Template string supports two formats:
1. For flattened metadata columns: column_name = '{metadata_key}' where column_name is the BigQuery column and metadata_key is used to get the value from embeddable_item.metadata[metadata_key].
2. For nested repeated metadata (ARRAY<STRUCT<key,value>>): check_metadata(field_name, 'key_to_match', '{metadata_key}') where field_name is the ARRAY<STRUCT> column in BigQuery, key_to_match is the literal key to search for in the array, and metadata_key is used to get value from embeddable_item.metadata[metadata_key].
Multiple conditions can be combined using AND/OR operators. For example:
```
>>> # Combine metadata check with column filter
>>> template = (
...     "check_metadata(metadata, 'language', '{language}') "
...     "AND source = '{source}'"
... )
>>> # When embeddable_item.metadata = {'language': 'en',
>>> # 'source': 'web'}
>>> # Generates: WHERE
>>> #             check_metadata(metadata, 'language', 'en')
>>> #           AND source = 'web'
```
distance_type – Optional distance metric to use. Supported values: COSINE (default), EUCLIDEAN, or DOT_PRODUCT.
options – Optional dictionary of additional VECTOR_SEARCH options.
include_distance – Returns the vector search similarity score if True.
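
Besides a template string, metadata_restriction_template accepts a callable that takes an EmbeddableItem and returns a string. The sketch below shows one case where a callable helps: a condition that is added only when the item carries the relevant metadata, which a fixed template cannot express. `FakeItem` is a hypothetical stand-in that models only the `metadata` attribute, and the sketch assumes the returned string is used the same way as a rendered template.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeItem:
    """Stand-in for EmbeddableItem; only the metadata attribute is modelled."""
    metadata: dict[str, Any] = field(default_factory=dict)


def language_and_source(item: FakeItem) -> str:
    """Build a restriction, adding the source filter only when present."""
    conditions = [f"language = '{item.metadata['language']}'"]
    source = item.metadata.get("source")
    if source is not None:
        conditions.append(f"source = '{source}'")
    return " AND ".join(conditions)


print(language_and_source(FakeItem({"language": "en", "source": "web"})))
# language = 'en' AND source = 'web'
print(language_and_source(FakeItem({"language": "en"})))
# language = 'en'
```

Because the values are interpolated directly into SQL text, metadata that comes from untrusted input should be validated or escaped before it reaches a restriction like this.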

project: str

table_name: str

embedding_column: str

columns: list[str]

neighbor_count: int

metadata_restriction_template: str | Callable[[EmbeddableItem], str] | None = None

distance_type: str | None = None

options: dict[str, Any] | None = None

include_distance: bool = False

format_query(items: list[EmbeddableItem]) → str[source]: Format the vector search query template.
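
To give a feel for the kind of SQL involved, the sketch below assembles a VECTOR_SEARCH query for a single query vector using BigQuery's documented argument order (base table, column to search, query table, then named arguments such as top_k and distance_type). It is an illustration only: `sketch_vector_search_query` and its parameters are hypothetical, and the text that format_query actually produces for a batch of items is not reproduced here and may differ.

```python
from __future__ import annotations


def sketch_vector_search_query(
    table_name: str,
    embedding_column: str,
    columns: list[str],
    neighbor_count: int,
    item_id: str,
    embedding: list[float],
    restriction: str | None = None,
    distance_type: str | None = None,
) -> str:
    """Build an illustrative VECTOR_SEARCH query for one query vector."""
    # Pre-filter the base table with a subquery when a restriction is given.
    if restriction:
        base = f"(SELECT * FROM `{table_name}` WHERE {restriction})"
    else:
        base = f"TABLE `{table_name}`"
    selected = ", ".join(f"base.{column}" for column in columns)
    vector = ", ".join(str(value) for value in embedding)
    args = [
        base,
        f"'{embedding_column}'",
        f"(SELECT '{item_id}' AS id, [{vector}] AS {embedding_column})",
        f"top_k => {neighbor_count}",
    ]
    # Only pass distance_type when the caller set one.
    if distance_type:
        args.append(f"distance_type => '{distance_type}'")
    return (
        f"SELECT query.id, {selected}, distance\n"
        "FROM VECTOR_SEARCH(\n  " + ",\n  ".join(args) + "\n)"
    )


print(
    sketch_vector_search_query(
        table_name="project.dataset.embeddings",
        embedding_column="embedding",
        columns=["content"],
        neighbor_count=5,
        item_id="query1",
        embedding=[0.1, 0.2, 0.3],
        restriction="language = 'en'",
        distance_type="COSINE",
    )
)
```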

class apache_beam.ml.rag.enrichment.bigquery_vector_search.BigQueryVectorSearchEnrichmentHandler(vector_search_parameters: BigQueryVectorSearchParameters, *, min_batch_size: int = 1, max_batch_size: int = 1000, log_query=False, **kwargs)[source]

Bases: EnrichmentSourceHandler[EmbeddableItem | list[EmbeddableItem], list[tuple[EmbeddableItem, dict[str, Any]]]]

Enrichment handler that performs vector similarity search using BigQuery.

This handler enriches EmbeddableItems by finding similar vectors in a BigQuery table using the VECTOR_SEARCH function. It supports batching requests for efficiency and preserves the original metadata while adding the search results.

Example

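>>> import apache_beam as beam
>>> from apache_beam.ml.rag.enrichment.bigquery_vector_search import (
...     BigQueryVectorSearchEnrichmentHandler,
...     BigQueryVectorSearchParameters,
... )
>>> from apache_beam.ml.rag.types import EmbeddableItem
>>> from apache_beam.ml.rag.types import Content, Embedding
>>> from apache_beam.transforms.enrichment import Enrichment
>>>
>>> # Configure vector search
>>> params = BigQueryVectorSearchParameters(
...     project='my-project',
...     table_name='project.dataset.embeddings',
...     embedding_column='embedding',
...     columns=['content', 'metadata'],
...     neighbor_count=2,
...     metadata_restriction_template="language = '{language}'"
... )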
>>>
>>> # Create handler
>>> handler = BigQueryVectorSearchEnrichmentHandler(
...     project='my-project',
...     vector_search_parameters=params,
...     min_batch_size=100,
...     max_batch_size=1000
... )
>>>
>>> # Use in pipeline
>>> with beam.Pipeline() as p:
...     enriched = (
...         p
...         | beam.Create([
...             EmbeddableItem(
...                 id='query1',
...                 embedding=Embedding(dense_embedding=[0.1, 0.2, 0.3]),
...                 content=Content(text='test query'),
...                 metadata={'language': 'en'}
...             )
...         ])
...         | Enrichment(handler)
...     )

Parameters:

vector_search_parameters – Configuration for the vector search query
min_batch_size – Minimum number of items to process in one batch
max_batch_size – Maximum number of items to process in one batch
log_query – Debug option to log the BigQuery query
**kwargs – Additional arguments passed to bigquery.Client

The handler will:

1. Batch incoming embeddable_items according to batch size parameters
2. Format and execute vector search query for each batch
3. Join results back to original embeddable_items
4. Return tuples of (original_embeddable_item, search_results)
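
The four steps above can be sketched in plain Python. This is a simplified model, not the handler's implementation: items are plain dicts, `fake_query` stands in for the BigQuery call, and joining on an `id` field with a `chunks` list in each result row is an assumption made for the illustration.

```python
from typing import Any, Callable


def batches(items: list, max_batch_size: int):
    """Step 1: split the input into batches of at most max_batch_size."""
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


def enrich(
    items: list[dict],
    run_query: Callable[[list[dict]], list[dict]],
    max_batch_size: int = 1000,
) -> list[tuple[dict, dict[str, Any]]]:
    results = []
    for batch in batches(items, max_batch_size):
        # Step 2: one query per batch (run_query replaces the BigQuery call).
        rows = run_query(batch)
        # Step 3: index result rows by id so they can be joined to the items.
        rows_by_id = {row["id"]: row for row in rows}
        # Step 4: emit (original_item, search_results) tuples.
        for item in batch:
            results.append((item, rows_by_id.get(item["id"], {})))
    return results


def fake_query(batch: list[dict]) -> list[dict]:
    """Hypothetical search backend returning one neighbor per item."""
    return [
        {"id": item["id"], "chunks": [{"content": "doc for " + item["id"]}]}
        for item in batch
    ]


items = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]
for item, result in enrich(items, fake_query, max_batch_size=2):
    print(item["id"], "->", result["chunks"][0]["content"])
```

In a real pipeline the batching is driven by the handler's batch size parameters rather than by a hand-written loop.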

batch_elements_kwargs() → dict[str, int][source]: Returns kwargs for beam.BatchElements.

apache_beam.ml.rag.enrichment.bigquery_vector_search.join_fn(left: Embedding, right: dict[str, Any]) → Embedding[source]