Beam SQL overview

Beam SQL lets you query bounded and unbounded PCollections with SQL statements. It is currently available only in the Beam Java and Python SDKs. Your SQL query is translated to a PTransform, an encapsulated segment of a Beam pipeline. You can freely mix SQL PTransforms and other PTransforms in your pipeline.

Beam SQL uses Calcite SQL, a dialect based on Apache Calcite that is widespread in big data processing.

Note: Beam SQL support for the ZetaSQL dialect has been deprecated.

To change dialects, pass the dialect’s full package name to the setPlannerName method in the PipelineOptions interface.
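
As a minimal sketch of the step above, the following Java snippet sets the planner explicitly. It assumes the beam-sdks-java-extensions-sql dependency is on the classpath, that the setter is reached through BeamSqlPipelineOptions (the SQL-specific options interface), and that the Calcite planner's fully qualified class name is the one shown. The class name PlannerSelection and the helper withCalcitePlanner are hypothetical names used only for illustration.

```java
import org.apache.beam.sdk.extensions.sql.impl.BeamSqlPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PlannerSelection {
  // Assumed fully qualified class name of the Calcite planner.
  static final String CALCITE_PLANNER =
      "org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner";

  // Hypothetical helper: parses command-line arguments, views the result as
  // SQL options, and selects the planner by class name.
  static BeamSqlPipelineOptions withCalcitePlanner(String[] args) {
    BeamSqlPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(BeamSqlPipelineOptions.class);
    options.setPlannerName(CALCITE_PLANNER);
    return options;
  }

  public static void main(String[] args) {
    // Prints the planner class name that the pipeline would use.
    System.out.println(withCalcitePlanner(args).getPlannerName());
  }
}
```

The resulting options object can then be passed to Pipeline.create(options) so that every SqlTransform in that pipeline uses the selected planner.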

There are two additional concepts you need to know to use SQL in your pipeline:

SqlTransform: the interface for creating PTransforms from SQL queries.
Row: the type of elements that Beam SQL operates on. A PCollection<Row> plays the role of a table.
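
To make these two concepts concrete, here is a small Java sketch that builds a PCollection<Row>, queries it with SqlTransform, and then hands the result to an ordinary PTransform. The schema, the field names product and units, the sample values, and the class and method names are all hypothetical. The sketch assumes the Beam SQL extension and a runner such as the direct runner are on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamSqlSketch {
  // Hypothetical schema: each Row is one order line.
  static final Schema ORDER_SCHEMA =
      Schema.builder().addStringField("product").addInt32Field("units").build();

  // Builds an in-memory PCollection<Row> and filters it with a SQL query.
  static PCollection<Row> largeOrders(Pipeline pipeline) {
    PCollection<Row> orders =
        pipeline.apply(
            Create.of(
                    Row.withSchema(ORDER_SCHEMA).addValues("apple", 3).build(),
                    Row.withSchema(ORDER_SCHEMA).addValues("pear", 12).build())
                .withRowSchema(ORDER_SCHEMA));
    // A single input PCollection is addressed as the table PCOLLECTION.
    return orders.apply(
        SqlTransform.query("SELECT product, units FROM PCOLLECTION WHERE units > 10"));
  }

  // Mixes the SQL PTransform with a regular PTransform that reads one field.
  static PCollection<String> largeOrderProducts(Pipeline pipeline) {
    return largeOrders(pipeline)
        .apply(
            MapElements.into(TypeDescriptors.strings())
                .via((Row row) -> row.getString("product")));
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    largeOrderProducts(pipeline);
    pipeline.run().waitUntilFinish();
  }
}
```

The output of SqlTransform is itself a PCollection<Row>, which is why it can feed either another SQL query or any other PTransform that accepts rows.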

Walkthrough

The SQL pipeline walkthrough shows how to use Beam SQL, with example code.

Shell

The Beam SQL shell allows you to write pipelines as SQL queries without using the Java SDK. The Shell page describes how to work with the interactive Beam SQL shell.

Apache Calcite dialect

The Beam Calcite SQL overview summarizes Apache Calcite operators, functions, syntax, and data types supported by Beam Calcite SQL.

Beam SQL extensions

Beam SQL has additional extensions that leverage Beam’s unified batch/streaming model and support processing of complex data types. You can use these extensions with all Beam SQL dialects.