Beam SQL overview
Beam SQL allows a Beam user (currently only available in Beam Java and Python) to query
bounded and unbounded PCollections
with SQL statements. Your SQL query
is translated to a PTransform
, an encapsulated segment of a Beam pipeline.
You can freely mix SQL PTransforms
and other PTransforms
in your pipeline.
Beam SQL includes the following dialects:
Beam Calcite SQL is a variant of Apache Calcite, a dialect widespread in big data processing. Beam Calcite SQL is the default Beam SQL dialect. Beam ZetaSQL is more compatible with BigQuery, so it’s especially useful in pipelines that write to or read from BigQuery tables.
To change dialects, pass the dialect’s full package name to the setPlannerName
method in the PipelineOptions
interface.
There are two additional concepts you need to know to use SQL in your pipeline:
- SqlTransform: the interface for creating
PTransforms
from SQL queries. - Row:
the type of elements that Beam SQL operates on. A
PCollection<Row>
plays the role of a table.
Walkthrough
The SQL pipeline walkthrough works through how to use Beam SQL with example code.
Shell
The Beam SQL shell allows you to write pipelines as SQL queries without using the Java SDK. The Shell page describes how to work with the interactive Beam SQL shell.
Apache Calcite dialect
The Beam Calcite SQL overview summarizes Apache Calcite operators, functions, syntax, and data types supported by Beam Calcite SQL.
ZetaSQL dialect
For more information on the ZetaSQL features in Beam SQL, see the Beam ZetaSQL dialect reference.
To switch to Beam ZetaSQL, configure the pipeline options as follows:
PipelineOptions options = ...;
options
.as(BeamSqlPipelineOptions.class)
.setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner");
Note, Use of the ZetaSQLQueryPlanner
requires an additional dependency on beam-sdks-java-extensions-sql-zetasql
in addition to the beam-sdks-java-extensions-sql
package required for CalciteQueryPlanner
.
Beam SQL extensions
Beam SQL has additional extensions leveraging Beam’s unified batch/streaming model and processing complex data types. You can use these extensions with all Beam SQL dialects.