Beam ZetaSQL lexical structure
Beam ZetaSQL statements are comprised of a series of tokens. Tokens include identifiers, quoted identifiers, literals, keywords, operators, and special characters. Tokens can be separated by whitespace (space, backspace, tab, newline) or comments.
Identifiers
Identifiers are names that are associated with columns, tables, and other database objects.
There are two ways to specify an identifier: unquoted or quoted:
- Unquoted identifiers must begin with a letter or an underscore. Subsequent characters can be letters, numbers, or underscores.
- Quoted identifiers are enclosed by backtick (`) characters and can contain any character, such as spaces or symbols. However, quoted identifiers cannot be empty. Reserved Keywords can only be used as identifiers if enclosed by backticks.
Syntax (presented as a grammar with regular expressions, ignoring whitespace):
identifier: { quoted_identifier | unquoted_identifier } unquoted_identifier:[A-Za-z_][A-Za-z_0-9]*
quoted_identifier:\`[^\\\`\r\n]
any_escape*\`
any_escape:\\(. | \n | \r | \r\n)
Examples:
Customers5
_dataField1
ADGROUP
Invalid examples:
5Customers
_dataField!
GROUP
5Customers
begins with a number, not a letter or underscore. _dataField!
contains the special character "!" which is not a letter, number, or underscore.
GROUP
is a reserved keyword, and therefore cannot be used as an identifier
without being enclosed by backtick characters.
Both identifiers and quoted identifiers are case insensitive, with some nuances. See Case Sensitivity for further details.
Quoted identifiers have the same escape sequences as string literals, defined below.
Literals
A literal represents a constant value of a built-in data type. Some, but not all, data types can be expressed as literals.
String and Bytes Literals
Both string and bytes literals must be quoted, either with single ('
) or
double ("
) quotation marks, or triple-quoted with groups of three single
('''
) or three double ("""
) quotation marks.
Quoted literals:
Literal | Examples | Description |
---|---|---|
Quoted string |
| Quoted strings enclosed by single (' ) quotes can contain unescaped double (" ) quotes, and vice versa.Backslashes ( \ ) introduce escape sequences. See Escape Sequences table below.Quoted strings cannot contain newlines, even when preceded by a backslash ( \ ). |
Triple-quoted string |
| Embedded newlines and quotes are allowed without escaping - see fourth example. Backslashes ( \ ) introduce escape sequences. See Escape Sequences table below.A trailing unescaped backslash ( \ ) at the end of a line is not allowed.Three unescaped quotes in a row which match the starting quotes will end the string. |
Raw string |
| Quoted or triple-quoted literals that have the raw string literal prefix (r or R ) are interpreted as raw/regex strings.Backslash characters ( \ ) do not act as escape characters. If a backslash followed by another character occurs inside the string literal, both characters are preserved.A raw string cannot end with an odd number of backslashes. Raw strings are useful for constructing regular expressions. |
Prefix characters (r
, R
, b
, B)
are optional for quoted or triple-quoted strings, and indicate that the string is a raw/regex string or a byte sequence, respectively. For
example, b'abc'
and b'''abc'''
are both interpreted as type bytes. Prefix characters are case insensitive.
Quoted literals with prefixes:
The table below lists all valid escape sequences for representing non-alphanumeric characters in string literals. Any sequence not in this table produces an error.
Escape Sequence | Description |
---|---|
\a | Bell |
\b | Backspace |
\f | Formfeed |
\n | Newline |
\r | Carriage Return |
\t | Tab |
\v | Vertical Tab |
\\ | Backslash (\ ) |
\? | Question Mark (? ) |
\" | Double Quote (" ) |
\' | Single Quote (' ) |
\` | Backtick (` ) |
\ooo | Octal escape, with exactly three digits (in the range 0-7). Decodes to a single Unicode character (in string literals). |
\xhh or \Xhh | Hex escape, with exactly two hex digits (0-9 or A-F or a-f). Decodes to a single Unicode character (in string literals). Examples:
|
\uhhhh | Unicode escape, with lowercase 'u' and exactly four hex digits. Valid only in string literals or identifiers. Note that the range D800-DFFF is not allowed, as these are surrogate unicode values. |
\Uhhhhhhhh | Unicode escape, with uppercase 'U' and exactly eight hex digits. Valid only in string literals or identifiers. Note that the range D800-DFFF is not allowed, as these are surrogate unicode values. Also, values greater than 10FFFF are not allowed. |
Integer Literals
Integer literals are either a sequence of decimal digits (0 through
9) or a hexadecimal value that is prefixed with "0x
". Integers can be prefixed
by "+
" or "-
" to represent positive and negative values, respectively.
Examples:
123
0xABC
-123
An integer literal is interpreted as an INT64
.
Floating Point Literals
Syntax options:
[+-]DIGITS.[DIGITS][e[+-]DIGITS]
[DIGITS].DIGITS[e[+-]DIGITS]
DIGITSe[+-]DIGITS
DIGITS
represents one or more decimal numbers (0 through 9) and e
represents the exponent marker (e or E).
Examples:
123.456e-67
.1E4
58.
4e2
Numeric literals that contain either a decimal point or an exponent marker are presumed to be type double.
Implicit coercion of floating point literals to float type is possible if the value is within the valid float range.
There is no literal representation of NaN or infinity.
Array Literals
Array literals are a comma-separated lists of elements
enclosed in square brackets. The ARRAY
keyword is optional, and an explicit
element type T is also optional.
Examples:
[1, 2, 3]
['x', 'y', 'xy']
ARRAY[1, 2, 3]
ARRAY<string>['x', 'y', 'xy']
Timestamp literals
Syntax:
TIMESTAMP 'YYYY-[M]M-[D]D [[H]H:[M]M:[S]S[.DDDDDD]] [timezone]'
Timestamp literals contain the TIMESTAMP
keyword and a string literal that
conforms to the canonical timestamp format, enclosed in single quotation marks.
Timestamp literals support a range between the years 1 and 9999, inclusive. Timestamps outside of this range are invalid.
A timestamp literal can include a numerical suffix to indicate the timezone:
TIMESTAMP '2014-09-27 12:30:00.45-08'
If this suffix is absent, the default timezone, UTC, is used.
For example, the following timestamp represents 12:30 p.m. on September 27, 2014, using the timezone, UTC:
TIMESTAMP '2014-09-27 12:30:00.45'
For more information on timezones, see Timezone.
String literals with the canonical timestamp format, including those with
timezone names, implicitly coerce to a timestamp literal when used where a
timestamp expression is expected. For example, in the following query, the
string literal "2014-09-27 12:30:00.45 America/Los_Angeles"
is coerced
to a timestamp literal.
SELECT * FROM foo
WHERE timestamp_col = "2014-09-27 12:30:00.45 America/Los_Angeles"
Timezone
Since timestamp literals must be mapped to a specific point in time, a timezone is necessary to correctly interpret a literal. If a timezone is not specified as part of the literal itself, then the default timezone value, which is set by the Beam SQL implementation, is used.
Timezones are represented by strings in the following canonical format, which represents the offset from Coordinated Universal Time (UTC).
Format:
(+|-)H[H][:M[M]]
Examples:
'-08:00'
'-8:15'
'+3:00'
'+07:30'
'-7'
Timezones can also be expressed using string timezone names from the
tz database. For a less comprehensive but
simpler reference, see the
List of tz database timezones
on Wikipedia. Canonical timezone names have the format
<continent/[region/]city>
, such as America/Los_Angeles
.
Note that not all timezone names are interchangeable even if they do happen to
report the same time during a given part of the year. For example, America/Los_Angeles
reports the same time as UTC-7:00
during Daylight Savings Time, but reports the same time as UTC-8:00
outside of Daylight Savings Time.
Example:
TIMESTAMP '2014-09-27 12:30:00 America/Los_Angeles'
TIMESTAMP '2014-09-27 12:30:00 America/Argentina/Buenos_Aires'
Case Sensitivity
Beam SQL follows these rules for case sensitivity:
Category | Case Sensitive? | Notes |
---|---|---|
Keywords | No | |
Function names | No | |
Table names | See Notes | Table names are usually case insensitive, but may be case sensitive when querying a database that uses case sensitive table names. |
Column names | No | |
String values | Yes | |
String comparisons | Yes | |
Aliases within a query | No | |
Regular expression matching | See Notes | Regular expression matching is case sensitive by default, unless the expression itself specifies that it should be case insensitive. |
LIKE matching | Yes |
Reserved Keywords
Keywords are a group of tokens that have special meaning in the Beam SQL language, and have the following characteristics:
- Keywords cannot be used as identifiers unless enclosed by backtick (`) characters.
- Keywords are case insensitive.
Beam SQL has the following reserved keywords.
ALL AND ANY ARRAY AS ASC ASSERT_ROWS_MODIFIED AT BETWEEN BY CASE CAST COLLATE CONTAINS CREATE CROSS CUBE CURRENT DEFAULT DEFINE DESC DISTINCT ELSE END | ENUM ESCAPE EXCEPT EXCLUDE EXISTS EXTRACT FALSE FETCH FOLLOWING FOR FROM FULL GROUP GROUPING GROUPS HASH HAVING IF IGNORE IN INNER INTERSECT INTERVAL INTO | IS JOIN LATERAL LEFT LIKE LIMIT LOOKUP MERGE NATURAL NEW NO NOT NULL NULLS OF ON OR ORDER OUTER OVER PARTITION PRECEDING PROTO RANGE | RECURSIVE RESPECT RIGHT ROLLUP ROWS SELECT SET SOME STRUCT TABLESAMPLE THEN TO TREAT TRUE UNBOUNDED UNION UNNEST USING WHEN WHERE WINDOW WITH WITHIN |
Terminating Semicolons
Statements can optionally use a terminating semicolon (;
) in the context of a query string submitted through an Application Programming Interface (API). Some interactive tools require statements to have a terminating semicolon.
In a request containing multiple statements, statements must be separated by semicolons, but the semicolon is optional for the final statement.
Comments
Comments are sequences of characters that are ignored by the parser. Beam SQL supports the following types of comments.
Single line comments
Single line comments are supported by prepending #
or --
before the
comment.
Examples
SELECT x FROM T; # x is a field and T is a table
Comment includes all characters from the '#' character to the end of the line.
SELECT x FROM T; --x is a field and T is a table
Comment includes all characters from the '--
' sequence to the end of the line. You can optionally add a space after the '--
'.
Multiline comments
Multiline comments are supported by enclosing the comment using /* <comment> */
.
Example:
SELECT x FROM T /* x is a field and T is a table */
WHERE x = 3;
Invalid example:
SELECT x FROM T /* comment starts here
/* comment ends on this line */
this line is not considered a comment */
WHERE x = 3;
Comment includes all characters, including newlines, enclosed by the first
occurrence of '/*
' and the first subsequent occurrence of '*/
'. Nested
comments are not supported. The second example contains a nested comment that
renders the query invalid.