Beam ZetaSQL lexical structure
Beam ZetaSQL supports standard SQL scalar data types as well as extensions including arrays, maps, and nested rows. This page documents the ZetaSQL data types supported in Beam ZetaSQL.
Data type properties
The following table contains data type properties and the data types that each property applies to:
Property | Description | Applies To |
---|---|---|
Nullable | NULL is a valid value. | All data types, with the following exceptions:
|
Orderable | Can be used in an ORDER BY clause. | All data types except for:
|
Groupable | Can generally appear in an expression followingGROUP BY , DISTINCT , or PARTITION BY .However, PARTITION BY expressions cannot includethe floating point types FLOAT and DOUBLE . | All data types except for:
|
Comparable | Values of the same type can be compared to each other. | All data types, with the following exceptions: ARRAY comparisons are not supported.
|
Numeric types
Numeric types include integer types and floating point types.
Integer type
Integers are numeric values that do not have fractional components.
Name | Storage Size | Range |
---|---|---|
INT64 | 8 bytes | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
Floating point type
Floating point values are approximate numeric values with fractional components.
Name | Storage Size | Description |
---|---|---|
FLOAT64 | 8 bytes | Double precision (approximate) decimal values. |
Boolean type
Name | Description |
---|---|
BOOL | Boolean values are represented by the keywords TRUE and
FALSE (case insensitive). |
String type
Name | Description |
---|---|
STRING | Variable-length character (Unicode) data. |
Input STRING values must be UTF-8 encoded and output STRING values will be UTF-8 encoded. Alternate encodings like CESU-8 and Modified UTF-8 are not treated as valid UTF-8.
All functions and operators that act on STRING values operate on Unicode
characters rather than bytes. For example, when functions like SUBSTR
and LENGTH
are applied to STRING input, the functions count Unicode characters, not bytes. Comparisons are
defined on Unicode characters. Comparisons for less than and ORDER BY
compare
character by character, and lower unicode code points are considered lower
characters.
Bytes type
Name | Description |
---|---|
BYTES | Variable-length binary data. |
STRING and BYTES are separate types that cannot be used interchangeably. Casts between STRING and BYTES enforce that the bytes are encoded using UTF-8.
Timestamp type
Caution: SQL has millisecond TIMESTAMP
precision. If a
TIMESTAMP
field has sub-millisecond precision, SQL
throws an IllegalArgumentException
.
Name | Description | Range |
---|---|---|
TIMESTAMP | Represents an absolute point in time, with millisecond precision. | 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999 UTC. |
A timestamp represents an absolute point in time, independent of any time zone or convention such as Daylight Savings Time.
Canonical format
YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDD]][time zone]
YYYY
: Four-digit year[M]M
: One or two digit month[D]D
: One or two digit day( |T)
: A space or aT
separator[H]H
: One or two digit hour (valid values from 00 to 23)[M]M
: One or two digit minutes (valid values from 00 to 59)[S]S
: One or two digit seconds (valid values from 00 to 59)[.DDD]
: Up to three fractional digits (i.e. up to millisecond precision)[time zone]
: String representing the time zone. See the time zones section for details.
Time zones are used when parsing timestamps or formatting timestamps for display. The timestamp value itself does not store a specific time zone. A string-formatted timestamp may include a time zone. When a time zone is not explicitly specified, the default time zone, UTC, is used.
Time zones
Time zones are represented by strings in one of these two canonical formats:
- Offset from Coordinated Universal Time (UTC), or the letter
Z
for UTC - Time zone name from the tz database
Offset from Coordinated Universal Time (UTC)
Offset Format
(+|-)H[H][:M[M]]
Z
Examples
-08:00
-8:15
+3:00
+07:30
-7
Z
When using this format, no space is allowed between the time zone and the rest of the timestamp.
2014-09-27 12:30:00.45-8:00
2014-09-27T12:30:00.45Z
Time zone name
Time zone names are from the tz database. For a less comprehensive but simpler reference, see the List of tz database time zones on Wikipedia.
Format
continent/[region/]city
Examples
America/Los_Angeles
America/Argentina/Buenos_Aires
When using a time zone name, a space is required between the name and the rest of the timestamp:
2014-09-27 12:30:00.45 America/Los_Angeles
Note that not all time zone names are interchangeable even if they do happen to
report the same time during a given part of the year. For example,
America/Los_Angeles
reports the same time as UTC-7:00
during Daylight
Savings Time, but reports the same time as UTC-8:00
outside of Daylight
Savings Time.
If a time zone is not specified, the default time zone value is used.
Leap seconds
A timestamp is simply an offset from 1970-01-01 00:00:00 UTC, assuming there are exactly 60 seconds per minute. Leap seconds are not represented as part of a stored timestamp.
If your input contains values that use ":60" in the seconds field to represent a leap second, that leap second is not preserved when converting to a timestamp value. Instead that value is interpreted as a timestamp with ":00" in the seconds field of the following minute.
Leap seconds do not affect timestamp computations. All timestamp computations are done using Unix-style timestamps, which do not reflect leap seconds. Leap seconds are only observable through functions that measure real-world time. In these functions, it is possible for a timestamp second to be skipped or repeated when there is a leap second.
Array type
Name | Description |
---|---|
ARRAY | Ordered list of zero or more elements of any non-ARRAY type. |
An ARRAY is an ordered list of zero or more elements of non-ARRAY values.
ARRAYs of ARRAYs are not allowed. Queries that would produce an ARRAY of
ARRAYs will return an error. Instead a STRUCT must be inserted between the
ARRAYs using the SELECT AS STRUCT
construct.
An empty ARRAY and a NULL
ARRAY are two distinct values. ARRAYs can contain
NULL
elements.
Declaring an ARRAY type
ARRAY types are declared using the angle brackets (<
and >
). The type
of the elements of an ARRAY can be arbitrarily complex with the exception that
an ARRAY cannot directly contain another ARRAY.
Format
ARRAY<T>
Examples
Type Declaration | Meaning |
---|---|
ARRAY<INT64> | Simple ARRAY of 64-bit integers. |
ARRAY<STRUCT<INT64, INT64>> | An ARRAY of STRUCTs, each of which contains two 64-bit integers. |
ARRAY<ARRAY<INT64>> (not supported) | This is an invalid type declaration which is included here just in case you came looking for how to create a multi-level ARRAY. ARRAYs cannot contain ARRAYs directly. Instead see the next example. |
ARRAY<STRUCT<ARRAY<INT64>>> | An ARRAY of ARRAYS of 64-bit integers. Notice that there is a STRUCT between the two ARRAYs because ARRAYs cannot hold other ARRAYs directly. |
Struct type
Name | Description |
---|---|
STRUCT | Container of ordered fields each with a type (required) and field name (optional). |
Declaring a STRUCT type
STRUCT types are declared using the angle brackets (<
and >
). The type of
the elements of a STRUCT can be arbitrarily complex.
Format
STRUCT<T>
Examples
Type Declaration | Meaning |
---|---|
STRUCT<INT64> | Simple STRUCT with a single unnamed 64-bit integer field. |
STRUCT<x STRUCT<y INT64, z INT64>> | A STRUCT with a nested STRUCT named x inside it. The STRUCT
x has two fields, y and z , both of which
are 64-bit integers. |
STRUCT<inner_array ARRAY<INT64>> | A STRUCT containing an ARRAY named inner_array that holds
64-bit integer elements. |
Limited comparisons for STRUCT
STRUCTs can be directly compared using equality operators:
- Equal (
=
) - Not Equal (
!=
or<>
) - [
NOT
]IN
Notice, though, that these direct equality comparisons compare the fields of the STRUCT pairwise in ordinal order ignoring any field names. If instead you want to compare identically named fields of a STRUCT, you can compare the individual fields directly.