Apache Beam 2.53.0

We are happy to present the new 2.53.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.53.0, check out the detailed release notes.

Highlights

  • Python streaming users that use 2.47.0 and newer versions of Beam should update to version 2.53.0, which fixes a known issue: (#27330).

I/Os

  • TextIO now supports skipping multiple header lines (Java) (#17990).
  • Python GCSIO is now implemented with GCP GCS Client instead of apitools (#25676)
  • Adding support for LowCardinality DataType in ClickHouse (Java) (#29533).
  • Added support for handling bad records to KafkaIO (Java) (#29546)
  • Add support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models.(#29564)
  • NATS IO connector added (Go) (#29000).

New Features / Improvements

  • The Python SDK now type checks collections.abc.Collections types properly. Some type hints that were erroneously allowed by the SDK may now fail. (#29272)
  • Running multi-language pipelines locally no longer requires Docker. Instead, the same (generally auto-started) subprocess used to perform the expansion can also be used as the cross-language worker.
  • Framework for adding Error Handlers to composite transforms added in Java (#29164).
  • Python 3.11 images now include google-cloud-profiler (#29561).

Deprecations

  • Euphoria DSL is deprecated and will be removed in a future release (not before 2.56.0) (#29451)

Bugfixes

  • (Python) Fixed sporadic crashes in streaming pipelines that affected some users of 2.47.0 and newer SDKs (#27330).
  • (Python) Fixed a bug that caused MLTransform to drop identical elements in the output PCollection (#29600).

Security Fixes

Known Issues

  • Potential race condition causing NPE in DataflowExecutionStateSampler in Dataflow Java Streaming pipelines (#29987).
  • Some Python pipelines that run with 2.52.0-2.54.0 SDKs and use large materialized side inputs might be affected by a performance regression. To restore the prior behavior on these SDK versions, supply the --max_cache_memory_usage_mb=0 pipeline option. (#30360).
  • Python pipelines that run with 2.53.0-2.54.0 SDKs and perform file operations on GCS might be affected by excess HTTP requests. This could lead to a performance regression or a permission issue. (#28398)
  • In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work. Symptoms of this issue include slowness or stuckness in long-running jobs. Fixed in 2.56.0 (#30679).
  • Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue (#32169). The issue will be fixed in 2.59.0 (#32135). To work around this, update the google-cloud-storage package to version 2.18.2 or newer.

For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md

List of Contributors

According to git shortlog, the following people contributed to the 2.53.0 release. Thank you to all contributors!

Ahmed Abualsaud

Ahmet Altay

Alexey Romanenko

Anand Inguva

Arun Pandian

Balázs Németh

Bruno Volpato

Byron Ellis

Calvin Swenson Jr

Chamikara Jayalath

Clay Johnson

Damon

Danny McCormick

Ferran Fernández Garrido

Georgii Zemlianyi

Israel Herraiz

Jack McCluskey

Jacob Tomlinson

Jan Lukavský

JayajP

Jeffrey Kinard

Johanna Öjeling

Julian Braha

Julien Tournay

Kenneth Knowles

Lawrence Qiu

Mark Zitnik

Mattie Fu

Michel Davit

Mike Williamson

Naireen

Naireen Hussain

Niel Markwick

Pablo Estrada

Radosław Stankiewicz

Rebecca Szper

Reuven Lax

Ritesh Ghorse

Robert Bradshaw

Robert Burke

Sam Rohde

Sam Whittle

Shunping Huang

Svetak Sundhar

Talat UYARER

Tom Stepp

Tony Tang

Vlado Djerek

Yi Hu

Zechen Jiang

clmccart

damccorm

darshan-sj

gabry.wu

johnjcasey

liferoad

lrakla

martin trieu

tvalentyn