2023/08/30
Apache Beam 2.50.0
Robert Burke [@lostluck]
We are happy to present the new 2.50.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.
For more information on changes in 2.50.0, check out the detailed release notes.
Highlights
- Spark 3.2.2 is used as default version for Spark runner (#23804).
- The Go SDK has a new default local runner, called Prism (#24789).
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures.
I/Os
- Java KafkaIO now supports picking up topics via topicPattern (#26948)
- Added support for reading from the Cosmos DB Core SQL API (#23604)
- Upgraded to HBase 2.5.5 for HBaseIO. (Java) (#27711)
- Added support for GoogleAdsIO source (Java) (#27681).
New Features / Improvements
- The Go SDK now requires Go 1.20 to build. (#27558)
- The Go SDK has a new default local runner, Prism (#24789).
  - Prism is a portable runner that executes each transform independently, ensuring that coders are exercised.
  - At this point it supersedes the Go direct runner in functionality. The Go direct runner is now deprecated.
  - See https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/README.md for the goals and features of Prism.
- Hugging Face Model Handler for RunInference added to Python SDK. (#26632)
- Hugging Face Pipelines support for RunInference added to Python SDK. (#27399)
- Vertex AI Model Handler for RunInference now supports private endpoints (#27696)
- MLTransform transform added with support for common ML pre/postprocessing operations (#26795)
- Upgraded the Kryo extension for the Java SDK to Kryo 5.5.0. This brings in bug fixes, performance improvements, and serialization of Java 14 records. (#27635)
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures (#27674). The multi-arch container images include:
  - All versions of Go, Python, Java and TypeScript SDK containers.
  - All versions of Flink job server containers.
  - Java and Python expansion service containers.
  - Transform service controller container.
  - Spark3 job server container.
- Added support for batched writes to AWS SQS for improved throughput (Java, AWS 2) (#21429).
Breaking Changes
- Python SDK: Legacy runner support removed from Dataflow; all pipelines must use runner v2.
- Python SDK: Dataflow Runner will no longer stage the Beam SDK from PyPI in the --staging_location at pipeline submission. Custom container images that are not based on Beam’s default image must include an Apache Beam installation (#26996).
Deprecations
- The Go direct runner is now deprecated. It remains available to reduce migration churn.
  - Tests can be set back to the direct runner by overriding TestMain:
    func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }
  - It’s recommended to fix issues seen in tests using Prism, as they can also happen on any portable runner.
    - Use the generic register package for your pipeline DoFns to ensure pipelines function on portable runners, like Prism.
    - Do not rely on closures or package globals for DoFn configuration. They don’t function on portable runners.
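The advice about closures and package globals can be illustrated without Beam. The following is a minimal sketch using only the Go standard library, under the assumption that a portable runner hands each DoFn to a worker in serialized form. It does not use the Beam API: AddPrefixFn, roundTrip and closureIsSerializable are hypothetical names invented for this example, and JSON stands in for whatever encoding the runner actually uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AddPrefixFn is a hypothetical struct-based function object. Its
// configuration lives in an exported field, so it survives encoding.
type AddPrefixFn struct {
	Prefix string `json:"prefix"`
}

// ProcessElement applies the configured prefix to a single element.
func (f *AddPrefixFn) ProcessElement(word string) string {
	return f.Prefix + word
}

// roundTrip stands in for shipping a function object to a worker:
// encode it, then decode a fresh copy on the "other side".
func roundTrip(fn *AddPrefixFn) (*AddPrefixFn, error) {
	data, err := json.Marshal(fn)
	if err != nil {
		return nil, err
	}
	var out AddPrefixFn
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return &out, nil
}

// closureIsSerializable reports whether a closure that captures its
// configuration can be encoded at all.
func closureIsSerializable() bool {
	prefix := "beam-"
	fn := func(word string) string { return prefix + word }
	_, err := json.Marshal(fn) // encoding/json rejects func values
	return err == nil
}

func main() {
	worker, err := roundTrip(&AddPrefixFn{Prefix: "beam-"})
	if err != nil {
		panic(err)
	}
	fmt.Println(worker.ProcessElement("go")) // prints "beam-go"
	fmt.Println(closureIsSerializable())     // prints "false"
}
```

The struct keeps its configuration across the round trip, while the closure cannot be encoded at all. A package global has a similar problem: a value assigned in the submitting process is not set in a separate worker process.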
Bugfixes
- Fixed a DirectRunner bug in the Python SDK where GroupByKey gets an empty PCollection and fails when the pipeline option direct_num_workers != 1 (#27373).
- Fixed a BigQuery I/O bug when estimating size on queries that utilize row-level security (#27474).
- Beam Python containers rely on a version of Debian/aom that has several security vulnerabilities: CVE-2021-30474, CVE-2021-30475, CVE-2021-30473, CVE-2020-36133, CVE-2020-36131, CVE-2020-36130, and CVE-2020-36135.
Known Issues
- Long-running Python pipelines might experience a memory leak: #28246.
- Python pipelines using BigQuery IO or the orjson dependency might experience segmentation faults or get stuck: #28318.
- Python SDK’s cross-language Bigtable sink mishandles records that don’t have an explicit timestamp set: #28632. To avoid this issue, set explicit timestamps for all records before writing to Bigtable.
List of Contributors
According to git shortlog, the following people contributed to the 2.50.0 release. Thank you to all contributors!
Abacn
acejune
AdalbertMemSQL
ahmedabu98
Ahmed Abualsaud
al97
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrey Devyatkin
Anton Shalkovich
ArjunGHUB
Bjorn Pedersen
BjornPrime
Brett Morgan
Bruno Volpato
Buqian Zheng
Burke Davison
Byron Ellis
bzablocki
case-k
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Connor Brett
Damon
Damon Douglas
Dan Hansen
Danny McCormick
Darkhan Nausharipov
Dip Patel
Dmytro Sadovnychyi
Florent Biville
Gabriel Lacroix
Hai Joey Tran
Hong Liang Teoh
Jack McCluskey
James Fricker
Jeff Kinard
Jeff Zhang
Jing
johnjcasey
jon esperanza
Josef Šimánek
Kenneth Knowles
Laksh
Liam Miller-Cushon
liferoad
magicgoody
Mahmud Ridwan
Manav Garg
Marco Vela
martin trieu
Mattie Fu
Michel Davit
Moritz Mack
mosche
Peter Sobot
Pranav Bhandari
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Saba Sathya
Sam Whittle
Steven Niemitz
Steven van Rossum
Svetak Sundhar
Tony Tang
Valentyn Tymofieiev
Vitaly Terentyev
Vlado Djerek
Yichi Zhang
Yi Hu
Zechen Jiang