Building Beam Python SDK Image Guide

There are two ways to build a Beam Python SDK image. If you only need to modify the Python SDK boot entrypoint binary, read Update Boot Entrypoint Application Only. If you need to build a Beam Python SDK image from scratch, read Build Beam Python SDK Image Fully.

Update Boot Entrypoint Application Only

If you only need to change the Python SDK boot entrypoint binary, you can rebuild just the boot application and add the updated binary to a preexisting image. Read the Python container Dockerfile for reference.

# From beam repo root, make changes to boot.go.
your_editor sdks/python/container/boot.go

# Rebuild the entrypoint
./gradlew :sdks:python:container:gobuild

cd sdks/python/container/build/target/launcher/linux_amd64

# Create a simple Dockerfile to use custom boot entrypoint.
cat >Dockerfile <<EOF
FROM apache/beam_python3.10_sdk:2.60.0
COPY boot /opt/apache/beam/boot
EOF

# Build the image
docker build . --tag us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom-boot
docker push us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom-boot
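The COPY step in the Dockerfile above fails if the boot binary is not in the build directory. A minimal pre-check, assuming Bash, could look like the sketch below. The function name has_boot_binary is a hypothetical helper for illustration and is not part of Beam.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check (not part of Beam): succeed only if an executable
# file named "boot" exists in the given directory (default: current directory).
has_boot_binary() {
  local dir="${1:-.}"
  [ -f "$dir/boot" ] && [ -x "$dir/boot" ]
}

if has_boot_binary .; then
  echo "boot binary found; ready for docker build"
else
  echo "boot binary missing; rerun the Gradle task and check the output directory" >&2
fi
```

Run it from the directory that holds the Dockerfile, before docker build.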

You can also build the Docker image locally if your environment has Java, Python, Golang, and Docker installed. Try ./gradlew :sdks:python:container:py<PYTHON_VERSION>:docker. For example, :sdks:python:container:py310:docker builds apache/beam_python3.10_sdk locally if it succeeds. If the build fails in your local environment, follow the rest of this guide to build a custom image from a VM.

Build Beam Python SDK Image Fully

This section describes how to build everything from scratch.

Prepare VM

Prepare a VM with Debian 11. This guide was tested on Debian 11.

Google Compute Engine

One way to create a Debian 11 VM is to use a Google Compute Engine (GCE) instance.

gcloud compute instances create beam-builder \
  --zone=us-central1-a \
  --image-project=debian-cloud \
  --image-family=debian-11 \
  --machine-type=n1-standard-8 \
  --boot-disk-size=20GB \
  --scopes=cloud-platform

Log in to the VM. All of the following steps are executed inside the VM.

gcloud compute ssh beam-builder --zone=us-central1-a --tunnel-through-iap

Update the apt package list.

sudo apt-get update

[!NOTE]

  • A machine type with many CPUs is recommended to reduce the compile time.
  • The image build needs a large disk. With the default 10GB disk size, the build fails with “no space left on device”.
  • The cloud-platform scope is recommended to avoid permission issues with Google Cloud Artifact Registry. You can use the default scopes if you don’t push the image to Google Cloud Artifact Registry.
  • If you push the image to Artifact Registry, use a zone in the same region as your Artifact Registry Docker repository.
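To see whether a VM matches the first two points before starting a long build, you can print its CPU count and free disk space. The sketch below assumes Bash with GNU coreutils, as on Debian 11. The helper name free_gib is hypothetical, and no pass/fail threshold is applied because this guide does not state an exact space requirement.

```shell
#!/usr/bin/env bash
# Print the free space, in whole GiB, on the filesystem holding a path.
# Uses POSIX "df -Pk", whose 4th column is the available space in KiB.
free_gib() {
  df -Pk "${1:-/}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

echo "CPU cores: $(nproc)"
echo "Free space on /: $(free_gib /) GiB"
```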

Prerequisite Packages

Java

You need Java to run Gradle tasks.

sudo apt-get install -y openjdk-11-jdk

Golang

Download and install Go. Reference: https://go.dev/doc/install.

# Download and install
curl -OL https://go.dev/dl/go1.23.2.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.2.linux-amd64.tar.gz

# Add go to PATH.
export PATH=/usr/local/go/bin:$PATH

Confirm the Golang version.

go version

Expected output:

go version go1.23.2 linux/amd64

[!NOTE] An old Go version (e.g. 1.16) will fail at :sdks:python:container:goBuild.
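The version check can also be scripted. The following is a sketch, assuming Bash and GNU sort with the -V (version sort) option. The helper version_ge and the variable MIN_GO are hypothetical. MIN_GO is set to the version installed in this guide and is not an official minimum for Beam.

```shell
#!/usr/bin/env bash
# Succeed if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the version this guide installs, not an official minimum.
MIN_GO="1.23.2"

if command -v go >/dev/null 2>&1; then
  # "go version" prints e.g. "go version go1.23.2 linux/amd64";
  # the third field is "go1.23.2".
  installed="$(go version | awk '{ print $3 }' | sed 's/^go//')"
  if version_ge "$installed" "$MIN_GO"; then
    echo "Go $installed is at least $MIN_GO"
  else
    echo "Go $installed is older than $MIN_GO" >&2
  fi
else
  echo "go is not on PATH" >&2
fi
```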

Python

This guide uses Pyenv to manage multiple Python versions. Reference: https://realpython.com/intro-to-pyenv/#build-dependencies

# Install dependencies
sudo apt-get install -y make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev \
libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev

# Install Pyenv
curl https://pyenv.run | bash

# Add pyenv to PATH.
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Install Python 3.9 and set the Python version. This will take several minutes.

pyenv install 3.9
pyenv global 3.9

Confirm the Python version.

python --version

Expected output example:

Python 3.9.17

[!NOTE] You can build with a different Python version by passing the -PpythonVersion option to the Gradle task. Otherwise, the build environment must have python3.9 for Apache Beam 2.60.0 or later (python3.8 for older Apache Beam versions). If you use the wrong version, the Gradle task :sdks:python:setupVirtualenv fails.
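To find the value to pass with -PpythonVersion, you can reduce the output of python --version to its major.minor part. This is a small sketch assuming Bash and sed. The helper name python_minor is hypothetical.

```shell
#!/usr/bin/env bash
# Reduce a "python --version" line such as "Python 3.9.17" to "3.9".
python_minor() {
  echo "$1" | sed -E 's/^Python ([0-9]+)\.([0-9]+).*$/\1.\2/'
}

if command -v python >/dev/null 2>&1; then
  minor="$(python_minor "$(python --version 2>&1)")"
  echo "Build-environment Python: $minor"
  echo "To build with it explicitly, pass: -PpythonVersion=$minor"
fi
```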

Docker

Install Docker following the reference.

# Add GPG keys.
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the Apt repository.
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# Install docker packages.
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

The Beam Python SDK image build needs to run the docker command without root privileges. You can allow this by adding your account to the docker group.

sudo usermod -aG docker $USER
newgrp docker

Confirm that you can run a container without root privileges.

docker run hello-world
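If docker run reports a permission error, the current shell session may not have picked up the new group yet. The sketch below, assuming Bash, checks the group list reported by id -nG. The helper name in_docker_group is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed if the word "docker" appears in a space-separated group list.
in_docker_group() {
  case " $1 " in
    *" docker "*) return 0 ;;
    *) return 1 ;;
  esac
}

# "id -nG" prints the current user's group names on one line.
if in_docker_group "$(id -nG)"; then
  echo "current session is in the docker group"
else
  echo "not in the docker group yet; run newgrp docker or log in again" >&2
fi
```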

Git

Git is not required for building the Python SDK image. This guide uses Git only to download the Apache Beam code.

sudo apt-get install -y git

Build Beam Python SDK Image

Download Apache Beam from the GitHub repository.

git clone https://github.com/apache/beam beam
cd beam

Make changes to the Apache Beam code.

Run the Gradle task to start the Docker image build. This will take several minutes. You can run :sdks:python:container:py<PYTHON_VERSION>:docker to build an image for a different Python version. See the supported Python version list. For example, py310 is for Python 3.10.

./gradlew :sdks:python:container:py310:docker

If the build is successful, you can see the built image locally.

docker images

Expected output:

REPOSITORY                   TAG       IMAGE ID       CREATED              SIZE
apache/beam_python3.10_sdk   2.60.0    33db45f57f25   About a minute ago   2.79GB

[!NOTE] If you run the build in your local environment and the Gradle task :sdks:python:setupVirtualenv fails because of an incompatible Python version, try passing -PpythonVersion with the Python version installed in your local environment (e.g. -PpythonVersion=3.10).
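The task name and the -PpythonVersion flag can be derived from one version string. The sketch below assumes Bash and follows the py<PYTHON_VERSION> naming shown in this guide (3.10 becomes py310). The helper docker_task_for is hypothetical, and the script only prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Map a Python version such as "3.10" to the Gradle task named in this guide,
# ":sdks:python:container:py310:docker", by dropping the dot.
docker_task_for() {
  local version="$1"
  echo ":sdks:python:container:py${version//./}:docker"
}

PY_VERSION="3.10"
# Print the full command for review; run it from the beam repo root.
echo "./gradlew $(docker_task_for "$PY_VERSION") -PpythonVersion=$PY_VERSION"
```

Whether a given version has a matching container project depends on the supported Python version list for your Beam release.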

Push to Repository

You can push the custom image to an image repository. The image can then be used as a Dataflow custom container.

Google Cloud Artifact Registry

You can push the image to Artifact Registry. No additional authentication is necessary if you use Google Compute Engine.

docker tag apache/beam_python3.10_sdk:2.60.0 us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom
docker push us-central1-docker.pkg.dev/<MY_PROJECT>/<MY_REPOSITORY>/beam_python3.10_sdk:2.60.0-custom

If you push an image from an environment other than a VM in Google Cloud, configure Docker authentication with gcloud before running docker push.
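One way to do this is gcloud auth configure-docker, which registers gcloud as a Docker credential helper for the registry host you name. The sketch below, assuming Bash, derives that host from the image name and prints the command to run. The helper registry_host and the values my-project and my-repo are placeholders for illustration.

```shell
#!/usr/bin/env bash
# Extract the registry host (everything before the first "/") from an image name.
registry_host() {
  echo "${1%%/*}"
}

# Placeholder image name; substitute your own project and repository.
IMAGE="us-central1-docker.pkg.dev/my-project/my-repo/beam_python3.10_sdk:2.60.0-custom"
HOST="$(registry_host "$IMAGE")"

# Print the authentication command for review before running it.
echo "gcloud auth configure-docker $HOST"
```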

Docker Hub

You can push to your Docker Hub repository after running docker login.

docker tag apache/beam_python3.10_sdk:2.60.0 <my-account>/beam_python3.10_sdk:2.60.0-custom
docker push <my-account>/beam_python3.10_sdk:2.60.0-custom