Container environments

The Beam SDK runtime environment can be containerized with Docker to isolate it from other runtime systems. To learn more about the container environment, read the Beam SDK Harness container contract.

Prebuilt SDK container images are released per supported language during Beam releases and pushed to Docker Hub.

Custom containers

You may want to customize container images for many reasons, including:

This guide describes how to create and use customized containers for the Beam SDKs.

Prerequisites

NOTE: On Nov 20, 2020, Docker Hub put rate limits into effect for anonymous and free authenticated use, which may impact larger pipelines that pull containers several times.

For optimal user experience, we also recommend you use the latest released version of Beam.

Building and pushing custom containers

Beam SDK container images are built from Dockerfiles checked into the GitHub repository and published to Docker Hub for every release. You can build customized containers in one of three ways:

  1. Writing a new Dockerfile based on a released container image. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.
  2. Modifying a source Dockerfile in Beam. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).
  3. Modifying an existing container image to make it compatible with Apache Beam Runners. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.

Writing a new Dockerfile based on an existing published container image

  1. Create a new Dockerfile that designates a base image using the FROM instruction.
FROM apache/beam_python3.7_sdk:2.25.0

ENV FOO=bar
COPY /src/path/to/file /dest/path/to/file/

This Dockerfile uses the prebuilt Python 3.7 SDK container image beam_python3.7_sdk tagged at (SDK version) 2.25.0, and adds an additional environment variable and file to the image.

  2. Build and push the image using Docker.
export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0"
export IMAGE_NAME="myremoterepo/mybeamsdk"
# Avoid using `latest` with custom containers to make reproducing failures easier.
export TAG="mybeamsdk-versioned-tag"

# Optional - pull the base image into your local Docker daemon to ensure
# you have the most up-to-date version of the base image locally.
docker pull "${BASE_IMAGE}"

docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" .
  3. If your runner is running remotely, retag and push the image to the appropriate repository.
docker push "${IMAGE_NAME}:${TAG}"
  4. After pushing a container image, verify that the remote image ID and digest match the local image ID and digest, as output by docker build or docker images.
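
The verification step above can be sketched in shell. This is a minimal sketch, not a definitive procedure: the helper names (digest_from_push_line, digest_from_repo_digest, digests_match) are hypothetical, and it assumes that docker push ends with a line of the form "<tag>: digest: sha256:... size: N" and that docker inspect reports the repo digest after the push.

```shell
# Extract "sha256:<hex>" from a line of `docker push` output such as
#   mybeamsdk-versioned-tag: digest: sha256:0123abcd size: 2841
digest_from_push_line() {
  printf '%s\n' "$1" | sed -n 's/.*digest: \(sha256:[0-9a-f]*\).*/\1/p'
}

# Extract "sha256:<hex>" from a repo digest such as
#   myremoterepo/mybeamsdk@sha256:0123abcd
digest_from_repo_digest() {
  printf '%s\n' "${1##*@}"
}

# Succeed only when both digests are non-empty and identical.
digests_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (requires a Docker daemon and push access; IMAGE_NAME and TAG as above):
#   pushed="$(digest_from_push_line "$(docker push "${IMAGE_NAME}:${TAG}" | tail -n 1)")"
#   recorded="$(digest_from_repo_digest \
#     "$(docker inspect --format '{{index .RepoDigests 0}}' "${IMAGE_NAME}:${TAG}")")"
#   digests_match "$pushed" "$recorded" && echo "digests match"
```

The parsing is kept separate from the Docker calls so it can be checked without a daemon or registry.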

Modifying a source Dockerfile in Beam

This method requires building image artifacts from Beam source. For additional instructions on setting up your development environment, see the Contribution guide.

NOTE: It is recommended that you start from a stable release branch (release-X.XX.X) corresponding to the same version of the SDK used to run your pipeline. Differences in SDK version may result in unexpected errors.

  1. Clone the beam repository.
export BEAM_SDK_VERSION="2.26.0"
git clone https://github.com/apache/beam.git
cd beam

# Save current directory as working directory
export BEAM_WORKDIR=$PWD

git checkout origin/release-$BEAM_SDK_VERSION
  2. Customize the Dockerfile for a given language, typically found at sdks/<language>/container/Dockerfile (e.g. the Dockerfile for Python).

  3. Return to the root Beam directory and run the Gradle docker target for your image.

cd $BEAM_WORKDIR

# The default repository of each SDK
./gradlew :sdks:java:container:java8:docker
./gradlew :sdks:java:container:java11:docker
./gradlew :sdks:java:container:java17:docker
./gradlew :sdks:go:container:docker
./gradlew :sdks:python:container:py38:docker
./gradlew :sdks:python:container:py39:docker
./gradlew :sdks:python:container:py310:docker
./gradlew :sdks:python:container:py311:docker

# Shortcut for building all Python SDKs
./gradlew :sdks:python:container:buildAll
  4. Verify the images you built were created by running docker images.
$> docker images --digests
REPOSITORY                         TAG                  DIGEST                   IMAGE ID         CREATED           SIZE
apache/beam_java8_sdk              latest               sha256:...               ...              1 min ago         ...
apache/beam_java11_sdk             latest               sha256:...               ...              1 min ago         ...
apache/beam_java17_sdk             latest               sha256:...               ...              1 min ago         ...
apache/beam_python3.6_sdk          latest               sha256:...               ...              1 min ago         ...
apache/beam_python3.7_sdk          latest               sha256:...               ...              1 min ago         ...
apache/beam_python3.8_sdk          latest               sha256:...               ...              1 min ago         ...
apache/beam_python3.9_sdk          latest               sha256:...               ...              1 min ago         ...
apache/beam_python3.10_sdk         latest               sha256:...               ...              1 min ago         ...
apache/beam_go_sdk                 latest               sha256:...               ...              1 min ago         ...
  5. If your runner is running remotely, retag the image and push the image to your repository. You can skip this step if you provide a custom repo/tag as additional parameters.
export BEAM_SDK_VERSION="2.26.0"
export IMAGE_NAME="gcr.io/my-gcp-project/beam_python3.7_sdk"
export TAG="${BEAM_SDK_VERSION}-custom"

docker tag apache/beam_python3.7_sdk "${IMAGE_NAME}:${TAG}"
docker push "${IMAGE_NAME}:${TAG}"
  6. After pushing a container image, verify that the remote image ID and digest match the local image ID and digest output from docker images --digests.
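
If you built several images, the retag-and-push step can be looped. The following is a minimal sketch: retag_ref is a hypothetical helper, and the repository root and tag are the placeholder values from the step above.

```shell
# Map a locally built image name to a reference in your own repository,
# keeping the image's base name:
#   apache/beam_go_sdk -> <repository root>/beam_go_sdk:<tag>
retag_ref() {
  # $1 = local image name, $2 = target repository root, $3 = tag
  printf '%s/%s:%s\n' "$2" "${1##*/}" "$3"
}

# Usage (requires a Docker daemon and push access to the target repository):
#   for image in apache/beam_python3.7_sdk apache/beam_go_sdk; do
#     target="$(retag_ref "$image" "gcr.io/my-gcp-project" "2.26.0-custom")"
#     docker tag "$image" "$target" && docker push "$target"
#   done
```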

Additional build parameters

The docker Gradle task defines a default image repository and tag. The default repository is the Docker Hub apache namespace, and the default tag is the SDK version defined in gradle.properties.

You can specify a different repository or tag for built images by providing parameters to the build task. For example:

./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root="example-repo" -Pdocker-tag="2.26.0-custom"

builds the Python 3.6 container and tags it as example-repo/beam_python3.6_sdk:2.26.0-custom.

Beam 2.21.0 introduced a docker-pull-licenses flag that adds licenses/notices for third party dependencies to the docker images. For example:

./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses

creates a Java 8 SDK image with appropriate licenses in /opt/apache/beam/third_party_licenses/.

By default, no licenses/notices are added to the docker images.

Modifying an existing container image to make it compatible with Apache Beam Runners

Beam offers a way to provide your own custom container image. The easiest way to build a new custom image that is compatible with Apache Beam Runners is to use a multi-stage build process. This copies over the necessary artifacts from a default Apache Beam base image to build your custom container image.

  1. Copy necessary artifacts from Apache Beam base image to your image.
# This can be any container image.
FROM python:3.8-bookworm

# Install SDK. (needed for Python SDK)
RUN pip install --no-cache-dir apache-beam[gcp]==2.52.0

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.8_sdk:2.52.0 /opt/apache/beam /opt/apache/beam

# Perform any additional customizations if desired

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

NOTE: This example assumes necessary dependencies (in this case, Python 3.8 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time. The version specified in the RUN instruction must match the version used to launch the pipeline.
Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline.

NOTE: Any additional Python dependencies should be installed in the global Python environment in the custom image.

  2. Build and push the image using Docker.
  export BASE_IMAGE="apache/beam_python3.8_sdk:2.52.0"
  export IMAGE_NAME="myremoterepo/mybeamsdk"
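  # Avoid using `latest` with custom containers to make reproducing failures easier.
  export TAG="mybeamsdk-versioned-tag"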

  # Optional - pull the base image into your local Docker daemon to ensure
  # you have the most up-to-date version of the base image locally.
  docker pull "${BASE_IMAGE}"

  docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" .
  3. If your runner is running remotely, retag the image and push the image to your repository.
docker push "${IMAGE_NAME}:${TAG}"
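
Before launching a pipeline, it can help to confirm that the image satisfies the two requirements noted above: the entrypoint is the Beam boot launcher, and the Python runtime matches the one used to launch the pipeline. This is a minimal sketch with hypothetical helper names (major_minor, is_beam_entrypoint, same_runtime), and it assumes python is on the PATH both locally and inside the image.

```shell
# Reduce a version string such as "Python 3.8.18" or "3.8.10" to "3.8".
major_minor() {
  printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed when the image entrypoint (rendered as JSON) is the Beam boot launcher.
is_beam_entrypoint() {
  [ "$1" = '["/opt/apache/beam/boot"]' ]
}

# Succeed when two version strings share the same major.minor runtime version.
same_runtime() {
  [ -n "$(major_minor "$1")" ] && [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Usage (requires a Docker daemon; IMAGE_NAME and TAG as above):
#   entrypoint="$(docker inspect --format '{{json .Config.Entrypoint}}' "${IMAGE_NAME}:${TAG}")"
#   is_beam_entrypoint "$entrypoint" || echo "unexpected entrypoint: $entrypoint"
#   image_py="$(docker run --rm --entrypoint python "${IMAGE_NAME}:${TAG}" --version)"
#   same_runtime "$image_py" "$(python --version)" || echo "Python versions differ"
```

Only major.minor is compared, on the assumption that patch-level differences in the runtime are tolerable. The Beam SDK version itself should match exactly, as stated above.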

Building a compatible container image from scratch (Go)

From the 2.55.0 release, the Beam Go SDK has moved to using distroless images as a base. These images have a reduced security attack surface by not including common tools and utilities. This may cause difficulties when customizing the image using one of the above approaches. As a fallback, it’s possible to build a custom image from scratch by building a matching boot loader and setting that as the container’s entry point.

For example, if it’s preferable to use alpine as the container OS, your multi-stage Dockerfile might look like the following:

FROM golang:alpine AS build_base

# Set the Current Working Directory inside the container
WORKDIR /tmp/beam

# Build the Beam Go bootloader, to the local directory, matching your Beam version.
# Similar go targets exist for other SDK languages.
RUN GOBIN=`pwd` go install github.com/apache/beam/sdks/v2/go/container@v2.53.0

# Set the real base image.
FROM alpine:3.9
RUN apk add ca-certificates

# The following are required for the container to operate correctly.
# Copy the boot loader `container` to the image.
COPY --from=build_base /tmp/beam/container /opt/apache/beam/boot

# Set the container to use the newly built boot loader.
ENTRYPOINT ["/opt/apache/beam/boot"]

Build and push the new image as described above for modifying an existing base image.

NOTE: Java and Python require additional dependencies, such as their runtimes, and SDK packages for a valid container image. The bootloader isn’t sufficient for creating a custom container for these SDKs.

Running pipelines with custom container images

The common method for providing a container image requires using the PortableRunner flag --environment_config, as supported by the Portable Runner or by runners that support PortableRunner flags. Other runners, such as Dataflow, support specifying containers with different flags.

export IMAGE="my-repo/beam_python_sdk_custom"
export TAG="X.Y.Z"
export IMAGE_URL="${IMAGE}:${TAG}"

python -m apache_beam.examples.wordcount \
--input=/path/to/inputfile \
--output /path/to/write/counts \
--runner=PortableRunner \
--job_endpoint=embed \
--environment_type="DOCKER" \
--environment_config="${IMAGE_URL}"

export IMAGE="my-repo/beam_python_sdk_custom"
export TAG="X.Y.Z"
export IMAGE_URL="${IMAGE}:${TAG}"

# Run a pipeline using the SparkRunner which starts the Spark job server.
# When running batch jobs locally, we need to reuse the container,
# hence the --environment_cache_millis flag.
python -m apache_beam.examples.wordcount \
--input=/path/to/inputfile \
--output=/path/to/write/counts \
--runner=SparkRunner \
--environment_cache_millis=10000 \
--environment_type="DOCKER" \
--environment_config="${IMAGE_URL}"

export GCS_PATH="gs://my-gcs-bucket"
export GCP_PROJECT="my-gcp-project"
export REGION="us-central1"

# By default, the Dataflow runner has access to the GCR images
# under the same project.
export IMAGE="my-repo/beam_python_sdk_custom"
export TAG="X.Y.Z"
export IMAGE_URL="${IMAGE}:${TAG}"

# Run a pipeline on Dataflow.
# This is a Python batch pipeline, so to run on Dataflow Runner V2
# you must specify the experiment "use_runner_v2"

python -m apache_beam.examples.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output "${GCS_PATH}/counts" \
  --runner DataflowRunner \
  --project $GCP_PROJECT \
  --region $REGION \
  --temp_location "${GCS_PATH}/tmp/" \
  --experiment=use_runner_v2 \
  --sdk_container_image=$IMAGE_URL

Avoid using the tag :latest with your custom images. Tag your builds with a date or a unique identifier. If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration and allow for an inspection of changes.
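
One way to follow this advice is to derive the tag from the SDK version, a date, and a short identifier. This is a minimal sketch: make_image_tag and is_pinned_tag are hypothetical helpers, and using a git commit as the identifier is an assumption about your build setup.

```shell
# Build a tag of the form <sdk-version>-<date>-<short id>.
make_image_tag() {
  # $1 = Beam SDK version, $2 = date stamp, $3 = short identifier (e.g. a git commit)
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Reject empty tags and the floating "latest" tag.
is_pinned_tag() {
  case "$1" in
    ''|latest) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage (assumes the build runs inside a git checkout):
#   TAG="$(make_image_tag "2.52.0" "$(date -u +%Y%m%d)" "$(git rev-parse --short HEAD)")"
#   is_pinned_tag "$TAG" || { echo "refusing to use tag '$TAG'" >&2; exit 1; }
```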

Troubleshooting

The following section describes some common issues to consider when you encounter unexpected errors running Beam pipelines with custom containers.