Managing Python Pipeline Dependencies

Dependency management is about specifying dependencies that your pipeline requires, and controlling which dependencies are used in production.

Note: Remote workers used for pipeline execution typically have a standard Python distribution installed in a Debian-based container image. If your code relies only on standard Python packages, you probably don’t need to do anything on this page.

PyPI Dependencies

If your pipeline uses public packages from the Python Package Index, you must make these packages available remotely on the workers.

For pipelines that consist only of a single Python file or a notebook, the most straightforward way to supply dependencies is to provide a requirements.txt file. For more complex scenarios, define the pipeline in a package and consider installing your dependencies in a custom container.

To supply a requirements.txt file:

  1. Find out which packages are installed on your machine. Run the following command:

     pip freeze > requirements.txt
    

    This command creates a requirements.txt file that lists all packages that are installed on your machine, regardless of where they were installed from.

  2. Edit the requirements.txt file and delete all packages that are not relevant to your code.

  3. Run your pipeline with the following command-line option:

     --requirements_file requirements.txt
    

    The runner will use the requirements.txt file to install your additional dependencies onto the remote workers.

NOTE: As an alternative to pip freeze, use a library like pip-tools to compile all of the dependencies required for the pipeline from a requirements.in file that lists only the top-level dependencies.
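
For example, a minimal sketch of that workflow, assuming pip-tools is installed in the launch environment and requirements.in lists only your top-level packages:

    # Install pip-tools in the launch environment
    pip install pip-tools
    # Compile requirements.in into a fully pinned requirements.txt
    pip-compile requirements.in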

When you supply the --requirements_file pipeline option, Beam downloads the specified packages into a local requirements cache directory during pipeline submission and then stages that directory to the runner. At runtime, when the cache is available, Beam installs the packages from it, so the runner workers might not need a connection to PyPI. To disable staging the requirements, use the --requirements_cache=skip pipeline option. For more information, see the help descriptions of these pipeline options.
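
For example, a hypothetical submission command might look like the following sketch; the pipeline module, runner, and cache directory are illustrative.

    # Hypothetical submission command; my_pipeline, the runner, and the cache
    # directory are illustrative
    python -m my_pipeline \
      --runner DataflowRunner \
      --requirements_file requirements.txt \
      --requirements_cache /tmp/beam_requirements_cache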

Custom Containers

You can pass a container image with all the dependencies that are needed for the pipeline. Follow the instructions that show how to run the pipeline with custom container images.

  1. If you are using a custom container image, we recommend that you install the dependencies from the --requirements_file directly into your image at build time. In this case, you don’t need to pass the --requirements_file option at runtime, which reduces the pipeline startup time.

    # Add these lines with the path to the requirements.txt to the Dockerfile
    COPY <path to requirements.txt> /tmp/requirements.txt
    RUN python -m pip install -r /tmp/requirements.txt
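
    A complete Dockerfile might look like the following minimal sketch; the base image tag and the requirements.txt location are illustrative, so pick the SDK image that matches your Beam and Python versions.

      # Minimal sketch of a custom container image; the base image tag is illustrative
      FROM apache/beam_python3.10_sdk:2.48.0
      COPY requirements.txt /tmp/requirements.txt
      RUN python -m pip install -r /tmp/requirements.txt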
    

Local Python Packages or Non-public Python Dependencies

If your pipeline uses packages that are not available publicly (e.g. packages that you’ve downloaded from a GitHub repo), make these packages available remotely by performing the following steps:

  1. Identify which packages are installed on your machine and are not public. Run the following command:

    pip freeze

    This command lists all packages that are installed on your machine, regardless of where they were installed from.

  2. Run your pipeline with the following command-line option:

     --extra_package /path/to/package/package-name

    where package-name is the package’s tarball. You can build the package tarball using a command-line tool called build.

      # Install build using pip
      pip install --upgrade build
      python -m build --sdist

    See the build documentation for more details on this command.
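
    The build command writes the source distribution tarball to the dist/ directory. A hypothetical submission that passes the resulting tarball might look like the following sketch; the pipeline module, runner, and tarball name are illustrative.

      # Hypothetical submission command; my_pipeline, the runner, and the
      # tarball name are illustrative
      python -m my_pipeline \
        --runner DataflowRunner \
        --extra_package dist/my-package-1.0.0.tar.gz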

Multiple File Dependencies

Often, your pipeline code spans multiple files. To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:

  1. Create a setup.py file for your project. The following is a very basic setup.py file.

     import setuptools
    
     setuptools.setup(
        name='PACKAGE-NAME',
        version='PACKAGE-VERSION',
        install_requires=[
          # List Python packages your pipeline depends on.
        ],
        packages=setuptools.find_packages(),
     )
    
  2. Structure your project so that the root directory contains the setup.py file, the main workflow file, and a directory with the rest of the files, for example:

     root_dir/
       setup.py
       main.py
       my_package/
         my_pipeline_launcher.py
         my_custom_dofns_and_transforms.py
         other_utils_and_helpers.py
    

    See Juliaset for an example that follows this project structure.

  3. Install your package in the submission environment, for example by using the following command:

     pip install -e .
    
  4. Run your pipeline with the following command-line option:

     --setup_file /path/to/setup.py
    

Note: It is not necessary to supply the --requirements_file option if the dependencies of your package are defined in the install_requires field of the setup.py file (see step 1). However, unlike with the --requirements_file option, when you use the --setup_file option, Beam doesn’t stage the dependent packages to the runner. Only the pipeline package is staged. If the package dependencies aren’t already provided in the runtime environment, they are installed from PyPI at runtime.
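
For example, a submission that uses the project structure above might look like the following sketch; the runner and the option values are illustrative.

    # Hypothetical submission command; the runner and paths are illustrative
    python main.py \
      --runner DataflowRunner \
      --setup_file ./setup.py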

Non-Python Dependencies or PyPI Dependencies with Non-Python Dependencies

If your pipeline uses non-Python packages, such as packages that require installation using the apt install command, or uses a PyPI package that depends on non-Python dependencies during package installation, we recommend installing them using a custom container. Otherwise, you must perform the following steps.

  1. Structure your pipeline as a package.

  2. Add the required installation commands for the non-Python dependencies, such as the apt install commands, to the list of CUSTOM_COMMANDS in your setup.py file. See the Juliaset setup.py file for an example.

    Note: You must verify that these commands run on the remote worker. For example, if you use apt, the remote worker needs apt support.

  3. Run your pipeline with the following command-line option:

     --setup_file /path/to/setup.py
    

Note: Because custom commands execute after the dependencies for your workflow are installed (by pip), you should omit the PyPI package dependency from the pipeline’s requirements.txt file and from the install_requires parameter in the setuptools.setup() call of your setup.py file.
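
The following is a minimal sketch of such a setup.py file, loosely modeled on the Juliaset example; the apt packages, class names, and placeholders are illustrative.

    import subprocess

    import setuptools

    # The Juliaset example hooks into the build step via distutils.
    from distutils.command.build import build as _build

    # Illustrative non-Python dependencies installed with apt-get on the worker.
    CUSTOM_COMMANDS = [
        ['apt-get', 'update'],
        ['apt-get', '--assume-yes', 'install', 'libsnappy-dev'],
    ]


    class build(_build):
        """Adds the custom commands to the build step."""
        sub_commands = _build.sub_commands + [('CustomCommands', None)]


    class CustomCommands(setuptools.Command):
        """Runs each command in CUSTOM_COMMANDS when the package is installed."""

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            for command in CUSTOM_COMMANDS:
                # Raises CalledProcessError if a command fails on the worker.
                subprocess.run(command, check=True)


    setuptools.setup(
        name='PACKAGE-NAME',
        version='PACKAGE-VERSION',
        install_requires=[
            # List Python packages your pipeline depends on.
        ],
        packages=setuptools.find_packages(),
        cmdclass={
            'build': build,
            'CustomCommands': CustomCommands,
        },
    )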

Pre-building SDK Container Image

In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via --requirements_file and other runtime options) are installed into the containers at runtime, which can increase worker startup time. However, it may be possible to pre-build the SDK containers with the --prebuild_sdk_container_engine option and perform the dependency installation once, before the workers start. For instructions on how to use pre-building with Google Cloud Dataflow, see Pre-building the python SDK custom container image with extra dependencies.

NOTE: This feature is available only for the Dataflow Runner v2.
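
For example, a hypothetical Dataflow submission that pre-builds the SDK container with Cloud Build might look like the following sketch; the pipeline module, registry URL, and other values are illustrative, and the exact options are described in the documentation linked above.

    # Hypothetical submission command; my_pipeline, the runner options, and the
    # registry URL are illustrative
    python -m my_pipeline \
      --runner DataflowRunner \
      --requirements_file requirements.txt \
      --prebuild_sdk_container_engine cloud_build \
      --docker_registry_push_url us-central1-docker.pkg.dev/my-project/my-repo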

Pickling and Managing the Main Session

When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, are serialized (or pickled) into a byte stream using libraries that perform the serialization (also called picklers). The default pickler library used by Beam is dill. To use the cloudpickle pickler, supply the --pickle_library=cloudpickle pipeline option. The cloudpickle support is currently experimental.

By default, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job. Thus, you might encounter an unexpected NameError when running a DoFn on a remote runner. To resolve this, supply the main session content with the pipeline by setting the --save_main_session pipeline option. This loads the pickled state of the global namespace onto the remote workers (for example, onto Dataflow workers if you use the DataflowRunner). See Handling NameErrors for how to set the main session on the DataflowRunner.

Managing the main session in the Python SDK is necessary only when you use the dill pickler on a remote runner. Therefore, this issue does not occur with the DirectRunner.
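
The following is a minimal sketch that illustrates the issue and the fix; the DoFn and pipeline contents are illustrative.

    import re  # A module-level import referenced inside the DoFn below.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class ExtractWordsFn(beam.DoFn):
        def process(self, element):
            # With the dill pickler on a remote runner, this reference to the
            # module-level `re` import raises a NameError unless the main
            # session is saved.
            return re.findall(r"[A-Za-z']+", element)


    # Equivalent to passing the --save_main_session pipeline option.
    options = PipelineOptions(save_main_session=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.Create(['a dependable pipeline'])
            | beam.ParDo(ExtractWordsFn())
            | beam.Map(print)
        )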

Because serialization of the pipeline happens at job submission and deserialization happens at runtime, the same version of the pickling library must be used at job submission and at runtime. To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If, for whatever reason, you cannot use the version of dill or cloudpickle required by Beam and choose to install a custom version, you must also ensure that the same custom version is used at runtime (for example, in your custom container, or by specifying a pipeline dependency requirement).

Control the dependencies the pipeline uses

Pipeline environments

To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a runner-independent representation and submits it for execution. Translation happens in the launch environment. You can launch the pipeline from a Python virtual environment with the installed Beam SDK, or with tools like Dataflow Flex Templates, Notebook environments, Apache Airflow, and more.

The runtime environment is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs when it performs data processing. The runtime environment includes Apache Beam and the pipeline’s runtime dependencies.

Create reproducible environments

You can use several tools, such as requirements files, constraint files, and lock files, to build reproducible Python environments.

Use version control for the configuration files that define the environment.

Make the pipeline runtime environment reproducible

When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use the same dependencies each time the pipeline runs. A reproducible environment is immune to side effects caused by new releases of the pipeline’s direct or transitive dependencies, and it doesn’t require dependency resolution at runtime.

You can create a reproducible runtime environment in several ways, for example by building a custom container image that contains all of the pipeline’s dependencies, or by staging fully pinned dependencies together with the pipeline.

A self-contained runtime environment is usually reproducible. To check if the runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the --no_use_public_ips pipeline option.

If you need to recreate or upgrade the runtime environment, do so in a controlled way, with visibility into the dependencies that changed, for example by rebuilding the environment from version-controlled configuration files and reviewing the resulting changes in installed packages.

Make the pipeline launch environment reproducible

The launch environment runs the production version of the pipeline. While developing the pipeline locally, you might use a development environment that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the development environment.

To reduce side effects on pipeline submissions, it is best to be able to recreate the launch environment in a reproducible manner.

Dataflow Flex Templates provide an example of a containerized, reproducible launch environment.

To create reproducible installations of Beam into a clean virtual environment, use the requirements files that list all Python dependencies included in Beam’s default container images as constraint files:

BEAM_VERSION=2.48.0
PYTHON_VERSION=`python -c "import sys; print(f'{sys.version_info.major}{sys.version_info.minor}')"`
pip install apache-beam==$BEAM_VERSION --constraint https://raw.githubusercontent.com/apache/beam/release-${BEAM_VERSION}/sdks/python/container/py${PYTHON_VERSION}/base_image_requirements.txt

Use a constraint file to ensure that Beam dependencies in the launch environment match the versions in default Beam containers. A constraint file might also remove the need for dependency resolution at installation time.

Make the launch environment compatible with the runtime environment

The launch environment translates the pipeline graph into a runner-independent representation. This process involves serializing (or pickling) the code of the transforms. The serialized content is deserialized on the workers. If the runtime worker environment significantly differs from the launch environment, runtime errors might occur, for example because the two environments use different Python or pickling library versions, or because objects serialized at submission depend on packages that are missing or have different versions at runtime.

To check whether the runtime environment matches the launch environment, inspect differences in the pip freeze output in both environments. Update to the latest version of Beam, because environment compatibility checks are included in newer SDK versions.
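
For example, you can capture and compare the installed packages in both environments; the file names are illustrative.

    # In the launch environment
    pip freeze | sort > launch_env_deps.txt
    # In the runtime environment (for example, inside the runtime container image)
    pip freeze | sort > runtime_env_deps.txt
    # Compare the two lists
    diff launch_env_deps.txt runtime_env_deps.txt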

Finally, you can use the same environment by launching the pipeline from the containerized environment that you use at runtime. Dataflow Flex templates built from a custom container image offer this setup. In this scenario, you can recreate both launch and runtime environments in a reproducible manner. Because both containers are created from the same image, the launch and runtime environments are compatible with each other by default.