Apache Beam Java SDK Quickstart
If you’re interested in contributing to the Apache Beam Java codebase, see the Contribution Guide.
Set up your Development Environment
Optional: Install Gradle if you would like to convert your Maven project into Gradle.
Get the WordCount Code
The easiest way to get a copy of the WordCount pipeline is to use the following command to generate a simple Maven project that contains Beam’s WordCount examples and builds against the most recent Beam release:
PS> mvn archetype:generate `
     -D archetypeGroupId=org.apache.beam `
     -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples `
     -D archetypeVersion=2.28.0 `
     -D groupId=org.example `
     -D artifactId=word-count-beam `
     -D version="0.1" `
     -D package=org.apache.beam.examples `
     -D interactiveMode=false
This will create a directory word-count-beam that contains a simple pom.xml and a series of example pipelines that count words in text files.
PS> cd .\word-count-beam
PS> dir
...
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/19/2018  11:00 PM                src
-a----        7/19/2018  11:00 PM          16051 pom.xml
PS> dir .\src\main\java\org\apache\beam\examples
...
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/19/2018  11:00 PM                common
d-----        7/19/2018  11:00 PM                complete
d-----        7/19/2018  11:00 PM                subprocess
-a----        7/19/2018  11:00 PM           7073 DebuggingWordCount.java
-a----        7/19/2018  11:00 PM           5945 MinimalWordCount.java
-a----        7/19/2018  11:00 PM           9490 WindowedWordCount.java
-a----        7/19/2018  11:00 PM           7662 WordCount.java
For a detailed introduction to the Beam concepts used in these examples, see the WordCount Example Walkthrough. Here, we’ll just focus on executing
Optional: Convert from Maven to Gradle Project
Ensure you are in the same directory as the pom.xml file generated from the previous step. Automatically convert your project from Maven to Gradle by running:
You’ll be asked if you want to generate a Gradle build. Enter yes. You’ll also be prompted to choose a DSL (Groovy or Kotlin). This tutorial uses Groovy, so select that if you don’t have a preference.
After you have converted the project to Gradle:
- In the generated build.gradle file, in the
- Add the following task in build.gradle to allow you to execute pipelines with Gradle:
- Rebuild your project by running:
Run a pipeline
A single Beam pipeline can run on multiple Beam runners, including the FlinkRunner, SparkRunner, NemoRunner, JetRunner, and DataflowRunner. The DirectRunner is a common runner for getting started, as it runs locally on your machine and requires no specific setup. If you’re just trying out Beam and you’re not sure what to use, use the DirectRunner.
The general process for running a pipeline goes like this:
- Ensure you’ve done any runner-specific setup.
- Build your command line:
  - Specify a runner with --runner=<runner> (defaults to the DirectRunner).
  - Add any runner-specific required options.
  - Choose input files and an output location that are accessible to the runner. (For example, you can’t access a local file if you are running the pipeline on an external cluster.)
- Run the command.
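The command-line step above can be pictured with a small, self-contained Java sketch. It only illustrates how arguments of the form --name=value map to options and how the runner falls back to the DirectRunner when --runner is omitted. It is not Beam’s own option parsing, and the class name RunnerArgsSketch and its parse method are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a hand-rolled parser for "--name=value" arguments.
// This is NOT Beam's option-parsing code; all names here are hypothetical.
final class RunnerArgsSketch {

    // Parses arguments of the form --name=value into an insertion-ordered map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 0) {
                throw new IllegalArgumentException("Expected --name=value but got: " + arg);
            }
            // Strip the leading "--" from the name; keep everything after the first '='.
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        // The quickstart states that the runner defaults to the DirectRunner.
        options.putIfAbsent("runner", "DirectRunner");
        return options;
    }

    public static void main(String[] args) {
        // No --runner is given here, so the sketch falls back to DirectRunner.
        Map<String, String> options = parse(new String[] {
            "--inputFile=pom.xml", "--output=counts"
        });
        System.out.println(options);
    }
}
```

Passing --runner=FlinkRunner in the argument array would override the default, which mirrors how the Maven commands in this quickstart pick a runner inside exec.args.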
To run the WordCount pipeline, see the Maven and Gradle examples below.
Run WordCount Using Maven
For Unix shells:
$ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=target/word-count-beam-bundled-0.1.jar \
                  --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner
You can monitor the running job by visiting the Flink dashboard at http://<flink master>:8081
Make sure you complete the setup steps at /documentation/runners/dataflow/#setup
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --region=<your-gcp-region> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
For Windows PowerShell:
PS> mvn package exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
     -D exec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=.\target\word-count-beam-bundled-0.1.jar `
                   --inputFile=C:\path\to\quickstart\pom.xml --output=C:\tmp\counts" -P flink-runner
You can monitor the running job by visiting the Flink dashboard at http://<flink master>:8081
Make sure you complete the setup steps at /documentation/runners/dataflow/#setup
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
     -D exec.args="--runner=DataflowRunner --project=<your-gcp-project> `
                   --region=<your-gcp-region> `
                   --gcpTempLocation=gs://<your-gcs-bucket>/tmp `
                   --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" `
     -P dataflow-runner
Run WordCount Using Gradle
For Unix shells (Instructions currently only available for Direct, Spark, and Dataflow):
Inspect the results
Once the pipeline has completed, you can view the output. You’ll notice that there may be multiple output files prefixed by counts, the name passed to --output in the commands above. The exact number of these files is decided by the runner, giving it the flexibility to do efficient, distributed execution.
When you look into the contents of the files, you’ll see that each contains unique words and the number of occurrences of each word. The order of elements within a file may differ from run to run because the Beam model does not generally guarantee ordering, again to allow runners to optimize for efficiency.
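Because the results may be spread over several files in no particular order, it can help to combine them before reading. The following is a minimal sketch in plain Java (no Beam dependency) that collects every file whose name starts with a given prefix and returns the entries sorted by word. It assumes each output line has the form word: count and that the files are named like counts-00000-of-00002. Both are assumptions about the example’s output, and the class name CountsMergerSketch is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: merges sharded word-count output files into one sorted map.
// Assumes lines look like "word: count"; this format is an assumption of the sketch.
final class CountsMergerSketch {

    static Map<String, Long> merge(Path dir, String prefix) throws IOException {
        // TreeMap keeps words in sorted order, independent of file or line order.
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> shards = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path shard : shards) {
                for (String line : Files.readAllLines(shard)) {
                    int sep = line.lastIndexOf(": ");
                    if (sep < 0) {
                        continue; // skip blank or unexpected lines
                    }
                    String word = line.substring(0, sep);
                    long count = Long.parseLong(line.substring(sep + 2).trim());
                    // Summing is a safe default even if a word shows up in more than one file.
                    totals.merge(word, count, Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Create two small stand-in shard files in a temporary directory.
        Path dir = Files.createTempDirectory("wordcount-demo");
        Files.write(dir.resolve("counts-00000-of-00002"), Arrays.asList("beam: 2", "apache: 1"));
        Files.write(dir.resolve("counts-00001-of-00002"), Arrays.asList("maven: 3"));
        System.out.println(merge(dir, "counts"));
    }
}
```

Sorting happens only in this post-processing step; the pipeline itself is still free to write its output in any order and in any number of files.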
- Learn more about the Beam SDK for Java and look through the Java SDK API reference.
- Walk through these WordCount examples in the WordCount Example Walkthrough.
- Take a self-paced tour through our Learning Resources.
- Dive into some of our favorite Videos and Podcasts.
- Join the Beam users@ mailing list.
Please don’t hesitate to reach out if you encounter any issues!
Last updated on 2021/01/29