A Quick Demo of Apache Beam with Docker

Apache Beam is a unified programming model for creating batch and stream data processing pipelines. Simplifying a bit, it provides a Java SDK that we can use to develop analytics pipelines, such as counting users or events, extracting trending topics, or analyzing user sessions.

In this post, we are going to see how to launch a demo of Apache Beam in minutes, thanks to a Docker image with Apache Flink and Beam pre-packaged and ready to use. As a bonus, the repo used to create this demo is also available on GitHub, as a starting point for creating Beam pipelines.

To run this demo, we need Docker and Docker Compose. To build Beam pipelines, we also need Java and Maven.
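We can quickly check that everything is installed from a terminal (the exact versions may differ):

    docker --version
    docker-compose --version
    java -version
    mvn -version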

Deploy Flink & Beam with Docker

Let's get started and deploy a Flink cluster with Docker Compose.

First get the latest repo — in fact, all we need is the docker-compose.yml file:
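A sketch of this step (the GitHub user and repo name below are placeholders; use the actual repo linked above):

    # Clone the demo repo, or just download its docker-compose.yml
    git clone https://github.com/<user>/<docker-flink-beam-repo>.git
    cd <docker-flink-beam-repo>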

Run the cluster (we can also scale it up or down):
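A minimal sketch, assuming the Compose file defines a job manager and a task manager service (the service name "taskmanager" is an assumption):

    # Start the cluster in the background
    docker-compose up -d

    # Scale the number of task managers up or down
    # (newer Compose versions use: docker-compose up -d --scale taskmanager=2)
    docker-compose scale taskmanager=2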

That's it! We now have a running Flink cluster:
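We can verify that the containers are up:

    # List the containers of this Compose project and their status
    docker-compose ps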

The launch script also uploads a custom JAR with Beam pre-packaged and ready to use. For more technical details on the cluster, refer to the repo.

Run HelloWo — ehm, WordCount

Let's now run our first Beam pipeline.

Open the Flink web UI, which is exposed on port 48080. For example, on Mac or Windows:
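A sketch, assuming Docker runs in a Docker Machine VM named default (an assumption); with a native Docker install, http://localhost:48080 works instead:

    # Find the VM's IP and open the Flink web UI
    # (on Windows, paste the resulting URL into a browser)
    docker-machine ip default
    open "http://$(docker-machine ip default):48080"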

Then follow these steps:

  1. Click “Submit new Job” in the left menu — we'll see beam-starter-0.1.jar pre-uploaded
  2. Tick the checkbox next to beam-starter-0.1.jar
  3. Click “Submit” (or “Show Plan”). No additional parameters are needed.

Congratulations, we have now run our first Beam pipeline!

Here's what happened under the hood. The JAR is built from this starter repo. By default it runs the class WordCount, with input file /tmp/kinglear.txt and output file /tmp/output.txt. The input file is also pre-loaded in the Docker image, so all Flink task managers can read it.

We can check the result of WordCount by connecting to a task manager and looking at the content of the output file. Note that the output file names start with /tmp/output.txt, as multiple files may be created depending on the pipeline:
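For example, assuming the task manager container is named flink_taskmanager_1 (the exact name depends on the Compose project):

    # Find the task manager container name
    docker ps --filter name=taskmanager --format '{{.Names}}'

    # Print the output file(s) inside the container (container name is an assumption)
    docker exec flink_taskmanager_1 sh -c 'cat /tmp/output.txt*'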

Build a Beam Pipeline

Finally, let's look at how to build our own JAR file.

The repo used for this WordCount demo is based on the Beam documentation for the Flink runner, with minor changes to work around a few inaccuracies. This repo is a good starting point for building any Beam pipeline.

Just clone the starter repo and build it:
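A sketch of the clone-and-build step (the GitHub user is a placeholder; the repo name beam-starter is assumed from the JAR name):

    # Clone the starter repo (placeholder URL)
    git clone https://github.com/<user>/beam-starter.git
    cd beam-starter

    # Build the pipeline JAR with Maven
    mvn clean package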

The last command will create a JAR file within the target/ directory. We can upload the JAR via the Flink web UI and run it as we saw above.

From here, it should be easy to tweak the file WordCount.java and create other Beam pipelines.

As a concrete example, I've used Beam to analyze trending topics on Twitter during the Oscars. And you? Are you planning a project with Beam, perhaps by using this starter repo or Docker image? I'd love to hear about it!
