“Apache Beam and its abstraction of the execution engines is a big thing for us. The amount of work that that saves...it would be hard to build that support for Dataflow or Spark all by yourself. It is amazing that this technology exists in the first place, really amazing! Not having to worry about all those underlying platforms - that is tremendous!”
Visual Apache Beam Pipeline Design and Orchestration with Apache Hop
Background
Apache Hop is an open source data orchestration and data engineering platform that aims to facilitate all aspects of data processing with a visual pipeline development environment. This easy-to-use, fast, and flexible platform enables developers to create and manage Apache Beam batch and streaming pipelines in the Hop GUI. Apache Hop uses metadata and a kernel to describe how the data should be processed, and Apache Beam to “design once, run anywhere”.
Neo4j’s Chief Solutions Architect, Matt Casters, has been an early adopter of Apache Beam and its abstraction of execution engines. Matt has been an active member of the Apache open-source community for years and has leveraged Apache Beam as an execution engine to build Apache Hop.
Apache Hop Project
The thriving popularity and growing number of Apache Beam users across the globe inspired Matt Casters to expand the idea of abstraction to visual pipeline lifecycle management and development. Matt co-founded and incubated the Apache Hop project, which became a top-level project at the Apache Software Foundation in December 2021. The platform enables users of all skill levels to build, test, launch, and deploy powerful data workflows without writing code. Apache Hop’s intuitive drag-and-drop interface provides a visual representation of Apache Beam pipelines, simplifying pipeline design, execution, preview, monitoring, and debugging.
I was a big fan of Beam from the get-go. Apache Beam is now a very important part of the Apache Hop project.
The Apache Hop GUI allows data professionals to work visually and focus on “what” they need to do rather than “how”, using metadata to describe how the Apache Beam pipelines should be processed. Apache Hop’s transform-agnostic action plugins (“hops”) link transforms together, creating a pipeline. Various Apache Beam runners, such as Spark, Flink, Dataflow, and the Direct runner, read the metadata with the help of Apache Hop’s Metadata Provider and workflow engines (plugins), and execute the pipeline.
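The idea of transforms linked by hops and described purely as metadata can be illustrated with a small, self-contained sketch. This is not Apache Hop's file format or Apache Beam's API: the `PIPELINE` layout, the transform types, and the `run_locally` function are all hypothetical names chosen for illustration, and the toy engine below merely stands in for a real runner.

```python
from collections import defaultdict, deque

# Hypothetical pipeline metadata: named transforms plus "hops" that link them.
# This layout is illustrative only and is not Apache Hop's file format.
PIPELINE = {
    "transforms": {
        "read": {"type": "source", "rows": [3, 1, 2]},
        "double": {"type": "map", "factor": 2},
        "collect": {"type": "sink"},
    },
    "hops": [("read", "double"), ("double", "collect")],
}


def run_locally(pipeline):
    """Toy in-memory engine: executes transforms in topological order of the hops."""
    transforms = pipeline["transforms"]
    downstream = defaultdict(list)
    incoming = {name: 0 for name in transforms}
    for src, dst in pipeline["hops"]:
        downstream[src].append(dst)
        incoming[dst] += 1

    inputs = defaultdict(list)  # rows waiting at each transform
    results = {}                # rows received by each sink
    ready = deque(name for name, count in incoming.items() if count == 0)

    while ready:
        name = ready.popleft()
        meta = transforms[name]
        if meta["type"] == "source":
            rows = list(meta["rows"])
        elif meta["type"] == "map":
            rows = [row * meta["factor"] for row in inputs[name]]
        elif meta["type"] == "sink":
            rows = inputs[name]
            results[name] = list(rows)
        else:
            raise ValueError(f"unknown transform type: {meta['type']}")
        # Pass the produced rows along every outgoing hop.
        for nxt in downstream[name]:
            inputs[nxt].extend(rows)
            incoming[nxt] -= 1
            if incoming[nxt] == 0:
                ready.append(nxt)
    return results


print(run_locally(PIPELINE))  # {'collect': [6, 2, 4]}
```

The point of the sketch is that the pipeline itself contains no executable logic: changing what the pipeline does means editing the metadata, while the engine that interprets it stays untouched.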
Apache Hop’s custom plugins and metadata objects for some of the most popular technologies, such as Neo4j, empower users to execute database- and technology-specific transforms inside Apache Beam pipelines, which allows for native optimized connectivity and flexible Apache Beam pipeline configurations. For instance, Apache Hop’s Neo4j plugin stores the logging and execution lineage of Apache Beam pipelines in the Neo4j graph database and lets users query this information for more details, for example to jump quickly to the place where an error occurred. The combination of Apache Hop transforms, Apache Beam built-in I/Os, and Apache Beam-powered data processing opens up new horizons for more sinks and sources and custom use cases.
Apache Hop aims to bring a no-code approach to Apache Beam data pipelines. Sometimes the choice of a particular programming language, framework, or engine is driven by developers’ preferences, which results in businesses becoming tied to a specific technology skill set and stack. Apache Hop eliminates this dependency by abstracting out the I/Os with fully pluggable runtime support and providing a graphical user interface on top of Apache Beam pipelines. All settings for pipeline elements are configured just once in Hop’s visual editor, and the pipeline is automatically described as metadata in JSON and CSV formats. Writing source code for data pipelines becomes an option, not a necessity. Apache Hop does not require knowledge of a particular programming language to create pipelines, which helps with the adoption of Apache Beam’s unified streaming and batch processing technology.
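A minimal sketch of what “configure once, choose the engine later” can look like, assuming a made-up JSON description and two toy engines. None of these names (`PIPELINE_JSON`, `OPS`, `ENGINES`, `run`) come from Apache Hop or Apache Beam; they are hypothetical and only illustrate that the same saved metadata can be handed to interchangeable execution back ends.

```python
import json

# Hypothetical pipeline description, as a visual editor might save it.
# The schema is invented for this example and is not Apache Hop's format.
PIPELINE_JSON = """
{
  "name": "uppercase-names",
  "input": ["ada", "grace"],
  "steps": [{"op": "upper"}, {"op": "prefix", "value": "Dr. "}]
}
"""

# Each operation is looked up by the name stored in the metadata.
OPS = {
    "upper": lambda row, step: row.upper(),
    "prefix": lambda row, step: step["value"] + row,
}


def apply_steps(row, steps):
    """Apply every configured step to a single row, in order."""
    for step in steps:
        row = OPS[step["op"]](row, step)
    return row


def eager_engine(meta):
    """Toy engine that materializes all rows at once."""
    return [apply_steps(row, meta["steps"]) for row in meta["input"]]


def lazy_engine(meta):
    """Toy engine that yields rows one at a time."""
    return (apply_steps(row, meta["steps"]) for row in meta["input"])


ENGINES = {"eager": eager_engine, "lazy": lazy_engine}


def run(pipeline_json, engine):
    """Parse the saved metadata and execute it on the engine chosen by name."""
    meta = json.loads(pipeline_json)
    return list(ENGINES[engine](meta))


print(run(PIPELINE_JSON, "eager"))  # ['Dr. ADA', 'Dr. GRACE']
print(run(PIPELINE_JSON, "lazy"))   # ['Dr. ADA', 'Dr. GRACE']
```

Switching engines here is a one-word change in the call to `run`, with no change to the pipeline description, which mirrors the dependency the paragraph above says a metadata-driven design is meant to remove.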
In general, a visual pipeline design interface is really valuable for a non-developer audience… We categorically choose the side of the organization when it comes to lowering setup costs, maintenance costs, increasing ROI, and safeguarding an investment over time.
Results
Apache Beam continuously expands the number of use cases and scenarios it supports and makes it possible to turn advanced technology solutions into reality. As an early adopter of Apache Beam and its powerful abstraction, Matt Casters leveraged this knowledge and experience to create Apache Hop. The platform adds value for Apache Beam users by enabling visual pipeline development and lifecycle management.
Matt sees Apache Beam as a foundation and a driving force behind Apache Hop. Communication between the Apache Beam and Apache Hop projects keeps fostering co-creation and enriches both products with new features.
The Apache Hop project is an example of continuous improvement driven by the Apache open source community and amplified by collaborative organizations.
Knowledge sharing and collaboration is something that comes naturally in the community. If we see some room for improvement, we exchange ideas and this way, we keep driving Apache Beam and Apache Hop projects forward. Together, we can work with the most complex problems and just solve them.