Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Migrating Apache Oozie workflows to Apache Airflow

Feng Lu (Google Cloud), James Malone (Google), Apurva Desai (Google Cloud), Cameron Moberg (Truman State University | Google Cloud)
16:35–17:15 Thursday, 2 May 2019
Average rating: ****.
(4.00, 3 ratings)

Who is this presentation for?

  • Data engineers

Level

Intermediate

Prerequisite knowledge

  • Familiarity with Apache Oozie and Apache Airflow

What you'll learn

  • Explore Apache Oozie and Apache Airflow workflow specifications
  • Understand the design and implementation of a new OSS workflow migration tool
  • Discover ways to contribute and participate in the development

Description

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems. Oozie lets users schedule Hadoop-related jobs out of the box (Java MapReduce, Pig, Hive, Sqoop, etc.) and supports some other system-specific jobs (SSH, Java programs, shell scripts, etc.). An Oozie workflow is defined as an XML file containing, among other elements, control nodes that direct the flow of the workflow and action nodes that execute a unit of work. Oozie additionally supports subworkflows and allows workflow node properties to be parameterized and dynamically evaluated using EL functions.
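To make the workflow structure concrete, here is a minimal sketch of what such an XML file can look like and how its top-level nodes split into control and action nodes. The workflow itself (node names, the cleanup.sh script, the ${inputDir} parameter) is a made-up example, and the helper functions are illustrative only; they are not part of Oozie.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow: node names, script, and ${inputDir} are invented.
WORKFLOW_XML = """
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="clean"/>
  <action name="clean">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <exec>cleanup.sh</exec>
      <argument>${inputDir}</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
"""

CONTROL_TAGS = {"start", "end", "kill", "decision", "fork", "join"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_nodes(xml_text):
    """Return (kind, tag, name) for each top-level node, in document order."""
    result = []
    for node in ET.fromstring(xml_text):
        tag = local_name(node.tag)
        if tag in CONTROL_TAGS:
            kind = "control"
        elif tag == "action":
            kind = "action"
        else:
            kind = "other"  # e.g. parameters or global configuration
        # <start> carries no name attribute, so fall back to the tag itself.
        result.append((kind, tag, node.get("name", tag)))
    return result


for kind, tag, name in classify_nodes(WORKFLOW_XML):
    print(f"{kind:8} {tag:8} {name}")
```

The ${inputDir} argument stands in for a parameterized property; Oozie would resolve it at run time, whereas this sketch leaves it as plain text.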

In contrast, Airflow is a generic workflow orchestration system for programmatically authoring, scheduling, and monitoring workflows. A workflow (a directed acyclic graph, or DAG) is expressed in Python code using APIs provided by Airflow, such as DAG or Operator. Airflow not only supports Hadoop/Spark tasks (actions in Oozie) but also includes connectors to many other systems, such as GCP and common RDBMSs. Neither Oozie nor Airflow allows cycles in its workflows.
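The no-cycles constraint can be illustrated without either system. The sketch below uses only Python's standard graphlib module (Python 3.9+), not Airflow's own API; the task names and the execution_order helper are hypothetical. It derives a valid run order from task dependencies and rejects a workflow that contains a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return one valid run order, or raise ValueError if there is a cycle.

    `dependencies` maps each task name to the set of tasks it waits on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the list of nodes in the cycle.
        raise ValueError(f"workflow contains a cycle: {exc.args[1]}") from exc


# Hypothetical three-step workflow: each task waits on the previous one.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
print(execution_order(workflow))
```

Passing a mapping such as {"a": {"b"}, "b": {"a"}} raises ValueError, which mirrors why a cyclic workflow cannot be scheduled.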

Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as part of creating an effective cross-cloud and cross-system solution. The high-level design can be summarized as follows: Because the Oozie XML schema defines only a finite number of top-level node types (e.g., control and action), the tool converts the Oozie XML file into a collection of nodes (stored in an OrderedDictionary). It then processes these nodes in order and converts them into their corresponding Airflow representations. Based on the type of each control node (fork, join, etc.), it then retrofits the dependency relationships among the converted Airflow operators and tasks. The design is purposefully structured as a number of easily extendable modules. For example, you can extend the base ActionMapper module to support converting a new Oozie action node.
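The three steps described above (collect nodes, convert them, then wire up dependencies) might look roughly like the following. This is a simplified sketch under stated assumptions, not the tool's actual code: the ActionMapper and ShellMapper classes here are hypothetical stand-ins, tasks are represented as plain strings rather than Airflow operators, and fork and join nodes are kept as no-op placeholders purely for simplicity.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# Hypothetical workflow with one fork/join pair around two shell actions.
WORKFLOW_XML = """
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="split"/>
  <fork name="split">
    <path start="task-a"/>
    <path start="task-b"/>
  </fork>
  <action name="task-a">
    <shell xmlns="uri:oozie:shell-action:0.2"><exec>a.sh</exec></shell>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <action name="task-b">
    <shell xmlns="uri:oozie:shell-action:0.2"><exec>b.sh</exec></shell>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <join name="merge" to="end"/>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


class ActionMapper:
    """Hypothetical base mapper: describes an action it cannot translate."""

    def convert(self, name, body):
        return f"{name}: placeholder for <{local_name(body.tag)}>"


class ShellMapper(ActionMapper):
    """Extension point in use: one subclass per supported action type."""

    def convert(self, name, body):
        exec_el = next(el for el in body if local_name(el.tag) == "exec")
        return f"{name}: run shell command {exec_el.text!r}"


MAPPERS = {"shell": ShellMapper()}


def parse_nodes(xml_text):
    """Step 1: collect top-level nodes, keyed by name, in document order."""
    nodes = OrderedDict()
    for el in ET.fromstring(xml_text):
        nodes[el.get("name", local_name(el.tag))] = el
    return nodes


def downstream(el):
    """Names of the nodes this node hands control to on success."""
    tag = local_name(el.tag)
    if tag == "fork":
        return [p.get("start") for p in el if local_name(p.tag) == "path"]
    if tag == "action":
        return [c.get("to") for c in el if local_name(c.tag) == "ok"]
    if tag in ("start", "join"):
        return [el.get("to")]
    return []  # end and kill nodes have no successors


def convert(xml_text):
    nodes = parse_nodes(xml_text)
    # Step 2: convert each node, in order, into a task description.
    tasks = OrderedDict()
    for name, el in nodes.items():
        tag = local_name(el.tag)
        if tag == "action":
            body = next(c for c in el if local_name(c.tag) not in ("ok", "error"))
            mapper = MAPPERS.get(local_name(body.tag), ActionMapper())
            tasks[name] = mapper.convert(name, body)
        elif tag in ("fork", "join"):
            tasks[name] = f"{name}: no-op placeholder for <{tag}>"
    # Step 3: derive dependencies, keeping only edges between converted tasks.
    edges = [(name, target)
             for name in tasks
             for target in downstream(nodes[name])
             if target in tasks]
    return tasks, edges


tasks, edges = convert(WORKFLOW_XML)
for upstream, target in edges:
    print(f"{upstream} >> {target}")
```

Supporting another action type in this sketch means adding one ActionMapper subclass and registering it in MAPPERS; the parsing and dependency steps stay unchanged, which is the kind of modularity the design aims for.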

Feng, James, Apurva, and Cameron start with an overview of Oozie and Airflow, including a brief comparison, followed by a number of migration use cases. They then outline the Oozie-to-Airflow migration tool design, emphasizing its flexibility and extensibility, and wrap up with a quick demo and some future improvement ideas.

Photo of Feng Lu

Feng Lu

Google Cloud

Feng Lu is a software engineer at Google and the tech lead and manager for Cloud Composer. Feng has a broad interest in cloud and big data analytics. He holds a PhD from UC San Diego, where his research work was reported on by MIT Technology Review among others.

Photo of James Malone

James Malone

Google

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. He’s a big fan of open source software because it shows what’s possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Photo of Apurva Desai

Apurva Desai

Google Cloud

Apurva Desai leads the Dataproc, Composer, and CDAP products on the Data Analytics team at Google. Previously, Apurva led the mobile cloud team at Lenovo/Motorola, built and commercialized the Hadoop distribution at Pivotal Software, and spent six years at Yahoo leading various search and display advertising efforts as well as the Hadoop solutions team. He holds a master’s degree in EE from Simon Fraser University in Canada.

Photo of Cameron Moberg

Cameron Moberg

Truman State University | Google Cloud

Cameron Moberg is a senior computer science student at Truman State University in Missouri and a research intern on the Cloud Composer team at Google. Previously, he held two other internships at Google. Cameron has a passion for open source projects, with a recent interest in Apache Airflow and Apache Oozie.