Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems

Building large-scale systems that deal with considerable amounts of data often requires numerous ETL jobs and different processing mechanisms. In our case, for example, the ETL process consists of many transformations, such as normalizing, aggregating, deduplicating and enriching millions of car data records. These kinds of processes generally start out as manual flows, but there comes a time when we need to automate them. To do so, we need a workflow management system. This blog post compares four of them: Luigi, Airflow, Pinball and Chronos.

Let’s get started.

Why Use a Workflow Management System?

Managing complex workflows and scheduling them can seem easy, so you might fall into the trap of building the tooling yourself. However, this is probably not the best idea, because you are likely to run into challenges that existing systems have already solved.

Take periodic tasks, for example. You may need to take the output of one job and use it as the input of another. Such a task depends on the preceding jobs completing successfully, because they have to run in order: if the first task doesn't run properly, it can cause errors later on. Your scheduler should be able to handle these situations without requiring you to be constantly on the lookout for bugs and errors while wading knee-deep into your code and its dependencies.

This is exactly what happened to us here at Otonomo. We created a Spark job on top of a managed Hadoop cluster that converts a given dataset into a standardized format, then writes the new files to an optimized, partitioned location. Because we receive large amounts of data both regularly and sporadically, we decided to run this job once an hour, ensuring we don't miss any important data.
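To make the conversion concrete, here is the kind of normalization and deduplication it performs, sketched in plain Python rather than Spark (the field names and rules are illustrative, not our actual schema):

```python
def normalize(record):
    """Normalize a raw car data record into a standard shape (illustrative fields)."""
    return {
        "vin": record["vin"].strip().upper(),    # canonical VIN casing
        "timestamp": int(record["timestamp"]),   # epoch seconds as an int
        "speed_kmh": float(record.get("speed", 0.0)),
    }


def deduplicate(records):
    """Keep one record per (vin, timestamp) pair, preserving the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["vin"], rec["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


raw = [
    {"vin": " abc123 ", "timestamp": "1559700000", "speed": "40.0"},
    {"vin": "ABC123", "timestamp": "1559700000", "speed": "40.0"},  # duplicate reading
]
clean = deduplicate([normalize(r) for r in raw])  # one record survives
```

The real job expresses the same logic as Spark transformations over millions of records, writing the results out partitioned (for example by date) so downstream readers can prune efficiently.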

Then, after the conversion has taken place and the new files are written, we aggregate the data from the last 12 hours and produce summary reports of the findings. This process seems pretty simple, yet if a step fails (e.g., if an hour of data is missing), the aggregated and summarized results are inaccurate.
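This failure mode is easy to sketch: before summarizing a 12-hour window, the aggregation should verify that every hourly slice is actually present. A minimal, hypothetical guard in Python (the hour keys and counts are illustrative):

```python
def missing_hours(hourly_counts, expected_hours):
    """Return the hours absent from the window; an empty list means aggregation is safe."""
    return [h for h in expected_hours if h not in hourly_counts]


def summarize(hourly_counts, expected_hours):
    """Aggregate a 12-hour window only when no hour is missing; otherwise refuse."""
    missing = missing_hours(hourly_counts, expected_hours)
    if missing:
        raise ValueError(f"refusing to aggregate, missing hours: {missing}")
    return sum(hourly_counts[h] for h in expected_hours)


window = list(range(12))
counts = {h: 100 for h in window}   # record counts per hourly slice
total = summarize(counts, window)   # all 12 hours present, safe to aggregate

del counts[7]                       # simulate a failed conversion for hour 7
gap = missing_hours(counts, window)
```

A workflow management system lets you push this guard up a level: instead of every consumer re-checking the data, the scheduler simply refuses to trigger the aggregation until the conversion task has succeeded.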

Therefore, we needed to ensure that the subsequent task is triggered only when our conversion task runs and succeeds. If it does not succeed, a different task should be triggered instead. In other words, we needed a mechanism that supports jobs being triggered by the completion of other jobs. That's when we decided we needed an ETL workflow framework with a scheduler that triggers the appropriate tasks based on the outcome of the tasks before them.

Comparing Workflow Management Systems

Approximately 18 months ago, we looked into four main open source projects that we thought could handle long dependency chains:


Luigi is a fairly popular open source project created by Spotify. It has a lot of great reviews online and its user interface for creating job flows is very easy to use. However, Luigi does not have a trigger mechanism, and as mentioned before, we needed a scheduler capable of finding and triggering newly deployed tasks. Additionally, Luigi does not assign tasks to workers, and its ability to monitor scheduled runs is limited.


Airflow is an open source project developed by AirBnB. It is supported by a large community of software engineers and integrates with many different frameworks, including AWS. The maturity level of this project is high, yet it is still stabilizing as it is being incubated by Apache.


Pinball is an open source project built by Pinterest. It still runs on Python 2, so it is a bit behind in terms of new capabilities (we use Python 3). Pinball's user interface was not user friendly and was rather challenging to figure out. The project also appeared to be unmaintained.


Chronos is another open source project created by AirBnB; it runs on Mesos, a cluster manager that pools computing resources and makes it easy to build elastic applications. Using Chronos would require us to build and maintain a Mesos environment, which isn't worth doing just for scheduling capabilities. If we were not a cloud native platform, we would have considered using DC/OS (by Mesosphere), and then Chronos would be a much more appealing option.

Workflow Management System Comparison Table

|                                    | Luigi              | Airflow            | Pinball          | Chronos          |
|------------------------------------|--------------------|--------------------|------------------|------------------|
| GitHub Contributors                | 378                | 654                | 19               | 105              |
| Major Known Contributor            | Spotify            | AirBnB             | Pinterest        | AirBnB           |
| License Type                       | Apache Version 2.0 | Apache Version 2.0 | Apache Version 2.0 | Apache Version 2.0 |
| Commits                            | 3,723              | 5,702              | 133              | 1,140            |
| Commit Frequency                   | Daily              | Daily              | Every few months | Every few months |
| Built-In Scheduler                 | No                 | Yes                | Yes              | Yes              |
| Trigger Capability                 | No                 | Yes                | Yes              | Yes              |
| Distributed Execution Capability   | No                 | Yes                | Yes              | Yes              |

Choosing the Best Workflow Management System for Us

Our main focus during the research was to find a framework that is actively maintained, has a built-in scheduler and can easily run on the AWS cloud. As the table shows, Airflow and Luigi were both well maintained, but due to Luigi's lack of a built-in scheduler, Airflow had the most to offer.

At this point in time, we chose to go with Airflow. We believe it's the best fit for job orchestration within our business, especially since we work in a cloud-based Big Data environment.

Airflow models workflows as DAGs (Directed Acyclic Graphs) composed of tasks with dependencies between them. DAGs can be scheduled to run periodically or triggered by the completion of another task. Airflow stores the state of the DAGs in a SQL database and can scale out using Celery, which lets tasks run on remote workers. We run Airflow in Docker containers on ECS, using Celery to spread the task load across multiple containers.

Hilla Shapira
June 5, 2019