Why Use a Workflow Management System?

Managing and scheduling complex workflows can seem easy, so you might fall into the trap of building a solution yourself. That is rarely the best idea, because you will quickly run into challenges that an existing system already solves. Take periodic tasks: you may need to take the output of one job and feed it in as the input to another. The second job depends on the first completing successfully, since the two must run in order; if the first task fails, errors can propagate downstream. Your scheduler should handle those situations without requiring you to be constantly on the lookout for bugs and errors while wading knee-deep into your code and its dependencies.

The same thing happened to us here at Otonomo. We created a Spark job on top of a managed Hadoop cluster that converts a given dataset into a standardized format and writes the resulting files to an optimized, partitioned location. Because we receive large amounts of data continuously but at irregular intervals, we decided to run this job once an hour, ensuring we don't miss any important data. After the conversion has taken place and the new files are written, we aggregate the data from the last 12 hours and produce summary reports of the findings.

This process sounds simple, yet if a step fails (e.g., if an hour of data is missing), the aggregated and summarized results are inaccurate. We therefore needed to ensure that the aggregation task is triggered only when our conversion task runs and succeeds, and that a different task is triggered when it fails. In other words, we needed a mechanism that supports jobs being triggered by the completion of other jobs.
That's when we decided we needed an ETL workflow framework with a scheduler that triggers the appropriate tasks as programmed.
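To make the requirement concrete, here is a minimal sketch (plain Python, not any particular framework) of the triggering behavior we were after: a downstream task fires only when its upstream task succeeds, and a fallback task fires when it fails. All function names here are illustrative placeholders, not our actual jobs.

```python
def run_with_trigger(task, on_success, on_failure):
    """Run `task`; dispatch to `on_success` or `on_failure` based on the outcome."""
    try:
        result = task()
    except Exception as exc:
        # Upstream task failed: trigger the alternative task instead.
        return on_failure(exc)
    # Upstream task succeeded: trigger the downstream task with its output.
    return on_success(result)

def convert():
    # Placeholder for the hourly Spark conversion job.
    return "converted-partition"

def aggregate(partition):
    # Placeholder for the 12-hour aggregation over converted partitions.
    return f"summary-of-{partition}"

def alert(exc):
    # Placeholder for the alternative task triggered on failure.
    return f"alerted: {exc}"

outcome = run_with_trigger(convert, aggregate, alert)
```

A real workflow manager adds what this sketch lacks: persistence, retries, scheduling, and discovery of newly deployed tasks, which is exactly why we went looking for one.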
Comparing Workflow Management Systems

Approximately 18 months ago, we looked into four main open source projects that we thought were useful for long dependency chains:
Luigi

Luigi is a fairly popular open source project created by Spotify. It has a lot of great reviews online, and its user interface for creating job flows is very easy to use. However, Luigi does not have a trigger mechanism, and as mentioned before, we needed a scheduler capable of finding and triggering newly deployed tasks. Additionally, Luigi does not assign tasks to workers and offers limited schedule monitoring.
Airflow

Airflow is an open source project developed by AirBnB. It is supported by a large community of software engineers and can be used with a lot of different frameworks, including AWS. The project's maturity level is high, yet it is currently in the process of stabilization as it is being incubated by Apache.
Pinball

Pinball is an open source project built by Pinterest. It still runs on Python 2, so it is a bit behind in terms of new capabilities (we use Python 3). Pinball's user interface was not user friendly and was rather challenging to figure out. The project also appeared to be unmaintained.
Chronos

Chronos is another open source project created by AirBnB, and it runs on Mesos. Mesos is a cluster manager that pools computing resources, making it easy to build elastic applications. Using Chronos would require us to build and maintain a Mesos environment, which isn't worth doing just for scheduling capabilities. If we were not a cloud native platform, we would have considered using DC/OS (by Mesosphere), and then Chronos would be a much more appealing option.
Workflow Management System Comparison Table
| | Luigi | Airflow | Pinball | Chronos |
|---|---|---|---|---|
| Major Known Contributors | Spotify | AirBnB | Pinterest | AirBnB |
| License Type | Apache Version 2.0 | Apache Version 2.0 | Apache Version 2.0 | Apache Version 2.0 |
| Commit Frequency | Daily | Daily | Every Few Months | Every Few Months |
| Distributed Execution Capability | No | Yes | Yes | Yes |