Danny Gitelman

Danny Gitelman is the DevOps Lead at Otonomo. Danny served in an elite IDF technological unit, managing the army's mission-critical servers. He gained knowledge in cloud, security and CI/CD while working at both large companies and startups.

How We Run CI/CD in Our Development Process

August 12, 2019

As DevOps team leader, I view the company’s engineering team as our “customers”. Since time is a very important resource for them (and for everyone, for that matter), we set out to help them save some. After some research, we recognized that a lot of time was wasted in the process of bringing work-in-progress code to the main branch in our company’s version control (we use Git). So we decided to find a solution for them, which turned out to be a complete CI/CD pipeline dedicated to easing and shortening the development process.

This blog post focuses on how we built a CI/CD pipeline for development, which is different from building one for production. While the processes are similar, not much information is available online about bringing code to the master branch. Here at Otonomo, the dev pipeline includes working with remote development environments, a merge-to-mainstream strategy, and making the process efficient for around 30 engineers.

Why use a CI/CD pipeline for development?

The pipeline increases productivity by eliminating time wasted understanding what’s wrong in the env and how to clean it up.
The pipeline reduces stress and frustration by eliminating the need to clean up an environment.
A comprehensive development pipeline means fewer bugs in production, because they are found beforehand.

The Challenges before CI/CD

When Otonomo was a small company, it was quite easy to manage our development resources and environments. Every engineer had his or her own full environment (which was mostly under-utilized and idle) and we had a very small number of packages/services to build and deploy. As we grew, challenges started and it became financially inefficient to maintain the same number of cloud resources as we previously had. As a result, we lacked enough environments and our productivity decreased. We answered the need by sharing dev environments. However, once we did, problems started sprouting like mushrooms after the rain.

Error-prone environments

Sharing environments requires constant maintenance and can lead to mistakes. Engineers always have to make sure the env they’re using is “clean” by deleting previous settings, structures, DB changes, infra modifications, etc. This requires time and a decent amount of knowledge. If it is not done properly, it causes errors.

Rising costs

To overcome the problems of shared envs, we had to increase the number of environments we were using, which led to very high bills. This made even less financial sense, considering that almost 50% of the time (nights, weekends), the environments were not being used.

Making cross changes was hard

Applying a change across all environments, for example adding a new cloud resource or a DB schema, consumed time and resources, even with automated scripts.

Merged code was buggy

Merges to the mainstream were made without fully checking end-2-end compatibility with all services, and problems got worse with infrastructure-related changes. In fact, many bugs and errors were discovered *after* the code was merged to master, either by chance or through tests we ran on staging and production (unit and integration tests are out of scope for this post, but we do have them).

The solution: a CI/CD pipeline

The answer we found to these problems was a CI/CD pipeline built out of three blocks (layers) that eases environment management (infra-wise) and merges clean, new code to master.

Block #1- Environments and infrastructure (a.k.a Nightly)

The first and foremost problem we tackled was allowing our engineers to work on a clean and stable development environment. In order to get a fresh environment, we implemented a process we call “Nightly”, an automatic env recreation that deletes the previous day’s work. At the end of the day, the “Delete” process is fired automatically. Early in the morning, before the first engineers arrive at work, the “Create” process starts. Both are triggered automatically through a scheduler. The complete process, executed on all of our development environments, is as follows:

Scorching all databases (MySQL, AWS metastore, AWS DynamoDB, etc.).
Creating or updating infrastructure for new changes and deleting old ones. This lets us roll out new changes across all environments without triggering separate cross-env update processes.
Rebuilding all microservices and packages from the latest code.
Deploying all services, components, external services, etc. (e.g., Spark applications).
Filling databases with fresh dummy data.
Re-enabling scheduled tasks (they are disabled overnight so that our Slack is not spammed with errors the whole night).
Running full end-2-end tests on each env.
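
The steps above can be sketched as a simple ordered runner. This is a minimal illustration only, not our actual Jenkins job: the `NightlyResult` class, the `run_nightly` function and the step names are hypothetical. The idea is that steps run in order for one environment, and the first failure stops the run and marks the environment as faulted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action receives the environment name.
Step = Tuple[str, Callable[[str], None]]


@dataclass
class NightlyResult:
    env: str
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None

    @property
    def status(self) -> str:
        # An environment is "faulted" as soon as any step failed.
        return "faulted" if self.failed_step else "ready"


def run_nightly(env: str, steps: List[Step]) -> NightlyResult:
    """Run the steps in order; stop at the first one that raises."""
    result = NightlyResult(env)
    for name, action in steps:
        try:
            action(env)
        except Exception:
            result.failed_step = name
            break
        result.completed.append(name)
    return result
```

A real run would pass steps such as database cleanup, infra provisioning, build, deploy, data seeding and end-2-end tests, and could report `result.status` to the engineers.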

[Screenshot: the Jenkins job that controls the nightly process]

In addition, by deleting environments at night, we cut the cost of each environment by around 40%. During weekends, for instance, we keep only 2 environments for emergency use.

[Chart: sample daily cost of dev environments, specifically AWS EC2 and Elasticsearch]

Block #2- Code deployments and merges to mainstream (a.k.a Pull Request- End 2 End a.k.a PR-E2E)

Our next challenge was split into two:

Being able to quickly deploy any service to the deployment environment.
Automatically merging fixed code to the mainstream (the master Git branch). The merge should take place only after deploying and checking the changed code (i.e., running integration and functional tests on the new code).

Our deployment process consists of three main steps: build, create and deploy. We call it “The BCD process”, and each engineer can choose which services, packages or resources to build, create or deploy. The deployment brings the code, resources or infra changes to a remote environment in the cloud.

Faster code deployment

In order to speed up this process, we took advantage of building our microservices on top of Docker containers. We pre-built a base Docker image for each microservice that contains all the relevant packages and dependencies, so when the “Create” step starts, it only has to install the new packages rather than all of them. We also run the following steps in parallel where possible:

Packing our packages (where applicable)
Running unit tests
Creating our Docker images for the microservices
Deploying services
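
As a rough sketch of running independent tasks in parallel, the standard-library thread pool is enough when each task mostly waits on external tools. The `run_in_parallel` helper and the task names are hypothetical; only tasks that do not depend on each other belong in the same batch (for example, a deploy still has to wait for its image to be built).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_in_parallel(tasks: Dict[str, Callable[[], object]],
                    max_workers: int = 4) -> Dict[str, object]:
    """Run independent tasks concurrently and collect results by task name."""
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in tasks.items()}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the task,
            # so a failed task fails the whole batch.
            results[futures[future]] = future.result()
    return results


# Usage with placeholder tasks standing in for real build commands.
batch = run_in_parallel({
    "pack_packages": lambda: "packed",
    "unit_tests": lambda: "passed",
})
```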

Quality automated merges to mainstream

After tackling the development environments, and after we (hopefully) satisfied our engineers, our next challenge was to ensure that we merge only stable code (i.e., code that passed tests) to our mainstream. This reduces the number of bugs we find at later stages, like staging or even production.

To ensure every pull request was fully checked prior to the merge to master, we applied “The BCD Process” on a dedicated environment used solely for verifying automatic merges. Approved code is merged into master, while buggy code is rejected and the environment is reverted to its previous state.

However, we lacked the ability to automatically understand which components needed to be rebuilt and tested, i.e., which parts of the code were changed in each component. At first, we chose the naïve path: building all the packages and deploying all the services. The big pain in this solution was time. Our platform comprises a large variety of components: microservices, serverless functions (i.e., AWS Lambda functions), EMR clusters and Flink. Building, testing, creating Docker images and finally deploying ~20 services for a single pull request took around 1.5 hours.

To make the process more efficient, we had to understand which files were changed in the code and map them to the relevant package, service, etc. We wrote a little tool that calculates the changes in a pull request by comparing it to the master branch. Once we obtain the list of changed dirs and files, we can associate each file with the relevant resource (and also get more info about it: whether it is enabled, the type of resource, its dependencies).

# Each resource declares its type, the packages it depends on, the actions enabled for it,
# and the repository paths whose changes affect it.
SOME_SERVICE = Resource(
    name="some_service", type=rt.SERVICE, packages_dependency=[COMMON_PACKAGE],
    actions={at.DEPLOY: True}, dependent_paths=["some_path"])
ANOTHER_SERVICE = Resource(
    name="another_service", type=rt.SERVICE, packages_dependency=[COMMON_PACKAGE, SOME_PACKAGE],
    actions={at.DEPLOY: True}, dependent_paths=["some_path"])

Now that we can calculate exactly what changed, we are able to map it to the relevant resource. Then we can trigger an automatic merge-to-master flow that builds, creates and deploys only the relevant services and their dependencies. This reduces the merge time by more than 50% on average.
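
To illustrate the idea, here is a minimal sketch of such a mapping, not our actual tool. It assumes a simplified `Resource` with only a name, paths and package dependencies (referenced by name), and the directory names in the usage are made up. The list of changed files comes from `git diff --name-only`, and a resource counts as affected if a changed file is under one of its paths or if it depends, directly or transitively, on an affected package.

```python
import subprocess
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Resource:
    name: str
    dependent_paths: Tuple[str, ...]
    packages_dependency: Tuple[str, ...] = ()  # names of other resources


def changed_files(base: str = "origin/master") -> List[str]:
    """Files changed on the current branch since it diverged from `base`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def _touches(resource: Resource, files: Iterable[str]) -> bool:
    # Match whole path components so "services/some" does not match
    # "services/some_other/...".
    for path in resource.dependent_paths:
        root = path.rstrip("/")
        if any(f == root or f.startswith(root + "/") for f in files):
            return True
    return False


def affected_resources(resources: List[Resource], files: Iterable[str]) -> Set[str]:
    files = list(files)
    affected = {r.name for r in resources if _touches(r, files)}
    grew = True
    while grew:  # propagate to resources that depend on affected packages
        grew = False
        for r in resources:
            if r.name not in affected and any(d in affected for d in r.packages_dependency):
                affected.add(r.name)
                grew = True
    return affected


RESOURCES = [
    Resource("common_package", ("packages/common",)),
    Resource("some_service", ("services/some",), ("common_package",)),
    Resource("another_service", ("services/another",)),
]
to_build = affected_resources(RESOURCES, ["packages/common/util.py"])
```

In this usage, a change under `packages/common` selects `common_package` and `some_service`, while `another_service` is left alone.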

Block #3- Bots Bots Bots

One of our challenges as the DevOps team is to maintain high transparency between the development infrastructure (i.e., development environment status, merge-to-master failures) and the engineers. Managing automated processes such as the nightly environment creation or the automatic change calculation for PRs (pull requests) for ~30 engineers creates maintenance overhead. So we use bots to show them what we’re doing. Bots make management easier, save time for engineers and increase transparency.

Welcome Bobby

Bobby (we think he is a relative of Little Bobby Tables) is a Slack bot, and he (yes, he’s part of the Otonomo family) is responsible for allocating development environments to engineers. One of the issues Bobby solves is multiple engineers working on the same environment. With dozens of dev environments, it’s very easy to ruin someone else’s work by deploying different code or changes to an occupied environment. His second ability is to preserve environments at night, in cases where we do not want to delete a specific environment. For instance, if a massive infra change was made on an environment but the work is still not completed, we do not want all of those changes to go to the trash at night. Bobby also indicates the status of each environment to the engineers; if an environment faulted during the nightly process, it is marked as such.

If someone tries to deploy to a locked environment, they get an error.
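
The allocation and locking behavior can be sketched with a small in-memory registry. This is a hypothetical illustration, not Bobby's implementation: the class, method and environment names are assumptions, and a real bot would persist its state and talk to Slack.

```python
from typing import Dict, Iterable, List, Optional, Set


class EnvironmentLockedError(Exception):
    """Raised when a deploy targets an environment held by someone else."""


class EnvironmentRegistry:
    def __init__(self, names: Iterable[str]) -> None:
        self._owners: Dict[str, Optional[str]] = {name: None for name in names}
        self._preserved: Set[str] = set()

    def allocate(self, user: str) -> str:
        """Give the user the first free environment."""
        for name, owner in self._owners.items():
            if owner is None:
                self._owners[name] = user
                return name
        raise RuntimeError("no free environment")

    def release(self, env: str, user: str) -> None:
        if self._owners[env] == user:
            self._owners[env] = None

    def check_deploy(self, env: str, user: str) -> None:
        """Refuse deploys to an environment occupied by another engineer."""
        owner = self._owners[env]
        if owner is not None and owner != user:
            raise EnvironmentLockedError(f"{env} is locked by {owner}")

    def preserve(self, env: str) -> None:
        """Exclude an environment from tonight's deletion."""
        self._preserved.add(env)

    def nightly_delete_targets(self) -> List[str]:
        return [name for name in self._owners if name not in self._preserved]
```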

Welcome Gandalf

This Slack/GitHub bot is responsible for managing all interactions related to automatic pull request merges. Using Gandalf, we insert PRs into a queue, and each PR is deployed serially to the dedicated environment (building only the changed code and its dependencies). Once the deployment completes successfully, Gandalf fires the merge command to GitHub.

In order to increase transparency, we post each merge to master on a dedicated Slack channel. This helps us quickly understand which branches were merged, and sometimes which related change was made. This is very useful when debugging, for example.
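
The serial merge flow described above can be sketched as a simple queue loop. This is an assumption-laden outline rather than Gandalf's code: `deploy_and_test`, `merge`, `revert` and `notify` are hypothetical callables standing in for the BCD run on the dedicated environment, the GitHub merge, the environment rollback and the Slack post.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def process_merge_queue(
    queue: Deque[str],
    deploy_and_test: Callable[[str], bool],
    merge: Callable[[str], None],
    revert: Callable[[str], None],
    notify: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Handle queued PRs one at a time, in arrival order."""
    merged: List[str] = []
    rejected: List[str] = []
    while queue:
        pr = queue.popleft()
        if deploy_and_test(pr):
            merge(pr)
            notify(f"{pr} merged to master")
            merged.append(pr)
        else:
            # Restore the dedicated environment before the next PR runs.
            revert(pr)
            rejected.append(pr)
    return merged, rejected
```

Processing one PR at a time keeps the dedicated environment in a known state, so each PR is verified against the result of the previous merge.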

Technical challenges we overcame when implementing CI/CD

Creating and implementing the process wasn’t always easy. Here are the main technical challenges we had to deal with, and how we overcame them.

API throttling

We use the AWS SDK for Python (boto3) for our environment creation. Once we started to run dozens of processes in parallel, the number of throttling errors increased, making our nightly process very flaky. To overcome this, we reduced the number of API calls to the minimum by optimizing the call-request ratio. In addition, we use a @retry decorator, which catches throttle exceptions and retries after an increasing delay until the retries are exhausted.
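
A minimal sketch of such a decorator, assuming exponential backoff; our real `@retry` may differ in its parameters and delays. It relies on the fact that botocore's `ClientError` exposes the service error code under `response["Error"]["Code"]`. The set of throttling codes below is a small illustrative sample, and the `sleep` parameter is an assumption added so the delay can be swapped out in tests.

```python
import functools
import time
from typing import Callable

# Sample error codes that AWS services use to signal throttling.
THROTTLE_CODES = {"Throttling", "ThrottlingException", "RequestLimitExceeded"}


def is_throttle_error(exc: Exception) -> bool:
    """True for botocore-style errors whose error code signals throttling."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def retry(max_attempts: int = 5, base_delay: float = 1.0,
          max_delay: float = 30.0, sleep: Callable[[float], None] = time.sleep):
    """Retry throttled calls, doubling the wait each time, up to max_attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Anything that is not throttling, or the last attempt,
                    # is re-raised to the caller unchanged.
                    if not is_throttle_error(exc) or attempt == max_attempts:
                        raise
                    sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
        return wrapper
    return decorator
```

Decorating a function that wraps a boto3 call is then enough to make it wait 1, 2, 4, ... seconds between throttled attempts. Adding random jitter to the delay is a common refinement when many processes retry at once.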

Package dependencies

The parallel package building process can be risky due to dependencies. If package A depends on package B, you cannot build them in parallel because B needs to be built first. This slows down the process or can cause errors. To overcome this, we always build the common packages first, and then build the packages that depend on them.
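
One general way to express this ordering, shown here as a sketch rather than our actual build script, is to group packages into "waves" with Python's standard `graphlib.TopologicalSorter` (Python 3.9+). Every package in a wave depends only on packages from earlier waves, so each wave can be built in parallel. The package names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def build_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group packages into waves; each wave depends only on earlier waves.

    `dependencies` maps a package to the set of packages it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = build_waves({
    "service_a": {"common"},
    "pkg_b": {"common"},
    "service_b": {"common", "pkg_b"},
})
```

Here `common` is built alone in the first wave, `pkg_b` and `service_a` can be built together in the second, and `service_b` comes last.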

Resource mapping challenges

Resource and dependency mapping is hard to maintain manually. As of now, we have to maintain a list of our resources and their definitions. To overcome this, we plan to generate the mapping dynamically using smart templating and some sort of key/value store or service discovery framework.

To summarize our process, we:

Run an automated process that destroys and recreates development environments, which ensures all envs are clean every day.
Save cloud development costs by deleting resources during nights and weekends.
Are able to apply cross changes without interference; Nightly does it for us.
Can lock an environment and work on it safely without interruptions from other engineers.
Automatically manage the merge queue with a bot.

