Why use a CI/CD pipeline for development?
- The pipeline increases productivity by eliminating time wasted understanding what’s wrong in the env and how to clean it up.
- The pipeline reduces stress and frustration by eliminating the need to clean up an environment.
- A comprehensive development pipeline means fewer bugs in production, because they are found beforehand.
The Challenges before CI/CD
When Otonomo was a small company, it was quite easy to manage our development resources and environments. Every engineer had his or her own full environment (which was mostly under-utilized and idle), and we had a very small number of packages/services to build and deploy. As we grew, the challenges started, and it became financially inefficient to maintain the same number of cloud resources as we previously had. As a result, we did not have enough environments and our productivity decreased. We answered the need by sharing dev environments. However, once we did, problems started sprouting like mushrooms after the rain.
Error-prone environments
Sharing environments requires constant maintenance and can lead to mistakes. Engineers always have to make sure the env they’re using is “clean” by deleting previous settings, structures, DB changes, infra modifications, etc. This requires time and a decent amount of knowledge. If it is not done properly, it will cause errors.
Rising costs
To overcome the problems of shared envs, we had to increase the number of environments we were using, which led to very high bills. This made even less financial sense considering that almost 50% of the time (nights, weekends), the environments were not being used.
Making cross changes was hard
Applying changes across environments, for example adding a new cloud resource or a DB schema, consumed time and resources, even with automated scripts.
Merged code was buggy
Merges to the mainstream were made without fully checking end-2-end compatibility with all services, and problems got worse with infrastructure-related changes. In fact, many bugs and errors were discovered *after* the code was merged to master, either by chance or through tests we ran on staging and production (unit and integration tests are out of scope for this post, but we do have them).
The solution: a CI/CD pipeline
The answer we found to these problems was a CI/CD pipeline built out of three blocks (layers) that eases environment management (infra-wise) and merges clean, new code to master.
Block #1- Environments and infrastructure (a.k.a Nightly)
The first and foremost problem we tackled was allowing our engineers to work on a clean and stable development environment. To get a fresh environment, we implemented a process we call “Nightly”: an automatic env recreation that deletes the previous day’s work. At the end of the day, the “Delete” process is fired automatically. Early in the morning, before the first engineers arrive at work, the “Create” process starts. Both are triggered automatically by a scheduler. The complete process, executed on all of our development environments, is as follows:
- Scorching all databases (MySQL, AWS metastore, AWS DynamoDB, etc.).
- Creating or updating new infra changes and deleting old ones. This lets us provision new changes across all environments without triggering cross-env update processes.
- Rebuilding all microservices and packages from latest code.
- Deploying all services, components, external services, etc. (e.g., Spark applications).
- Filling databases with fresh dummy data.
- Re-enabling scheduled tasks to avoid our Slack being spammed with errors the whole night.
- Running full end-2-end tests on each env.
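
The steps above can be sketched as an ordered list of actions that runs once per environment. This is a minimal illustration rather than our actual implementation: the step names, the `NightlyReport` type and the `run_nightly_create` function are hypothetical, and each real step would call out to the cloud provider, the build system, and so on. The sketch stops at the first failing step and reports the environment as faulted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# (step name, action) pairs; every action receives the environment name.
# All names here are hypothetical placeholders for the real steps.
Step = Tuple[str, Callable[[str], None]]


@dataclass
class NightlyReport:
    env: str
    status: str = "ready"                      # "ready" or "faulted"
    completed: List[str] = field(default_factory=list)
    error: Optional[str] = None


def run_nightly_create(env: str, steps: List[Step]) -> NightlyReport:
    """Run the steps in order and stop at the first failure."""
    report = NightlyReport(env=env)
    for name, action in steps:
        try:
            action(env)
        except Exception as exc:
            # A failed step leaves the env in an unknown state, so the
            # remaining steps are skipped and the env is marked as faulted.
            report.status = "faulted"
            report.error = f"{name}: {exc}"
            return report
        report.completed.append(name)
    return report


def placeholder(env: str) -> None:
    """Stands in for a real step such as wiping the databases."""


STEPS: List[Step] = [
    ("wipe_databases", placeholder),
    ("apply_infra_changes", placeholder),
    ("rebuild_services_and_packages", placeholder),
    ("deploy_services", placeholder),
    ("load_dummy_data", placeholder),
    ("enable_scheduled_tasks", placeholder),
    ("run_end_to_end_tests", placeholder),
]

report = run_nightly_create("dev-1", STEPS)
```

Keeping the steps as data makes it easy to run the same list against every environment and to report which step an environment failed on.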
Block #2- Code deployments and merges to mainstream (a.k.a Pull Request- End 2 End a.k.a PR-E2E)
Our next challenge was split into two:
- Being able to quickly deploy any service to the deployment environment.
- Automatically merging fixed code to the mainstream (master git branch). The merge should take place only after checking and deploying the changed code (i.e running integration and functional tests on the new code).
Faster code deployment
To speed up this process, we took advantage of building microservices on top of Docker containers. We pre-built a base Docker image for each microservice that contains all the relevant packages and dependencies. When the “Create” step starts, it only has to install the new packages rather than all of them. We also run the following steps in parallel where possible:
- Packing our packages (where applicable)
- Running unit tests
- Creating our Docker images for the microservices
- Deploying services
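
As an illustration of running independent tasks in parallel, here is a minimal sketch using Python's standard `concurrent.futures` module. The `run_in_parallel` helper and the task names are hypothetical and not taken from our pipeline; the tasks are assumed to be independent of each other (for example, building images for unrelated services).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_in_parallel(
    tasks: Dict[str, Callable[[], str]], max_workers: int = 4
) -> Dict[str, str]:
    """Run independent tasks concurrently and collect their results by name.

    If a task raises, the exception is re-raised here when its result
    is collected, so a failed build is not silently ignored.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical tasks; real ones would invoke the build or test tooling.
results = run_in_parallel(
    {
        "unit-tests": lambda: "unit tests passed",
        "image:service-a": lambda: "service-a image built",
        "image:service-b": lambda: "service-b image built",
    }
)
```

Threads are a reasonable fit when each task mostly waits on an external process or a network call; CPU-bound work would call for processes instead.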
Quality automated merges to mainstream
After tackling the development environments, and after we (hopefully) satisfied our engineers, our next challenge was to ensure that we merge only stable code (i.e., code that passed tests) to our mainstream. This reduces the number of bugs we find at later stages, like staging or even production. To ensure every pull request was fully checked prior to the merge to master, we applied “The BCD Process” on a dedicated environment used solely for verifying automatic merges. Approved code is merged into master, while buggy code is rejected and the environment is reverted to its previous state.
However, we lacked the ability to automatically understand which components needed to be rebuilt and tested, i.e., which parts of the code had changed. At first, we chose the naïve path: building all the packages and deploying all the services. The big pain in this solution was time. Our platform comprises a large variety of components: microservices, serverless functions (e.g., AWS Lambda functions), EMR clusters and Flink. Building, testing, creating Docker images and finally deploying ~20 services for a single pull request took around 1.5 hours.
To make the process more efficient, we had to understand which files were changed in the code and map them to the relevant package, service, etc. We wrote a little tool that calculates the changes in a pull request by comparing it to the master branch. Once we obtain the list of changed dirs and files, we can associate each file with the relevant resource (and also get more info about it: whether it is enabled, the type of resource, its dependencies).
Now, once we calculate exactly what changed, we are able to map it to the resource. Then, we can trigger an automatic merge-to-master flow that will build, create and deploy only the relevant services and their dependencies. This reduces the merge time by more than 50% on average.
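
A tool of this kind could look roughly like the sketch below. It is not our actual tool: the directory layout, the `resource_dirs` mapping and the `rebuild_with` mapping (which resources must also be rebuilt when a given resource changes) are assumptions made for illustration. The only external command used is `git diff --name-only`, which lists the files that differ between the merge base with master and the current branch.

```python
import subprocess
from typing import Dict, Iterable, List, Set


def changed_files(base: str = "master") -> List[str]:
    """List files changed on the current branch relative to `base`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def affected_resources(
    files: Iterable[str],
    resource_dirs: Dict[str, str],
    rebuild_with: Dict[str, Set[str]],
) -> Set[str]:
    """Map changed files to resources, then add everything tied to them."""
    files = list(files)
    # A resource is directly affected if any changed file is under its dir.
    directly = {
        name
        for name, directory in resource_dirs.items()
        if any(f.startswith(directory.rstrip("/") + "/") for f in files)
    }
    # Follow the (hypothetical) rebuild_with mapping transitively.
    affected: Set[str] = set()
    stack = list(directly)
    while stack:
        resource = stack.pop()
        if resource not in affected:
            affected.add(resource)
            stack.extend(rebuild_with.get(resource, ()))
    return affected


# Hypothetical layout: two services that both use a common package.
RESOURCE_DIRS = {
    "api": "services/api",
    "billing": "services/billing",
    "common": "packages/common",
}
REBUILD_WITH = {"common": {"api", "billing"}}

to_build = affected_resources(
    ["services/api/app.py"], RESOURCE_DIRS, REBUILD_WITH
)
```

In a real run, the file list would come from `changed_files()` inside the pull request's checkout; it is kept separate here so the mapping logic can be exercised without a repository.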
Block #3- Bots Bots Bots
One of our challenges as the DevOps team is to maintain high transparency between the development infrastructure (i.e., development environment status, merge-to-master failures) and the engineers. Managing an automated process of nightly environment creation or automatic change calculation for PRs (Pull Requests) for ~30 engineers creates maintenance overhead. So, we use bots to show them what we’re doing. Bots make management easier, save time for engineers and increase transparency.
Welcome Bobby (we think he is a relative of Little Bobby Tables)
Bobby is a Slack bot, and he (yes, he’s part of the Otonomo family) is responsible for allocating development environments to engineers. One of the issues Bobby solves is multiple engineers working on the same environment. With dozens of dev environments, it’s very easy to ruin someone else’s work by deploying different code or changes to an occupied environment. His second ability is to preserve environments at night, in cases where we do not want to delete a specific environment. For instance, if a massive infra change was made on an environment but the work is not yet complete, we do not want all of those changes to go to the trash at night. Bobby also indicates the status of an environment to the engineers: if it faulted during the nightly process, it is marked as such. If someone tries to deploy to a locked environment, they get an error.
Welcome Gandalf
This Slack/GitHub bot is responsible for managing all interactions related to automatic pull request merges. Using Gandalf, we insert PRs into a queue, and each PR is deployed to the dedicated environment (building only the changed code and its dependencies) in a serial manner. Once the deploy completes successfully, Gandalf fires the merge command to GitHub. In order to increase transparency, we post each merge to master on a dedicated Slack channel.
This helps us quickly understand which branches were merged, and sometimes which related change was made. This is very useful when debugging, for example.
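
The bookkeeping behind the two bots can be sketched in a few lines. This is an illustrative model only, not the bots' real code: `EnvironmentPool`, `EnvironmentLockedError` and `process_merge_queue` are hypothetical names, the Slack and GitHub integrations are left out entirely, and the handling of rejected PRs is an assumption.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Optional, Set


class EnvironmentLockedError(Exception):
    """Raised when deploying to an environment held by another engineer."""


class EnvironmentPool:
    """Tracks who holds each environment and which ones survive the night."""

    def __init__(self, names: Iterable[str]) -> None:
        self.owners: Dict[str, Optional[str]] = {name: None for name in names}
        self.preserved: Set[str] = set()

    def allocate(self, engineer: str) -> str:
        for name, owner in self.owners.items():
            if owner is None:
                self.owners[name] = engineer
                return name
        raise RuntimeError("no free environment")

    def check_deploy(self, env: str, engineer: str) -> None:
        owner = self.owners[env]
        if owner is not None and owner != engineer:
            raise EnvironmentLockedError(f"{env} is locked by {owner}")

    def preserve(self, env: str) -> None:
        self.preserved.add(env)

    def nightly_targets(self) -> List[str]:
        """Environments the nightly process is allowed to delete."""
        return [name for name in self.owners if name not in self.preserved]


def process_merge_queue(
    queue: Deque[str],
    deploy_and_test: Callable[[str], bool],
    merge: Callable[[str], None],
) -> List[str]:
    """Handle PRs one at a time; merge only those that deploy and pass."""
    merged: List[str] = []
    while queue:
        pr = queue.popleft()
        if deploy_and_test(pr):
            merge(pr)
            merged.append(pr)
    return merged


pool = EnvironmentPool(["dev-1", "dev-2"])
env = pool.allocate("alice")
pool.preserve(env)  # alice's unfinished infra work should survive the night
```

Processing the queue serially is what keeps the dedicated verification environment in a known state: each PR is checked against a master that already includes every previously merged PR.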
Technical challenges we overcame when implementing CI/CD
Creating and implementing the process wasn’t always easy. Here are the main technical challenges we had to deal with, and how we overcame them.
API throttling
We use the AWS SDK for Python (boto3) for our environment creation. Once we started to run dozens of processes in parallel, the number of throttling errors increased, making our nightly process very flaky. To overcome this, we reduced the number of API calls to the minimum by optimizing the call-request ratio. In addition, we use a @retry decorator, which catches throttle exceptions and retries after increasingly long delays until its attempts are exhausted.
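
A retry decorator with increasing delays can be sketched as follows. This is a generic illustration, not the decorator from our codebase: the parameter names and defaults are assumptions, and `ThrottleError` is a hypothetical stand-in for whatever exception the API client raises when throttled (with boto3 this would typically mean inspecting the error code of a `botocore.exceptions.ClientError`).

```python
import functools
import time
from typing import Callable, Tuple, Type


class ThrottleError(Exception):
    """Hypothetical stand-in for a throttling error from an API client."""


def retry(
    exceptions: Tuple[Type[BaseException], ...] = (ThrottleError,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry the wrapped call, waiting longer after each failed attempt."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: surface the error
                    sleep(delay)
                    delay *= factor

        return wrapper

    return decorator


@retry(max_attempts=3, base_delay=0.5)
def describe_environment(name: str) -> str:
    """Placeholder for a call to a rate-limited API."""
    return f"{name}: ok"
```

The `sleep` parameter exists so the waiting can be replaced in tests. When many processes retry at once, adding a small random jitter to each delay helps avoid them all hitting the API again at the same moment.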
Package dependencies
The parallel package building process can be risky due to dependencies. If package A is dependent on package B, you cannot build them in parallel because you need B to be built first. This slows down the process or could cause errors. To overcome this, we always build the common packages first, and then we build the packages that depend on them.
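
The "common packages first" rule is a two-level case of a topological ordering. As a generalization (not what we currently run), the sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to split packages into waves: every package in a wave depends only on packages from earlier waves, so each wave can be built in parallel. The `build_waves` function and the package names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def build_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group packages into waves that are safe to build in parallel.

    `dependencies` maps each package to the packages it depends on.
    A circular dependency makes `prepare()` raise graphlib.CycleError.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical packages: two services that depend on shared packages.
waves = build_waves(
    {
        "common": set(),
        "utils": set(),
        "service-a": {"common"},
        "service-b": {"common", "utils"},
    }
)
```

Here `common` and `utils` form the first wave and the two services form the second, which matches building the common packages before everything that depends on them.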
Resource mapping challenges
Resource and dependency mapping is hard to maintain manually. As of now, we have to maintain a list of our resources and their definitions. To overcome this, we plan to generate the mapping dynamically, using smart templating and some sort of key/value store or service discovery framework.
To summarize our process, we:
- Implemented an automated process to destroy and create development environments that ensures all envs are clean every day and reduces cost.
- Save cloud development costs by deleting resources during nights and weekends.
- Are able to apply cross changes without interference – Nightly will do it for us.
- Can lock an environment and work on it safely without interruptions from other engineers.
- Automatically manage the merge queue with a bot.