As DevOps team leader, I view the company’s engineering team as our “customers”. Since time is a very important resource for them (and for everyone, for that matter), we set out to help them save some. After some research, we recognized that a lot of time was wasted in the process of bringing work-in-progress code to the main branch in our company’s version control (we use Git). So we decided to find a solution for them, which turned out to be a complete CI/CD pipeline dedicated to easing and shortening the development process.
This blog post focuses on how we built a CI/CD pipeline for development, which is different from building one for production. While the processes are similar, there is not much information available online about bringing code to the master branch. Here at Otonomo, the dev pipeline includes working with remote development environments, a merge-to-mainstream strategy, and making the process efficient for around 30 engineers.
When Otonomo was a small company, it was quite easy to manage our development resources and environments. Every engineer had his or her own full environment (which was mostly under-utilized and idle) and we had a very small number of packages/services to build and deploy.
As we grew, the challenges began: maintaining a full environment per engineer became financially inefficient, so we no longer had enough environments to go around and our productivity decreased. We answered the need by sharing dev environments. However, once we did, problems started sprouting like mushrooms after the rain.
Sharing environments requires constant maintenance and invites mistakes. Engineers always have to make sure the env they’re using is “clean” by deleting previous settings, structures, DB changes, infra modifications, etc. This takes time and a decent amount of knowledge, and when it is not done properly, it causes errors.
To overcome the problems of shared envs, we had to increase the number of environments we were using, which led to very high bills. This made even less financial sense, considering that almost 50% of the time (nights, weekends) the environments were not being used at all.
Applying changes, for example adding a new cloud resource or a DB schema, consumed time and resources, even with automated scripts.
Merges to the mainstream were made without fully checking end-to-end compatibility with all services, and the problems got worse with infrastructure-related changes. In fact, many bugs and errors were discovered *after* the code was merged to master, either by chance, or through tests we ran over staging and production (the discussion about unit/integration tests is out of scope for this post, but we do have those).
The answer we found to these problems was a CI/CD pipeline built out of 3 blocks (layers) that eases the whole environment management burden (infra-wise) and merges clean, new code to master.
In addition, by deleting environments at night, we cut around 40% of the cost of each environment. During weekends, for instance, we keep only 2 environments for emergency use.
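For illustration, here is a minimal sketch of what such a nightly teardown can look like, assuming the dev environments are EC2-based and tagged; the tag names (`env-type`, `keep-alive`) are hypothetical, not our actual scheme:

```python
import boto3

ec2 = boto3.client("ec2")

def stop_dev_environments():
    # Find running instances tagged as dev environments, skipping any
    # marked to stay up (e.g., the weekend emergency pair).
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(Filters=[
        {"Name": "tag:env-type", "Values": ["dev"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instance_ids = []
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                if tags.get("keep-alive") == "true":
                    continue
                instance_ids.append(instance["InstanceId"])
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

if __name__ == "__main__":
    stop_dev_environments()
```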
Here is a sample daily cost of dev environments, specifically AWS EC2 and Elasticsearch:
If someone tries to deploy to a locked environment, they will get an error.
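Conceptually, the check looks something like the sketch below; the DynamoDB lock table and its fields are just placeholders for illustration:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
locks = dynamodb.Table("env-locks")  # hypothetical lock store

class EnvironmentLockedError(Exception):
    pass

def assert_unlocked(env_name: str) -> None:
    # Fail fast before any deploy step touches a locked environment.
    item = locks.get_item(Key={"env": env_name}).get("Item")
    if item and item.get("locked"):
        raise EnvironmentLockedError(
            f"{env_name} is locked by {item.get('locked_by', 'unknown')}"
        )
```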
This Slack/GitHub bot is responsible for managing all interactions related to automatic pull request merges. Using Gandalf, we insert PRs into a queue, where each PR is deployed to the dedicated environment (building only the changed code and its dependencies) in a serial manner. Once the deploy completes successfully, it fires the merge command to GitHub.
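Here is a simplified sketch of the queue logic, not Gandalf’s actual code; `deploy_pr()` and the repo name are placeholders, and the merge uses GitHub’s standard “merge a pull request” REST endpoint:

```python
import os
import queue
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = "otonomo/example-repo"  # hypothetical

pr_queue: "queue.Queue[int]" = queue.Queue()

def deploy_pr(pr_number: int) -> bool:
    # Placeholder: the real step builds only the changed code and its
    # dependencies, deploys to the dedicated environment, and reports success.
    return True

def merge_pr(pr_number: int) -> None:
    resp = requests.put(
        f"https://api.github.com/repos/{REPO}/pulls/{pr_number}/merge",
        headers={"Authorization": f"token {GITHUB_TOKEN}"},
    )
    resp.raise_for_status()

def worker() -> None:
    # PRs are processed one at a time: deploy, then merge on success.
    while True:
        pr_number = pr_queue.get()
        try:
            if deploy_pr(pr_number):
                merge_pr(pr_number)
        finally:
            pr_queue.task_done()
```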
In order to increase transparency, we post each merge to master on a dedicated Slack channel. This helps us quickly understand which branches were merged, and sometimes which related change was made. This is very useful when debugging, for example.
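Posting the message itself is simple; for example, with a Slack incoming webhook (the webhook URL and message format here are illustrative):

```python
import os
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def announce_merge(branch: str, pr_url: str) -> None:
    # Post to the dedicated channel so merges to master are visible to all.
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Merged `{branch}` to master: {pr_url}"},
        timeout=10,
    ).raise_for_status()
```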
Creating and implementing the process wasn’t always easy. Here are the main technical challenges we had to deal with, and how we overcame them.
We use the Python AWS API (boto3) for our environment creation. Once we started running dozens of processes in parallel, the number of throttle errors increased, making our nightly process very flaky.
To overcome this, we reduced the number of API calls to the minimum by optimizing the call-request ratio. In addition, we use a @retry decorator, which catches throttle exceptions and retries with increasing delays until its attempts are exhausted.
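A simplified version of such a decorator looks something like this; the throttle error codes and backoff parameters are illustrative, not our exact configuration:

```python
import functools
import time

from botocore.exceptions import ClientError

THROTTLE_CODES = {"Throttling", "ThrottlingException", "RequestLimitExceeded"}

def retry(max_attempts: int = 5, base_delay: float = 1.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except ClientError as err:
                    code = err.response["Error"]["Code"]
                    if code not in THROTTLE_CODES or attempt == max_attempts - 1:
                        raise
                    # Back off with an increasing delay before retrying.
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

@retry()
def create_stack(cfn_client, **kwargs):
    return cfn_client.create_stack(**kwargs)
```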
The parallel package building process can be risky due to dependencies. If package A depends on package B, you cannot build them in parallel, because B needs to be built first. This slows down the process or can cause errors.
To overcome this, we always build the common packages first, and only then the packages that depend on them, as in the sketch below.
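A more general way to express this is a topological order over the dependency graph. Here is a minimal sketch using Python’s standard-library graphlib, with hypothetical package names and a placeholder `build()` step:

```python
from graphlib import TopologicalSorter

# Map each package to the packages it depends on.
dependencies = {
    "common": set(),
    "service-a": {"common"},
    "service-b": {"common", "service-a"},
}

def build(package: str) -> None:
    print(f"building {package}")  # placeholder for the real build step

# static_order() yields packages so every dependency is built first;
# independent packages could also be built in parallel batches via
# prepare()/get_ready().
for package in TopologicalSorter(dependencies).static_order():
    build(package)
```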
Resource and dependency mapping is hard to maintain manually. As of now, we have to maintain a list of our resources and their definitions. To overcome this, we plan to dynamically generate the mapping using smart templating and some sort of key/value store or service discovery framework.