How we shortened our nightly build time and improved stability using the TeamCity REST API
“What’s the first thing that comes to mind when you think about CI?”
If you are a developer, you probably think it should be FAST;
If you are an automation engineer, you probably think there should be lots of tests;
And if you are a DevOps engineer you ARE worried.
Back in 2016, we faced a problem. Our CI process became madness. Out of a culture where every small feature and bug needed to be automated, the type of culture where automation tests are a MUST and new infrastructure is added on a monthly basis, we found ourselves buried under a heap of tests.
Combining our top 5 automation infrastructures, we had more than 10K tests, which would have taken over 20 hours to run on a single CI agent. The need to cover so much ground forced us to write a powerful UI/Web test framework that let us easily test our full stack. However, despite the many enhancements we made, each web test still took between 1 and 5 minutes to execute.
Typically, when you think of a testing pyramid, you want a solid base of unit tests, a reasonable number of integration tests, and a few high-quality web tests. In our case, the power of that framework created the problem: a huge number of web tests that almost prevented our nightly suite from running on our JetBrains TeamCity (TC) instance. Back then, we had about 10 agents and about a dozen build configurations in TC, each running a specific type of test and each taking a long time. For example, running our integration test assembly took 3.5 hours, while our web test suite took 5 hours.
A small number of build configurations sounds like a best practice when you have a huge number of tests, but we faced several problems:
Only one shade of red: The build was always red, full of unstable tests, execution failures, infrastructure issues, etc. The build never reflected the actual status of the product or the progress of the current iteration.
Extremely long nightly build, or should we say daily? We even considered only running it 3 times a week.
Execution time: Each build execution took between 1 and 5 hours.
Stability: It was impossible to identify unstable tests when the minimum cycle time was 5 hours.
Limited number of build executions: Infrastructure errors prevented entire build executions from running during the nightly. Losing one nightly build result can be crucial.
Unusually long response time for private builds: Developers had to wait a day or two to get execution results from TeamCity when changing code in a specific area of the product, even before committing the code to the main repository.
Build agent inaccessibility: Build agents were busy most of the time. As a result, other builds had to wait a long time in the queue.
The Beginning of a Solution
We needed a solution, or at least a way to give the CI consumers a breath of fresh air. A solution that would make their lives easier while achieving maximum stability and quick response times.
So we decided to SPLIT. Split everything, class by class: web tests, integration tests, unit tests. Instead of having fewer than 15 build configurations in TC, we would have more than 200.
Although TC easily handles 200 builds, we couldn’t manage this setup manually, considering we had 4 branches in development, 3 DevOps engineers, and just 10 agents trying to keep up with the pace. Managing the new build configurations became our day-to-day. As a result, a few days after the initial split we found ourselves in chaos. And as usual, when you don’t have enough, you need to automate what you have.
Splinter - our personal splitter. We realized we needed someone to keep up with the developers, review the changes in the code, observe new classes and new tests, and introduce them to TC while removing previously deleted ones. Time to write some automation.
Here’s the basic concept:
The Splinter utility should support any test assembly from our CI process and do the following:
1) Using reflection, analyze the assembly and find the relevant test classes. Test classes that are ignored, empty, or contain only ignored tests are skipped.
2) Clone an existing TC build configuration template and set the correct parameters in the new build configuration. If a matching build configuration already exists, just move to the next one in line.
3) Trigger the build configuration.
4) For cleanup, remove any build configurations that don’t have a matching class in the code.
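The four steps above boil down to comparing what is in the code with what is in TC. Here is a minimal Python sketch of that comparison. It is an illustration only, not Splinter's actual C# code: `DiscoveredClass`, `runnable_classes`, `plan_sync`, and `SyncPlan` are hypothetical names, and the reflection step is replaced by plain data describing each test class.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveredClass:
    """Hypothetical stand-in for a test class found by reflection."""
    name: str
    ignored: bool = False
    # Maps test method name -> True if that test is marked as ignored.
    tests: dict = field(default_factory=dict)


@dataclass
class SyncPlan:
    to_create: list   # classes that need a new build configuration
    to_trigger: list  # classes whose build configuration should run
    to_remove: list   # build configurations with no matching class


def runnable_classes(classes):
    """Step 1: skip ignored classes, empty classes, and all-ignored classes."""
    return {
        c.name
        for c in classes
        if not c.ignored and any(not skipped for skipped in c.tests.values())
    }


def plan_sync(classes, existing_configs):
    """Steps 2-4: decide what to create, trigger, and clean up."""
    wanted = runnable_classes(classes)
    existing = set(existing_configs)
    return SyncPlan(
        to_create=sorted(wanted - existing),   # step 2: clone the template
        to_trigger=sorted(wanted),             # step 3: trigger every match
        to_remove=sorted(existing - wanted),   # step 4: cleanup
    )
```

Keeping the plan separate from the calls that act on it means the decision logic can be checked without touching a TC server.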
Master Splinter (or Splinter Manager) is triggered every night and initiates all the other Splinter runs, one per test assembly.
At Quali, we write most of our infrastructure code in C#. For the Splinter project, we couldn’t find a C# library that wrapped TeamCity’s rich RESTful API for advanced queries, so we decided to develop one: FluentTC.
FluentTC: an easy-to-use, readable, and comprehensive library for consuming the TeamCity REST API. Written with real scenarios in mind, it enables a wide range of queries and operations on TeamCity.
TeamCity’s rich RESTful API allowed us to implement Splinter easily and to add custom logic and enhancements in later versions of Splinter.
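To give a feel for the kind of REST call involved, here is a small Python sketch that prepares a request to queue a build. It is not FluentTC and not Splinter's code. It assumes TeamCity's build queue endpoint (`/app/rest/buildQueue`) with a JSON body and token-based authentication; `build_trigger_request`, the server URL, the build configuration ID, and the token are all made-up names and values for illustration.

```python
import json
import urllib.request


def build_trigger_request(server_url, build_type_id, token):
    """Prepare (but do not send) a request that queues one build.

    Assumption: the server accepts a bearer access token. Older setups
    may need a different authentication scheme.
    """
    payload = {"buildType": {"id": build_type_id}}
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/app/rest/buildQueue",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


# Hypothetical values, for illustration only.
request = build_trigger_request(
    "https://teamcity.example.com/", "Web_LoginTests", "my-token"
)
```

Passing the prepared request to `urllib.request.urlopen` would actually send it. Building it separately keeps the example safe to run without a server.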
Introducing Splinter has created a new CI culture:
Agent utilization is amazing. Almost at any time, a developer can find a free slot to execute a build.
Nightly can be executed every 10 hours and still allow quick TeamCity response for developers during their workday commits.
Adding more agents to the agent pool immediately cuts the nightly execution time. Instead of having huge builds that are executed on a single agent, one at a time, Splinter allows the builds to be distributed across the available agents and executed in parallel.
Unstable tests are quickly detected, allowing for prompt investigation, and unstable test classes can be executed at short intervals to collect actionable statistics.
Two-phase check-in made easy: Using TeamCity ReSharper integration for Visual Studio, developers can now select a specific set of test classes to execute with their private build, so areas affected by the code changes are fully tested before committing the code to the main repository. Response time for two-phase check-in is not likely to exceed 15–20 min.
Managers can enjoy a bird’s eye view of the product status and stability and quickly pinpoint defective areas in the product.
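Why do small builds make extra agents pay off? The Python sketch below simulates the effect under simplifying assumptions: every agent can run every build, there is no queueing or startup overhead, and each build goes to whichever agent frees up first, longest builds first. `nightly_duration` is a hypothetical helper and the durations are invented numbers, not our measurements.

```python
import heapq


def nightly_duration(build_minutes, agent_count):
    """Estimate total wall-clock minutes for a set of builds on N agents."""
    if agent_count < 1:
        raise ValueError("at least one agent is required")
    # Each heap entry is the time at which one agent becomes free.
    agents = [0] * agent_count
    heapq.heapify(agents)
    # Longest builds first, each assigned to the earliest-free agent.
    for minutes in sorted(build_minutes, reverse=True):
        free_at = heapq.heappop(agents)
        heapq.heappush(agents, free_at + minutes)
    return max(agents)


# One monolithic 300-minute build cannot use extra agents at all.
monolith = nightly_duration([300], agent_count=10)

# The same work split into ten 30-minute builds spreads across the pool.
split = nightly_duration([30] * 10, agent_count=10)
```

In this toy setup the monolithic build still takes 300 minutes on ten agents, while the split version finishes in 30.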