Canaries of the performance wharf: Testing for performance in a multi-cloud environment
Once a year, just before the holidays at the tail end of the third quarter, retail companies are abuzz with activity: final changes to code, code freeze, performance tests, heated conversations, daily standups, and the countdown to Black Friday, Cyber Monday, and so forth. In post-pandemic times, eCommerce is at the forefront, generating significant revenue during the holidays. This translates to applications having to handle unprecedented transaction volumes during the season.
In a hybrid environment with a few applications on the cloud, one as-a-service, and the rest on-premise, tuning every application to its optimal performance is a challenging task. What makes this challenge even more daunting is that it is not uncommon to see varying performance results for the same transaction volumes at different times of the day or week. Performance architects often find themselves scratching their heads trying to find the root cause. Very soon, root cause analysis cycles become root cause devising cycles, when one of those verbose architects ties a bad performance issue to the weather outside 😊.
However, with a little precaution and forethought, performance engineering teams can stay relaxed during the peak traffic of the holiday season. The precaution comes in the form of devising canaries to understand the performance of the various environments where the applications reside. The forethought is a set of well-thought-out performance test scenarios, based on a thorough understanding of the various tiers of the applications and how they communicate with each other.
A little anecdote on canaries 🐦: Back in the day in the mining industry, as miners blasted their way deeper with dynamite, it was important to ensure the newly dug space was habitable and free of poisonous gases. So they lowered a cage containing a canary into the new channel; if the canary came back alive, they could go in and dig further, otherwise they needed a course correction.
Application performance engineers are no less than miners, since they are constantly mining for better and better performance from a given application architecture. Performance architects must therefore draw a lesson from the mines and insert canaries into the application environment to measure its performance. This will determine whether the current architecture can withstand the forecasted volumes for the upcoming holidays or whether it needs a course correction.
In a cloud environment, this step will confirm whether the entitlements the team has signed up for are yielding consistent performance. Canaries must therefore be inserted at every tier of the application. The illustration details the various components of an application that influence performance, and for which the architects should consider inserting a canary.
So, what are these canaries in this context? They are programs that perform a fixed amount of compute and memory work to complete a task. A Monte Carlo simulation to compute the value of pi is a classic example. For a given server configuration, say 2 cores at 2.5 GHz, this program will consume a fixed amount of CPU and memory, which needs to be benchmarked. Benchmarking is a one-time task that must be repeated for each type of server configuration. A minimal sketch of such a canary is shown below; the sample count and the timing approach are illustrative choices, not a prescribed standard.
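```python
# A minimal sketch of a Monte Carlo pi canary, assuming a fixed workload
# size (SAMPLES) chosen during benchmarking. The workload must be identical
# on every run so that only the environment varies, not the program.
import random
import time

SAMPLES = 5_000_000  # fixed amount of compute on every run


def estimate_pi(samples: int) -> float:
    """Estimate pi by sampling random points in the unit square."""
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


if __name__ == "__main__":
    start = time.perf_counter()
    pi_estimate = estimate_pi(SAMPLES)
    elapsed = time.perf_counter() - start
    # The elapsed time is the canary reading; compare it against the
    # benchmark recorded for this server configuration.
    print(f"pi ~ {pi_estimate:.5f}, elapsed = {elapsed:.2f}s")
```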
The next step is to run these programs across all similar servers for an application. Any difference in performance can then be easily measured. Architects must run these programs at different times of the day and document the performance. I have often noticed that, for a given application, different cores with the same configuration yield different performance results at different times of the day. One rough way to capture these readings so they can be compared later is sketched below; the CSV file name and the imported names are assumptions carried over from the canary sketch above, not part of any standard tooling.
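```python
# A hedged sketch of collecting canary readings: append one timestamped
# row per run (time, host, elapsed seconds) to a CSV, so results from
# different servers and different times of day can be compared.
import csv
import socket
import time
from datetime import datetime, timezone

from canary import estimate_pi, SAMPLES  # the canary program sketched above

RESULTS_FILE = "canary_results.csv"  # hypothetical output location


def record_run() -> None:
    start = time.perf_counter()
    estimate_pi(SAMPLES)
    elapsed = time.perf_counter() - start
    with open(RESULTS_FILE, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), socket.gethostname(), f"{elapsed:.3f}"]
        )


if __name__ == "__main__":
    record_run()
```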
In essence, architects may have done a meticulous job of defining the non-functional requirements and sizing the application, yet performance today may still vary because of extraneous factors in the environments where the applications are hosted, cloud and datacenters alike.
With the advent of virtualization, the processor core is now a sliceable entity. When architects specify the provisioning requirements, they are only defining the number of processor cores of a specific configuration on a particular operating system. Architects have no control over how, say, a dedicated server instance with two cores of 2.5 GHz each was provisioned to the application on a public cloud or datacenter. The questions the architects cannot answer are: was it two whole processor cores, or 1/10th of 10 processor cores? Where are these cores located? What other applications are concurrently running on the box where these cores live? These varying factors typically translate to varying performance. Therefore, canaries. One signal that hints at shared or oversubscribed cores is CPU steal time, and a small sketch of reading it follows.
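```python
# A minimal sketch (Linux only) of checking CPU "steal" time from /proc/stat,
# i.e. cycles the hypervisor gave to another tenant while this guest wanted
# to run. A noticeable steal fraction suggests the cores are not exclusive.
# The alert threshold below is an illustrative assumption.
def cpu_steal_fraction() -> float:
    """Return the fraction of total CPU time spent as steal time since boot."""
    with open("/proc/stat") as f:
        # First line: "cpu  user nice system idle iowait irq softirq steal ..."
        fields = [int(v) for v in f.readline().split()[1:]]
    total = sum(fields)
    steal = fields[7] if len(fields) > 7 else 0
    return steal / total if total else 0.0


if __name__ == "__main__":
    steal = cpu_steal_fraction()
    print(f"steal fraction since boot: {steal:.4%}")
    if steal > 0.02:  # assumed threshold for illustration
        print("Noticeable steal time: the 'dedicated' cores may be shared.")
```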
The next step is to document these results thoroughly. For an on-premise application, architects must work with the infrastructure team to understand where the servers are and what other process or application is scheduled to run on the same server. This is a very important step. For applications on the cloud, architects must call the cloud provider, discuss these results, and ensure the terms of the cloud agreement are met. It doesn't hurt to keep management informed of the findings and the ongoing discussions with the infra and cloud teams.
Based on the outcome of these discussions, architects must ensure the application server cores are whole, exclusive, and not shared with any other compute- or data-intensive application. Last and importantly, architects must repeat the exercise of running the canaries once the corrections are made. Performance scenarios must not be executed until this preliminary, precautionary step is complete: canaries run, discoveries made, and corrections incorporated.
Canaries also come in very handy for applications already in production, where performance keeps varying despite all the tuning and tweaking. A rough sketch of what a continuously running production canary might look like follows; the baseline, tolerance, and interval values are assumptions, and the alert here is just a print where a real setup would notify someone.
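```python
# A rough sketch of a production canary loop, assuming a previously
# benchmarked baseline duration for this server configuration.
import time

from canary import estimate_pi, SAMPLES  # the canary program sketched earlier

BASELINE_SECONDS = 4.0   # assumed benchmark for this server configuration
TOLERANCE = 1.25         # flag runs more than 25% slower than baseline
INTERVAL_SECONDS = 3600  # run once an hour

while True:
    start = time.perf_counter()
    estimate_pi(SAMPLES)
    elapsed = time.perf_counter() - start
    if elapsed > BASELINE_SECONDS * TOLERANCE:
        print(f"Canary regression: {elapsed:.2f}s vs baseline {BASELINE_SECONDS:.2f}s")
    time.sleep(INTERVAL_SECONDS)
```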
So, architects, will you now mobilize your team and have them write a quick canary, insert it in every tier, and measure the performance? In my opinion, you must make this a prerequisite to any performance test exercise.