What is Performance Testing?
Performance testing is complex. This article breaks down its components, explains PSR, and looks ahead to Testaify.
TABLE OF CONTENTS
- Part 1 - It is Time to Talk About Performance
- Part 2 - How to Do Performance Testing
- Part 3 - Performance Testing: What about Scalability, Stability, & Reliability?
- Part 4 - Back in the ‘SSR [the PSSR, that is].
We recently introduced the concept of Continuous Comprehensive Testing (CCT), and we still need to discuss in depth what that means. This series of blog posts will provide a deeper understanding of CCT, focusing on performance testing.
In our introductory CCT blog post, we said the following:
Our goal with Testaify is to provide a Continuous Comprehensive Testing (CCT) platform. The Testaify platform will enable you to evaluate your software product through:
- Functional testing
- Usability testing
- Performance testing
- Accessibility testing
- Security testing
While we cannot offer all these perspectives with the first release, we want you to know where we want to go as we reach for the CCT star.
It is time to talk about application performance testing.
Like functional testing, performance testing is complicated, and it has a lot of variables. While the industry agrees on specific aspects of performance testing, the devil is always in the details.
Wikipedia defines performance testing as “a testing practice performed to determine how a system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate or verify other quality attributes of the system, such as scalability, reliability, and resource usage.” The definition seems familiar because almost every vendor or thought leader who writes about performance testing copies it from Wikipedia.
Performance testing is an umbrella term. While there is some disagreement about the exact types of performance testing, most sources include the following types (a rough sketch of their load shapes follows the list).
Types of Performance Testing
- Load Testing – testing the system under specific expected load conditions. Usually, that means “peak load.” Yes, I used double quotes. You will find out why.
- Stress Testing – customarily used to identify the upper limits of the system's capacity.
- Spike Testing – as the name suggests, this test increases the system's load significantly to determine how the system will cope with those sudden changes.
- Endurance Testing (sometimes called Soak) – usually done to determine if the system can sustain the continuous expected load.
- Scalability Testing – aims to identify the point beyond which the system can no longer scale to handle additional load.
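To make the differences concrete, here is a minimal Python sketch of how the load "shape" differs between these test types. Everything in it (the 200-user peak, the ramp rates, the 8-hour soak) is an invented number purely for illustration, not a recommendation.

```python
# Illustrative only: each function maps elapsed minutes to a target user count.
PEAK_USERS = 200  # assumed "expected peak" for this sketch

def load_profile(minute: int) -> int:
    """Load test: ramp up to the expected peak load, then hold it."""
    return min(PEAK_USERS, minute * 20)            # reach the peak in ~10 minutes

def stress_profile(minute: int) -> int:
    """Stress test: keep ramping past the expected peak to find the upper limit."""
    return minute * 20                              # no ceiling

def spike_profile(minute: int) -> int:
    """Spike test: jump suddenly from a low baseline to a multiple of the peak."""
    return PEAK_USERS * 5 if 10 <= minute < 15 else PEAK_USERS // 4

def soak_profile(minute: int) -> int:
    """Endurance (soak) test: sustain the expected load for a long period."""
    return PEAK_USERS if minute < 8 * 60 else 0     # 8 hours in this sketch

def scalability_profile(minute: int) -> int:
    """Scalability test: step the load up until the system stops scaling."""
    return PEAK_USERS * (1 + minute // 15)          # add a "peak" every 15 minutes
```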
If you check multiple sources regarding performance testing types, you will see other types like volume, unit, breakpoint, or internet testing. While some of those names are more appropriate than others, most of them do not matter. What matters in performance testing are the quality attributes you are trying to measure.
Do you do software performance testing with PSR?
At Ultimate Software, my friend Jason Holzer developed the acronym PSSR. Most people kept pronouncing it with one “S,” so it became PSR. The acronym stands for Performance, Scalability, Stability, and Reliability. He coined it because what we care about in performance testing is answering these questions:
- Performance – Can the system provide an acceptable response time with no errors and efficient use of resources?
- Scalability – At what load does the system stop having an acceptable response time with no errors?
- Stability – How long can the system provide acceptable response time with no errors for a significant period without intervention?
- Reliability – How reliable is the system after months of use without intervention?
These four letters match the four different tests we ran. Each one served as a gate. Each one provided an answer to one of our questions.
Now, we are getting into the performance testing details. Notice I did not mention “peak load.” I always find the idea that companies know the performance requirements of their product hilarious. I have never seen a product team provide that information in my career. How do you know what the “peak load” is? Instead, we try to find the actual load by defining a test based on response time and reliability that the system must meet.
Let’s keep getting deeper into the performance testing details. The first key question is: What is an acceptable response time? The second question is: How do you test for it?
Acceptable Response Times and How to Test for Them
The first performance testing question has an answer that most people start with. It all begins with a 1968 test conducted as part of human-computer interaction research. Those results have become the following advice:
The basic advice regarding response times has been about the same for thirty years [Miller 1968; Card et al. 1991]:
- 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
- 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
- 10 seconds is about the limit for keeping the user's attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
Here is the link if you want to read about it: https://www.nngroup.com/articles/response-times-3-important-limits
Some have suggested that the third limit has changed over time as computers have become omnipresent in our lives. In the early days of the internet, the 8-second rule became famous. Today, some organizations use an upper limit of 5 seconds instead. Others ignore the upper limit and focus only on the first two lower limits.
When we started, we used the Miller results and kept the 10-second upper limit. Eventually, I replaced it with a 5-second limit. You can test a single user and check whether the response time stays within that upper limit, but if your system can only handle one user within it, it will fail in the marketplace.
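As a rough illustration, here is how those limits can be turned into a simple pass/fail bucket in Python. The bucket names are mine, and the 5-second ceiling is just the value we settled on; adjust it to your own upper limit.

```python
UPPER_LIMIT_SECONDS = 5.0   # our replacement for the original 10-second limit

def classify_response_time(seconds: float) -> str:
    """Bucket a measured response time against the classic perception limits."""
    if seconds <= 0.1:
        return "instantaneous"       # user perceives no delay at all
    if seconds <= 1.0:
        return "flow preserved"      # noticeable, but the user's flow of thought survives
    if seconds <= UPPER_LIMIT_SECONDS:
        return "acceptable"          # within our chosen upper limit
    return "unacceptable"            # attention is lost; the transaction fails the gate

print(classify_response_time(0.8))   # -> "flow preserved"
```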
That means you must define a load test and, more importantly, determine how many concurrent users you can support. But as the saying goes, the devil is still in the details. What are concurrent users?
We will define simultaneous, concurrent, and active users in an upcoming post. We will also talk about how Testaify sees performance testing. Spoiler alert: It heavily depends on our work experience at Ultimate Software.
Special note for those who enjoy our content: Please feel free to link to any of our blog posts if you want to refer to them in your materials.
How to Do Performance Testing
In this blog post, we continue our discussion about performance testing. Our previous blog post on performance testing discussed defining a test. That test must answer the following question: Can the system provide an acceptable response time with no errors and efficient use of resources?
One of my biggest issues with how the industry treats performance testing is that it pushes specialists to become tool operators who report data with no analysis beyond what the tool produces. It is incredible how much money you can waste on performance testing if the person using the tool is a glorified test runner.
At Ultimate Software, our PSR team had the reputation of being the smartest guys in the company. We moved away from tool experts and toward performance test engineers. Our team knew how to use the tool, but more importantly, they could dig deep, develop their own analysis, and provide valuable feedback to the software engineers.
As such, we had a very opinionated team. Because Ultimate Software was one of the early SaaS companies, we had to run our own infrastructure (this was before AWS existed). We were obsessed with reducing TCO (Total Cost of Ownership).
We designed our tests in that direction and concluded that certain industry practices made no sense in that pursuit. In particular, we did not believe in using tests with active users.
Let's define a few key performance testing concepts.
Simultaneous users all execute the same transaction at the same time. For example, with 100 simultaneous users, all 100 log in at the same moment. This type of test uses a rendezvous: users stay in sync and wait for one another so they all fire the transaction together. It is especially useful for finding issues in specific transactions. We used it often, particularly on login and landing pages; you want your first impression to be great.
Concurrent users all execute a transaction at the same time, but not necessarily the same one. For example, 10 users are logging in, 15 are checking their paystub (our suite covered HR, Benefits, Payroll, etc.), and 20 are entering PTO. The key is that every user is doing something at the same time, even if it is not the same thing. We love concurrent users: this kind of test exercises the whole system and quickly reveals the quality of its architecture.
Finally, the industry popularized certain types of scenario testing using active users. Active users are all in the system at the same time, but they are not all doing something at the same time. These tests will use random wait times (usually between 1 and 30 seconds). The argument is that not all users will do something simultaneously in real life. Some will think about what they will eat for lunch or drink coffee. Perhaps, but as a performance test, using active users is useless.
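Here is a minimal Python sketch of the three user models, assuming a hypothetical run_transaction() stand-in for a real request against the system under test. The rendezvous for simultaneous users is just a barrier; the only thing the "active user" model adds is the random think time, which is precisely what hides architectural problems.

```python
import random
import threading
import time

def run_transaction(name: str) -> None:
    """Hypothetical stand-in for a real request against the system under test."""
    time.sleep(0.05)  # pretend the call took 50 ms

def _run(workers: list[threading.Thread]) -> None:
    for t in workers:
        t.start()
    for t in workers:
        t.join()

def simultaneous_users(n: int, transaction: str) -> None:
    """All n users fire the SAME transaction at the same instant (rendezvous)."""
    barrier = threading.Barrier(n)
    def user() -> None:
        barrier.wait()                      # rendezvous: wait for every other user
        run_transaction(transaction)
    _run([threading.Thread(target=user) for _ in range(n)])

def concurrent_users(mix: dict[str, int]) -> None:
    """Every user works at the same time, but on a MIX of transactions, no think time."""
    workers = [threading.Thread(target=run_transaction, args=(tx,))
               for tx, count in mix.items() for _ in range(count)]
    _run(workers)

def active_users(n: int, transaction: str) -> None:
    """Industry 'active user' model: random think times leave most users idle."""
    def user() -> None:
        time.sleep(random.uniform(1, 30))   # think time hides real concurrency
        run_transaction(transaction)
    _run([threading.Thread(target=user) for _ in range(n)])

# The concurrent mix from the example above: logins, paystub checks, PTO entries.
concurrent_users({"login": 10, "view_paystub": 15, "enter_pto": 20})
```

Only the first two models matter to us; the third is included to show exactly what the think times do to the load.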
DIY due diligence with performance tests
As a technology executive, I had to perform due diligence on products we considered buying and adding to our portfolio. We always asked for performance testing results (ideally, we wanted to run our own). Whenever a vendor provided a performance testing paper with a scenario test using active users (wait times), I knew their product architecture was terrible. Nothing reveals the quality of the architecture of a system like performance testing. But if you want to hide those issues, drink the industry Kool-aid and use active users.
For us, only performance tests using simultaneous or concurrent users (no wait times) are worth doing. The rest is just smoke and mirrors. As I said, we were an opinionated group. The objective of testing is to reveal problems, not to hide them. I might be showing my age here, but if you are a fan of Seinfeld, you remember the episode The Hamptons (the original title was The Ugly Baby). Yes, that is the one. Performance testing is about radical honesty; if the baby is ugly, you must tell them.
Testing a Replica?
Another industry practice with little value is testing in a replica of your production environment. First, how do you know what the production environment requires? Of course, I forgot: the architect who designed the system told you. Performance testing is about testing the architecture, so by definition, we are trying to break the architect’s work. Instead, we test with the smallest possible resource footprint. The company's objective is to make money, and we help it do that by reducing the TCO, not by trying to make the architect happy. Remember, it is the architect’s baby; if it is ugly, you have to tell him. Besides, the results from the PSR team's tests should tell us what the production environment needs to look like, not vice versa.
We started our performance testing with load tests using the smallest possible set of resources. If the system had web servers, app servers, and database servers, we tested with one web server, one app server, and one database server. We wanted the least amount of stuff between our test and the code. These days, with serverless architectures, you get the actual cost of the test, which makes things easier: you just have to reduce that cost as much as possible to improve your TCO.
For single transactions like logging in, we use simultaneous users. For the whole application, we use concurrent users. We also set a threshold of 100 users passing before we would call an application well designed. You would not believe all the fights we had with engineering when we picked that number.
In other words, to say a system meets the minimum requirement, it has to support 100 simultaneous users per transaction with a 90th percentile response time below 10 seconds, zero errors, and no excessive use of resources (85% utilization or less).
Why the 90th percentile? Hopefully, you know the answer: an average only tells you that about half of your users meet the criteria, and you want most of your users to meet the requirements. Another important lesson: if someone gives you results using averages, feel free to slap them. Also, hand them a copy of The Flaw of Averages. Reporting only averages is another sign of a performance paper trying to hide something.
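Here is a small sketch of the gate above and of why averages mislead; the 10-second, zero-error, and 85% thresholds come from this post, and the nearest-rank percentile is just one simple way to compute a p90.

```python
import statistics

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value that pct% of samples fall at or below."""
    ordered = sorted(samples)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def passes_gate(response_times: list[float], errors: int, peak_resource_pct: float) -> bool:
    """Minimum bar from this post: p90 under 10 s, zero errors, at most 85% resource use."""
    return (percentile(response_times, 90) < 10.0
            and errors == 0
            and peak_resource_pct <= 85.0)

# Why averages hide problems: half the users can be far worse off than the mean suggests.
times = [0.4] * 50 + [9.0] * 50               # 100 simulated transactions
print(round(statistics.mean(times), 1))       # 4.7 s -- looks comfortable
print(percentile(times, 90))                  # 9.0 s -- what most users actually experience
print(passes_gate(times, errors=0, peak_resource_pct=70.0))   # True, but barely
```

The toy numbers show how a 4.7-second average can coexist with half your users waiting 9 seconds.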
What about the other questions? “You need to tell us the whole PSR process.” Relax; that information will come in the next blog post. Stay tuned!
Part 3 - Performance Testing: What about Scalability, Stability, & Reliability?
Our previous blog post on performance testing discussed defining a test. That test must answer the following questions:
- Performance – Can the system provide an acceptable response time with no errors and efficient use of resources?
- Scalability – At what load does the system stop having an acceptable response time with no errors?
- Stability – How long can the system provide acceptable response time with no errors for a significant period without intervention?
- Reliability – How reliable is the system after months of use without intervention?
In this blog post, we will continue analyzing performance testing.
Performance: Can the system provide an acceptable response time with no errors and efficient use of resources?
Before we move from the “P” to the two “S”s and the “R,” let me add one more thought regarding the first question about the system providing an acceptable response time with zero errors: As discussed in my previous blog post on performance testing, we evaluate products using the smallest footprint possible. We also came up with an approach that focuses on breaking the system.
We do this because performance testing is about critiquing the product (you can learn about Marick’s testing quadrants in this post). At Ultimate Software, we used a threshold to evaluate products—that famous 100 simultaneous users for single transactions and 100 concurrent users for the whole system. And we do not use “think times.”
To be completely transparent, most of the products we tested did not meet this threshold; fewer than 20% did. The ones that did had the highest margins because their TCO (Total Cost of Ownership) was low. We made a lot of money on those products.
As for the many that did not meet the threshold, in many cases the company acquired them anyway (the whole company or just the codebase), even when we, the engineering team, told the leadership team not to buy them. In one instance, Ultimate Software acquired a company for one of its talent management products. That product could not get past 25 users. After a difficult period in production, the company decided to rewrite the product.
It’s essential to understand the business perspective when doing performance testing. In my experience, you can have a successful business with a product that only reaches 35 users in our test. Your TCO will be very high, but you can still make a lot of money if you have little competition in a niche market. Technical debt is usually a slowly growing curve: you can throw resources at it until you hit the critical scalability threshold. Eventually, that technical debt will catch up with you and slow your business to a crawl unless you aggressively address the issues. The longer you wait, the more it is going to hurt.
Because the company decided to buy many products against our recommendation, we had no choice but to figure out the other quality attributes of those products, too. Even our homegrown product had a difficult journey trying to meet the threshold. Besides, there is a big difference between a product that gets to 25 users versus one that gets to 75 users.
As such, we had to answer other questions, but we’ll pause here. You know you love a good performance testing cliffhanger.
Part 4 - Back in the ‘SSR [the PSSR, that is].
It is time to talk about Scalability.
Some will say: “You already answered that question with the first test,” but that is not entirely true. (What first test? That famous 100 simultaneous users for single transactions and 100 concurrent users for the whole system.) While we are not fans of testing on a replica of the production environment, you do have to model and figure out what that production environment will look like and how the system will scale. In other words, performance testing should drive how the production environment should look, not vice versa.
To answer the other questions, we modeled our performance testing environment to include all the typical production components, like load balancers and other networking gear. We created clusters for our database servers, and so on. In other words, we mimicked the traditional trappings of a production environment.
One significant difference was our storage subsystem. Because we wanted to run tests as often as possible, we chose storage that let us restore the entire database and file management systems in seconds, which meant using a completely different storage system than the one in production.
Test data is always one of the biggest challenges in software testing, and that is especially true for performance testing. After creating our baselines, we did not want to spend time recreating them for every new test run. Most of the code we wrote existed to help us create test data.
Today, thanks to IaaS offerings like AWS, Azure, and GCP, you can use code for all your infrastructure needs and turn the environment off when you do not need it. You can copy it if you need to. Infrastructure as code is a beautiful thing.
Using this environment and the information we learned from our previous test, we can create a test to see how the system handles scaling. For example, if our results show the environment cannot cross 75 users, we will test with 75 concurrent users. We can then double the number of users and resources (usually web servers and app servers) and see if the behavior at 150 concurrent users matches the expectations. We can continue this process until we find the scaling bottleneck.
Let’s assume response time slows down significantly when you test at 1200 concurrent users. Analysis points to the database, since resource utilization looks normal in the other system components. Thanks to the quick restore capabilities of our storage subsystem, we can rerun the test with 1150 concurrent users, then 1100, and eventually 1125. We confirm the issue appears between 1100 and 1125 concurrent users. We now have a very narrow range that defines the scale of the environment.
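The narrowing described above is essentially a binary search over user counts. A sketch, assuming a hypothetical run_load_test(users) that returns True when the response-time and error criteria hold, and assuming the environment is restored between runs (which is exactly what our fast storage restores made cheap):

```python
def find_scaling_limit(run_load_test, low: int, high: int, resolution: int = 25) -> tuple[int, int]:
    """Narrow down the largest user count that still passes the test.

    Assumes `run_load_test(users)` returns True when the run meets the
    response-time and error criteria, and that `low` passed while `high`
    failed before calling (e.g. 1100 passed, 1200 failed).
    """
    while high - low > resolution:
        mid = (low + high) // 2
        if run_load_test(mid):
            low = mid            # still healthy: the real limit is higher
        else:
            high = mid           # degraded: the real limit is lower
    return low, high             # the limit sits somewhere in this range

# Toy stand-in: pretend the database starts struggling above ~1110 users.
print(find_scaling_limit(lambda users: users <= 1110, low=1100, high=1200))
# -> (1100, 1125): the same narrow range described above
```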
Because we had a lot of production data, we also knew that our concurrent user numbers mapped to roughly 10x that many active users for this application. In other words, the issue will likely appear in production somewhere between 11,000 and 11,250 active users.
I will not dive deeply into the analysis required to address that scalability issue. Most new offerings today implement horizontal scaling in all their layers, but many products still hit the typical database bottleneck because horizontal scalability is limited at that layer. Many organizations simply scale their databases vertically when they hit this issue. That is a temporary solution, but it can work for a significant period.
While that test answers one dimension, we must see if any hidden issues create stability problems.
Stability: How long can the system provide acceptable response time with no errors for a significant period without intervention?
We have a good idea of the response time and scalability thresholds. The next step is to figure out whether there are any stability issues. Using the same environment as for scalability, we can test for stability. To answer this question, we ran a test that generated a lot of activity in the system without breaking the other thresholds.
For example, suppose the smallest footprint can handle 75 concurrent users within the response-time limit, and the scalability limit is 1100 concurrent users. In that case, we run a test that respects those limits for a long time and watch for anything that shows up. We called this test “24 hours” because that is how long it ran. In those 24 hours, we generated the equivalent of months of usage.
With that much data, we can see whether the system has hidden issues, like memory leaks, that do not show up until you hit a certain number of transactions. When you run these endurance tests, you focus on measuring success and error rates, usually expressed as a percentage of transactions. Once again, we are trying to critique the product by breaking it.
Eventually, even a stable system stops being stable: the transaction success rate drops below 100%. That point gives your performance testing team the stability threshold.
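A sketch of how that gate can be read from the test output, assuming the harness reports attempted and failed transaction counts per interval; the numbers below are invented.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    hour: int          # hours since the endurance test started
    attempted: int     # transactions attempted during this interval
    failed: int        # transactions that errored during this interval

def stability_threshold(intervals: list[Interval]) -> int | None:
    """Return the hour at which the success rate first drops below 100%, if it ever does."""
    for interval in intervals:
        success_rate = 100.0 * (interval.attempted - interval.failed) / interval.attempted
        if success_rate < 100.0:
            return interval.hour
    return None                       # stayed at 100% for the entire run

# Invented run: clean for 18 hours, then a slow leak starts causing errors.
run = [Interval(hour=h, attempted=50_000, failed=0) for h in range(18)]
run.append(Interval(hour=18, attempted=50_000, failed=12))
print(stability_threshold(run))       # -> 18
```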
Reliability: How reliable is the system after months of use without intervention?
For the last question, we had a test called “48 hours.” It was similar to the “24-hour” stability test, but the objective was to go beyond the stability threshold and see if the system was still mostly functioning. In the “48 hours” test, we generated more than a year of traffic.
The objective was to find out how reliable the system is. Can the system recover if our success rate goes to 97% after 12 hours? Does it break further until it becomes unusable (success rate of 0%)? Does it keep degrading slowly (down to 90%, then 80%, etc.)?
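One way to read the “48 hours” results is to classify the shape of the success-rate curve once it dips; the labels and cutoffs in this sketch are mine, not a standard.

```python
def classify_reliability(success_rates: list[float]) -> str:
    """Classify a success-rate curve (percentages sampled over time, after the first dip)."""
    final = success_rates[-1]
    if final >= 99.5:
        return "recovered"            # the system healed itself without intervention
    if final <= 1.0:
        return "collapsed"            # degraded until effectively unusable
    if all(later <= earlier for earlier, later in zip(success_rates, success_rates[1:])):
        return "degrading steadily"   # slow, monotonic decline
    return "erratic"                  # bouncing around; needs a closer look

print(classify_reliability([97.0, 98.6, 99.9, 100.0]))   # -> "recovered"
print(classify_reliability([97.0, 90.0, 80.0, 62.0]))    # -> "degrading steadily"
```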
All these tests guided our operations team and helped them plan for each product they had to support.
Our experience at Ultimate Software taught us to focus on providing business value. We did it by focusing on four key questions that define our PSR framework. Those questions allowed us to give all stakeholders a comprehensive picture of the product's performance and its limitations.
In the final blog post of this series, we will discuss how the lessons of many years of PSR testing define what Testaify's performance testing software platform will look like.
About the Author
Testaify founder and COO Rafael E. Santos is a Stevie Award winner whose decades-long career includes strategic technology and product leadership roles. Rafael's goal for Testaify is to deliver comprehensive testing through Testaify's AI-first platform, which will change testing forever. Before Testaify, Rafael held executive positions at organizations like Ultimate Software and Trimble eBuilder.
Take the Next Step
Join the waitlist to be among the first to know when you can bring Testaify into your testing process.