DevOps / Sys Admin Q & A #10 : Trouble Shooting
What does performance mean to us?
To troubleshoot performance issues of our system, we need to understand the relationship between load, throughput, and response time.
- Load is where it all starts. It is the demand for work to be done (for example, 500 queries per second). Without it, we have no work and, therefore, no response time. Load is also what causes performance to degrade. At some point, the demand for work exceeds what the application can deliver, and that is when bottlenecks occur.
- Throughput (for example, 1,000 requests per second) is the execution rate of the work being demanded. The relationship between load and throughput is predictable. As load increases, throughput also increases until some resource becomes saturated, after which throughput plateaus. A throughput plateau is an indicator that our application no longer scales.
- Response time (typically measured in milliseconds) is a side effect of throughput. While throughput increases proportionately to load, response time increases only negligibly, but once the system reaches the throughput plateau, response time rises steeply, producing the telltale "hockey stick" curve as queuing occurs.
Source: reference 1
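The relationship between the three quantities can be sketched with a toy model. This is only an illustration, not the model used in reference 1: it assumes a single-server M/M/1 queue, where mean response time is 1 / (capacity - load), and the names `throughput`, `response_time`, and `CAPACITY` are made up for the example.

```python
def throughput(load: float, capacity: float) -> float:
    """Completed work per second: follows load until capacity is reached."""
    return min(load, capacity)


def response_time(load: float, capacity: float) -> float:
    """Mean response time in seconds under the assumed M/M/1 model."""
    if load >= capacity:
        return float("inf")  # demand exceeds capacity, so the queue grows without bound
    return 1.0 / (capacity - load)


if __name__ == "__main__":
    CAPACITY = 1000.0  # assumed: requests per second the system can serve
    for load in (100, 500, 900, 990, 999, 1200):
        tput = throughput(load, CAPACITY)
        rt_ms = response_time(load, CAPACITY) * 1000
        print(f"load={load:>5}  throughput={tput:>6.0f}  response={rt_ms:.2f} ms")
```

Under these assumptions, response time barely moves while load is well below capacity and then climbs very quickly as load approaches it, while throughput simply flattens at the capacity value.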
The relationship between load and throughput becomes increasingly important in complex multi-tier applications. When a throughput plateau occurs, it may be visible across multiple tiers simultaneously, as shown in the picture below:
It is tempting to blame downstream components in situations like this, because poor performance downstream usually bubbles upstream. That can be a fair assessment for response time; however, it does not always apply to throughput.
By comparing throughput to load at each tier, we can identify the root cause. Here's one possible scenario corresponding to the throughput above, where the load scales linearly at each tier. In this case, because the load kept increasing at the database tier while its throughput stayed flat, we can safely conclude that the bottleneck is indeed in the downstream database tier.
We may have another scenario where the app server load increases linearly, but does not propagate to the database tier as shown in the picture below:
So, the bottleneck is not in the database tier but in the application server, which is not passing the load down to the database server. Note that load is the demand for work, and at some point the database is not being asked for anything additional. Without additional load, we won't have additional throughput.
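The two scenarios above can be turned into a small check. This is a minimal sketch under stated assumptions: each tier reports (load, throughput) samples ordered by time, tiers are listed from upstream to downstream, and a tier counts as saturated when its load grew over the last interval but its throughput did not. The function names, the 5% tolerance, and the sample data are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Samples = List[Tuple[float, float]]  # (load, throughput) pairs, ordered by time


def is_saturated(samples: Samples, tolerance: float = 0.05) -> bool:
    """True if load rose over the final interval but throughput did not follow."""
    if len(samples) < 2:
        return False
    (prev_load, prev_tput), (last_load, last_tput) = samples[-2], samples[-1]
    load_grew = last_load > prev_load * (1 + tolerance)
    tput_flat = last_tput <= prev_tput * (1 + tolerance)
    return load_grew and tput_flat


def find_bottleneck(tiers: Dict[str, Samples]) -> Optional[str]:
    """Return the most downstream saturated tier, or None if no tier is saturated.

    `tiers` must be ordered from upstream to downstream (dicts keep insertion order).
    """
    bottleneck = None
    for name, samples in tiers.items():
        if is_saturated(samples):
            bottleneck = name  # keep overwriting so the last (most downstream) match wins
    return bottleneck


# Hypothetical measurements: throughput plateaus at 200 on every tier.
rising = [(100, 100), (200, 200), (300, 200), (400, 200)]
stalled = [(100, 100), (200, 200), (200, 200), (200, 200)]

# Scenario 1: load keeps rising on every tier, including the database.
scenario_1 = {"web": rising, "app": rising, "db": rising}
# Scenario 2: load stops propagating from the app server to the database.
scenario_2 = {"web": rising, "app": rising, "db": stalled}

if __name__ == "__main__":
    print(find_bottleneck(scenario_1))
    print(find_bottleneck(scenario_2))
```

The design choice here is to pick the last tier that is still receiving more load than it completes: any tier further downstream is not being asked for more work, so its flat throughput is a symptom rather than the cause.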
Note: this section is based on ref #1.
"A leak occurs whenever an application uses a resource and then doesn't give it back when it's done. Possible resources include memory, file handles, database connections, and many other things. Even resources like CPU and I/O can leak if the calling code encounters an unexpected condition that causes a loop it can't break out of, and then processing accumulates over time as more instances of that code stack up."
Do we have a leak?
Here are some signs that we most likely have a leak, or some leak-like behavior, in our application:
- App gets progressively slower over time, requiring routine restarts to resolve.
- App gets progressively slower over time, but restarts don't help.
- App runs fine for a few days or weeks and then suddenly starts failing, requiring a restart.
- Can see some resource's utilization growing over time, requiring a restart to resolve the issue.
- Haven't made any changes to the app's code or environment, yet its behavior changes over time.
We can monitor the heap utilization (via heap dumps) of each of the app's objects in real time and trend it historically:
In the chart, the blue and green objects appear to be leaking.
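Trending heap utilization can also be automated. The sketch below assumes we already collect periodic heap-size samples (in MB) per object type from whatever monitoring tool is in use; it fits a least-squares line to each series and flags the ones that keep growing. The object names, sample values, and growth threshold are hypothetical and would need tuning for a real application.

```python
from typing import Dict, List


def slope(values: List[float]) -> float:
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def leak_suspects(heap_mb: Dict[str, List[float]],
                  min_growth_per_sample: float = 1.0) -> List[str]:
    """Names of object types whose heap usage grows at least the given MB per sample."""
    return sorted(
        name for name, series in heap_mb.items()
        if slope(series) >= min_growth_per_sample
    )


# Hypothetical heap samples in MB, one value per collection interval.
samples = {
    "SessionCache": [10, 14, 19, 25, 30, 36],   # grows steadily
    "RequestBuffer": [20, 22, 19, 21, 20, 21],  # fluctuates around a constant
}

if __name__ == "__main__":
    print(leak_suspects(samples))
```

A sustained positive slope is only a hint: a cache that is still warming up looks the same over a short window, so the series should cover enough time to include normal garbage collection cycles.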
- Hidden in Plain Sight: Practical Tips for Detecting and Fixing the Underlying Causes of Common Application Performance Problems