Bottlenecks are points of congestion in a system that reduce throughput. Usually I’m concerned with hardware or performance bottlenecks.
Finding Bottlenecks
Often we assume the CPU is the problem, but is this always true?
Quote
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. (This is confirmation bias.)
- Sherlock Holmes, in Arthur Conan Doyle’s A Scandal in Bohemia
We need to collect evidence before we know what to blame. There are many points where we could have a bottleneck:
- CPU
- Memory
- Disk
- Network
- Locks

These are obviously broad categories.
Todo
Maybe move these into separate notes later?
Slow workflows could also be caused by a bug (for example, what if we are running a long blocking task on a UI thread?).
CPU
We can check htop, etc. to see how much of the CPU the application is taking up, but this only gives us an instantaneous stat instead of a long-term average. Long-term averages come in the form of the 1-, 5-, and 15-minute CPU load averages, which are more important.
How do we interpret these numbers?
- 0.00 means the system is idle
- 0.01–0.99 means we’re under capacity and there are no delays
- 1.00 means we are exactly at capacity; no delays yet, but no headroom either
- Anything over 1.00 means there is a backup (processes are waiting for CPU time)

>1.00 isn’t necessarily bad, but you should be concerned if it’s consistent. It’s probably time to investigate once load is consistently above ~0.70.
Note
This is all per CPU core, btw: if your load reads 4.00 and the machine has 4 cores, this is chill (full, but not overloaded).
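On Linux, the load averages that htop displays come straight from /proc/loadavg, so a quick sketch of checking per-core load might look like this (assuming nproc from coreutils is available):

```shell
# The first three fields of /proc/loadavg are the 1-, 5-, and 15-minute
# load averages.
read one five fifteen rest < /proc/loadavg
echo "1m=$one 5m=$five 15m=$fifteen"

# Normalize by core count; a per-core load consistently above ~0.70 is
# worth investigating.
cores=$(nproc)
awk -v load="$one" -v cores="$cores" \
    'BEGIN { printf "per-core load: %.2f\n", load / cores }'
```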
Memory
- Is the garbage collector running a lot?
- Do we ever crash because we run out of memory?
We want to look at disk utilization to differentiate between OOM problems and disk problems. Not enough RAM → swapping, which leads to bad performance and scalability. You can find this information using htop as well, but this isn’t too interesting.
Note that memory being full isn’t necessarily a bad thing: it means the resource is being used to its maximum potential (much of “used” memory is usually page cache), and there is no benefit to keeping a block of memory free for no reason.
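On Linux we can answer both questions by reading /proc/meminfo directly; MemAvailable is the kernel’s estimate of memory usable without swapping, which matters more than raw “free”:

```shell
# Key fields: MemAvailable is what new allocations can actually get
# without pushing anything out to swap.
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree)' /proc/meminfo

# Rough percentage of memory still available to new allocations.
awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2}
     END { printf "available: %.0f%%\n", 100 * a / t }' /proc/meminfo
```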
Disk
Page Faults
We can ask about these using ps -eo min_flt,maj_flt,cmd
- Major faults are ones where we had to fetch from disk
- Minor faults are ones where the page was already in memory (e.g., shared with another process) and only needed to be mapped in
- The output of this is quite large, and these counts are accumulated over each process’s lifetime
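Sorting that output makes it more digestible; a sketch using procps’ --sort option (comm just gives a shorter command column than cmd):

```shell
# Top 4 processes by lifetime major (disk-hitting) page faults.
ps -eo min_flt,maj_flt,comm --sort=-maj_flt | head -n 5
```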
What is more interesting is a report on swapping.
vmstat 5 prints system-wide statistics every 5 seconds; the si (swap in) and so (swap out) columns show how much memory is being swapped.
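The cumulative counters behind vmstat’s si/so columns also live in /proc/vmstat, which is handy for a one-shot reading instead of watching live output:

```shell
# Pages swapped in/out since boot; if these keep growing while the
# system is under load, the working set does not fit in RAM.
grep -E '^pswp(in|out)' /proc/vmstat
```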
We can use iostat -dx /dev/XXX 5, where we would sub in a disk device, to see how busy the disk is (the %util column).
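If iostat isn’t installed (it ships in the sysstat package), the raw counters it reads are in /proc/diskstats; field 13 is milliseconds spent doing I/O since boot, which is what %util is derived from:

```shell
# Device name (field 3) and total ms spent doing I/O (field 13).
awk '{ printf "%-12s io_ms=%s\n", $3, $13 }' /proc/diskstats | head -n 5
```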
Network
We can ask about network usage using nload, and we’ll get a summary of data in and out. If data is leaving at the link’s full capacity (say, 100 MBit/s on a 100 MBit link), the network is probably the bottleneck. We can use tools like network speed tests to check the connection’s maximum download and upload speeds.
You can use traceroute address (tracert on Windows) to show us the “hops” required to get to an address. Latency can never get down to zero, and both latency and packet loss tend to increase with distance (we need to wait for packets to arrive).
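nload usually has to be installed separately; the byte counters it plots come from /proc/net/dev, so a rough sketch of reading throughput directly (sample twice and subtract to get a rate):

```shell
# Per-interface bytes received ($2) and transmitted ($10) since boot.
# The first two lines of /proc/net/dev are headers, so skip them; the
# interface name in $1 keeps its trailing colon.
awk 'NR > 2 { printf "%-10s rx_bytes=%s tx_bytes=%s\n", $1, $2, $10 }' \
    /proc/net/dev
```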
Locks
Maybe we have a deadlock, or our code is slow because we are waiting on locks. We can try to find lock issues by looking for:
- Unexpectedly low CPU usage that isn’t explained by waiting on IO
- Many threads blocked at the same time

There is no magic lock-tracing tool (perf lock is for kernel locks only, not application locks).
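With no lock tracer to lean on, a crude smell test is to count thread states with ps: many sleeping (“S”) or uninterruptible (“D”) threads combined with low CPU usage is at least consistent with lock contention (a sketch, assuming procps):

```shell
# Count threads by one-letter state: R=running, S=sleeping,
# D=uninterruptible wait, Z=zombie, T=stopped. Skip the header line.
ps -eLo state | tail -n +2 | sort | uniq -c | sort -rn
```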
With all of this said, it probably is the CPU.
Fixing Bottlenecks
This is way harder than finding them. Maybe the client just runs on old hardware.