Bottlenecks are points of congestion in a system that reduce throughput. Usually I’m concerned with hardware or performance bottlenecks.
Finding Bottlenecks
Often we assume the CPU is the problem, but is this always true?
Quote
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. (This is confirmation bias.)
- Sherlock Holmes, in Arthur Conan Doyle’s A Scandal in Bohemia
We need to collect evidence before we know what to blame. There are many points where we could have a bottleneck:
- CPU
- Memory
- Disk
- Network
- Locks

These are obviously broad categories.
Todo
Maybe move these into separate notes later?
Slow workflows could also be caused by a bug (for example, what if we are running a long blocking task on a UI thread?).
CPU
We can check htop, etc. to see how much of the CPU the application is taking up, but this only gives us an instantaneous stat instead of a long-term average. Long-term averages come in the form of the 1-, 5-, and 15-minute CPU load averages, which are more important.
How do we interpret these numbers?
- 0.00 means the system is idle
- 0.01–0.99 means we’re under capacity and there are no delays
- 1.00 means we are exactly at capacity; no delays yet, but no headroom either
- Anything over 1.00 means there is a backup (processes are waiting for CPU time)

>1.00 isn’t necessarily bad, but you should be concerned if it’s consistent. It’s probably time to investigate once load is consistently above ~0.70.
Note
This is all per CPU core, btw: if your load reads 4.00 and the machine has 4 cores, this is chill (full, but not overloaded).
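On Linux, the load averages that htop displays come straight from /proc/loadavg, so a quick sketch of checking per-core load might look like this (assuming nproc from coreutils is available):

```shell
# The first three fields of /proc/loadavg are the 1-, 5-, and 15-minute
# load averages.
read one five fifteen rest < /proc/loadavg
echo "1m=$one 5m=$five 15m=$fifteen"

# Normalize by core count; a per-core load consistently above ~0.70 is
# worth investigating.
cores=$(nproc)
awk -v load="$one" -v cores="$cores" \
    'BEGIN { printf "per-core load: %.2f\n", load / cores }'
```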
Memory
- Is the garbage collector running a lot?
- Do we ever crash because we run out of memory?
We want to look at disk utilization to differentiate between OOM problems and disk problems. Not enough RAM → swapping, which leads to bad performance and scalability. You can find this information using htop as well, but this isn’t too interesting.
Note that memory being full isn’t necessarily a bad thing: it means the resource is being used to its maximum potential (much of “used” memory is usually page cache), and there is no benefit to keeping a block of memory free for no reason.
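On Linux we can answer both questions by reading /proc/meminfo directly; MemAvailable is the kernel’s estimate of memory usable without swapping, which matters more than raw “free”:

```shell
# Key fields: MemAvailable is what new allocations can actually get
# without pushing anything out to swap.
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree)' /proc/meminfo

# Rough percentage of memory still available to new allocations.
awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2}
     END { printf "available: %.0f%%\n", 100 * a / t }' /proc/meminfo
```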
Disk
Page Faults
We can ask about these using ps -eo min_flt,maj_flt,cmd
- Major faults are ones where we had to fetch from disk
- Minor faults are ones where the page was already in memory (e.g., shared with another process) and only needed to be mapped in
- The output of this is quite large, and these counts are accumulated over each process’s lifetime
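Sorting that output makes it more digestible; a sketch using procps’ --sort option (comm just gives a shorter command column than cmd):

```shell
# Top 4 processes by lifetime major (disk-hitting) page faults.
ps -eo min_flt,maj_flt,comm --sort=-maj_flt | head -n 5
```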
What is more interesting is a report on swapping.
vmstat 5 prints system-wide statistics every 5 seconds; the si (swap in) and so (swap out) columns show how much memory is being swapped.
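The cumulative counters behind vmstat’s si/so columns also live in /proc/vmstat, which is handy for a one-shot reading instead of watching live output:

```shell
# Pages swapped in/out since boot; if these keep growing while the
# system is under load, the working set does not fit in RAM.
grep -E '^pswp(in|out)' /proc/vmstat
```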
We can use iostat -dx /dev/XXX 5, where we would sub in a disk device, to see how busy the disk is (the %util column).
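If iostat isn’t installed (it ships in the sysstat package), the raw counters it reads are in /proc/diskstats; field 13 is milliseconds spent doing I/O since boot, which is what %util is derived from:

```shell
# Device name (field 3) and total ms spent doing I/O (field 13).
awk '{ printf "%-12s io_ms=%s\n", $3, $13 }' /proc/diskstats | head -n 5
```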
Network
We can ask about network usage using nload, and we’ll get a summary of data in and out. If data is leaving at the link’s full capacity (say, 100 MBit/s on a 100 MBit link), the network is probably the bottleneck. We can use tools like network speed tests to check the connection’s maximum download and upload speeds.
You can use traceroute address (tracert on Windows) to show us the “hops” required to get to an address. Latency can never get down to zero, and both latency and packet loss tend to increase with distance (we need to wait for packets to arrive).
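nload usually has to be installed separately; the byte counters it plots come from /proc/net/dev, so a rough sketch of reading throughput directly (sample twice and subtract to get a rate):

```shell
# Per-interface bytes received ($2) and transmitted ($10) since boot.
# The first two lines of /proc/net/dev are headers, so skip them; the
# interface name in $1 keeps its trailing colon.
awk 'NR > 2 { printf "%-10s rx_bytes=%s tx_bytes=%s\n", $1, $2, $10 }' \
    /proc/net/dev
```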
Locks
Maybe we have a deadlock, or our code is slow because we are waiting on locks. We can try to find lock issues by looking for:
- Unexpectedly low CPU usage that isn’t explained by waiting on IO
- Many threads blocked at the same time

There is no magic lock-tracing tool (perf lock is for kernel locks only, not application locks).
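With no lock tracer to lean on, a crude smell test is to count thread states with ps: many sleeping (“S”) or uninterruptible (“D”) threads combined with low CPU usage is at least consistent with lock contention (a sketch, assuming procps):

```shell
# Count threads by one-letter state: R=running, S=sleeping,
# D=uninterruptible wait, Z=zombie, T=stopped. Skip the header line.
ps -eLo state | tail -n +2 | sort | uniq -c | sort -rn
```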
With all of this said, it probably is the CPU.
Fixing Bottlenecks
This is way harder than finding them. Maybe the client just runs on old hardware.