Distributions of time spent in different areas. We need to collect data on the parts of code that take up the most time.
- What functions get called
- How long do functions take to run
- What is using memory
Console Profiling
This is like printf debugging, but for profiling. It kind of works, but it is generally considered "invasive profiling": we need to do a lot of manual accounting, and it doesn't really scale.
Tools
The idea of using profiling tools is to have a systematic way to measure performance without needing to be a wizard 10x dev.
Profilers
- Flat Profiler
- Call Graph Profiler
Data Gathering
These tools get data in one of two ways:
- Sample at some frequency
- Add instrumentation at certain program points, like conditional breakpoints, at compile or run time
Guide to Profiling
- Write clear and concise code
- Profile to get a baseline of performance

Good signs
- Make sure time is spent in the right place of the system (know what the right place is)
- Time should not be spent handling errors
- Time is not unnecessarily spent in the OS
Summary
You can profile systems in development, but that might not reflect production scale or complexity. If you profile in production, the profiling must not impact performance.
ECE459
We use perf, a tool built on the Linux kernel's perf_events API.
Consistency
The profiler is the prosecutor. We need to collect supporting evidence to make sure that we end up with the correct narrative instead of confirmation bias. We want to create microbenchmarks:
- Memory access to uncached locations or computations
- Use perf to evaluate the impact of mfence vs lock (in the lecture example)
- Look at the total number of cycles, not just percentages. When we benchmark to see the proportion (%) of overhead in our system, we also care about how long our program takes to run overall. If the overhead percentage goes down but the program gets slower, that is obviously worse.
Skid
Another cause of wrongly attributing cost to instructions is skid: the delay between when the counter overflows and when the actual sample is taken.
Tail Latency
When a task is distributed over multiple machines, the overall request can only finish as fast as the slowest machine, so the tail of the latency distribution dominates.
Note
perf samples are also taken via interrupts, which are slow. If you increase the rate at which these fire, you spend more time managing the interrupts than doing real work. SHIM gets around this by being invasive.
Counters
There are things you can do to make counters more deterministic:
- Disable address space layout randomization (ASLR exists for security: it makes it harder for an attacker who over-reads allocated memory to find sensitive data, or to jump to instructions written into overflowed memory)
- Subtract time spent processing interrupts
- Profile one thread at a time
Context
Tools can lie to us as well. gprof uses two library functions: profil, which records which instruction is executing 100 times per second, and mcount, which records call-graph edges. Time is then distributed to callers proportionally to call counts, which can make two functions with unequal CPU usage seem to take equal amounts of time.
Summary
- Results can be exact, sampled, or interpolated. Understand your tools!