Profiling is about distributions of time spent in different areas of a program. We need to collect data on the parts of the code that take up the most time:

  • What functions get called
  • How long functions take to run
  • What is using memory

Console Profiling

This is printf debugging applied to profiling: print timestamps or counters around the code of interest. It kind of works, but it is generally considered “invasive profiling”: it requires a lot of manual accounting and doesn’t really scale.
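A minimal sketch of the manual accounting involved, assuming C on a POSIX system (do_work is a hypothetical stand-in for the code being measured):

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical function standing in for the code we want to measure. */
static void do_work(void) {
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000000UL; i++) sum += i;
}

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do_work();
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ms = (end.tv_sec - start.tv_sec) * 1e3
              + (end.tv_nsec - start.tv_nsec) / 1e6;
    printf("do_work took %.2f ms\n", ms);   /* the manual accounting */
    return 0;
}
```

Every region of interest needs its own pair of calls and its own print statement, which is exactly why this doesn’t scale.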

Tools

The idea of using profiling tools is to have a systematic way to measure performance without needing to be a wizard 10x dev.

Profilers

  • Flat profiler
  • Call graph profiler

Data Gathering

These tools get data in one of two ways:

  • Sampling the program at some frequency
  • Adding instrumentation at certain program points, like conditional breakpoints, at compile or run time (see the sketch below)
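As a sketch of compile-time instrumentation: GCC and Clang can insert a hook call at every function entry and exit via -finstrument-functions. The simple call counter here is an illustrative assumption; a real profiler would record per-function timestamps:

```c
#include <stdio.h>

static unsigned long calls;

/* Hooks the compiler invokes around every instrumented function when the
 * file is built with: gcc -finstrument-functions demo.c */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
    calls++;   /* a real profiler would log fn and a timestamp here */
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) { }

void leaf(void) { }
void branch(void) { leaf(); leaf(); }

int main(void) {
    branch();
    printf("instrumented calls so far: %lu\n", calls);  /* main, branch, 2x leaf */
    return 0;
}
```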

Guide to Profiling

  • Write clear and concise code
  • Profile to get a baseline of performance

Good signs

  • Time is spent in the right place of the system (you have to know what the right place is)
  • Time is not spent handling errors
  • Time is not unnecessarily spent in the OS

Summary

You can profile systems in development, but that might not capture production scale or complexity. Profiling in production, on the other hand, must not itself impact performance.

ECE459

We use perf, a profiler that is a front end to the perf_event API in the Linux kernel.
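To make the kernel API concrete, here is a minimal sketch in the style of the perf_event_open(2) man page example; it counts the hardware instructions retired by a made-up workload:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped; enable around the workload */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* perf_event_open has no glibc wrapper; invoke it via syscall(2). */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t sum = 0;                     /* made-up workload */
    for (uint64_t i = 0; i < 1000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```

The perf command-line tool drives this same API for you, sampling and attributing events across a whole program.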

Consistency

The profiler is the prosecutor: we need to collect supporting evidence to make sure that we end up with the correct narrative instead of confirmation bias. We want to create microbenchmarks:

  • Memory accesses to uncached locations, or isolated computations
  • Using perf to evaluate the impact of mfence vs lock (the lecture example; see the sketch after this list)
  • Looking at the total number of cycles, not just percentages. When we benchmark to see the proportional overhead (%) of our system, we also care how long our program takes to run: if our overhead percentage goes down but our program gets slower, that is obviously worse.
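A sketch of such a microbenchmark, assuming x86-64 and GCC or Clang; the iteration count and rdtsc-based timing are illustrative choices, not the lecture’s exact code:

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define N 10000000ULL

static uint64_t counter;

int main(void) {
    uint64_t start, end;

    /* Variant 1: lock-prefixed atomic increment (compiles to lock xadd). */
    start = __rdtsc();
    for (uint64_t i = 0; i < N; i++)
        __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    end = __rdtsc();
    printf("lock:   %llu cycles\n", (unsigned long long)(end - start));

    /* Variant 2: plain increment followed by a full memory fence. */
    start = __rdtsc();
    for (uint64_t i = 0; i < N; i++) {
        counter++;
        _mm_mfence();
    }
    end = __rdtsc();
    printf("mfence: %llu cycles\n", (unsigned long long)(end - start));

    return 0;
}
```

Running each variant under perf stat additionally reports total cycles for the whole program, which is the number the last bullet says to keep an eye on.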

Skid

Another cause of wrongly attributing cost to instructions is skid: the delay between when the performance counter overflows and when the sample is actually taken. The CPU keeps executing in that window, so the event can be charged to an instruction near, but after, the one actually responsible.

Tail Latency

When a task is distributed over multiple machines, the whole task can only go as fast as the slowest machine: nothing is done until the last piece is. At scale, the tail of the latency distribution therefore dominates.
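A toy simulation of why the tail matters, assuming each machine is slow 1% of the time (all numbers here are made up for illustration):

```c
#include <stdio.h>
#include <stdlib.h>

#define MACHINES 100
#define TRIALS   10000

int main(void) {
    srand(42);
    int slow_tasks = 0;
    for (int t = 0; t < TRIALS; t++) {
        int worst = 0;
        for (int m = 0; m < MACHINES; m++) {
            int latency = (rand() % 100 == 0) ? 1000 : 10;  /* 1% slow */
            if (latency > worst) worst = latency;
        }
        if (worst == 1000) slow_tasks++;    /* task waits on slowest machine */
    }
    /* With 100 machines, P(at least one slow) = 1 - 0.99^100 ≈ 63%. */
    printf("tasks hitting the tail: %.1f%%\n", 100.0 * slow_tasks / TRIALS);
    return 0;
}
```

Even though any single machine is slow only 1% of the time, most distributed tasks end up waiting on a slow machine.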

Note

perf samples are also delivered via interrupts, which are slow. If you increase the rate at which they fire, you spend more time servicing interrupts than doing actual work. SHIM gets around this by being invasive: it observes from inside the running program rather than interrupting it.

Counters

There are things you can do to make counters more deterministic:

  • Disable address space layout randomization (ASLR exists for security: it makes addresses unpredictable, so over-reading allocated memory or writing instructions into overflowed memory is harder to exploit, but that same unpredictability makes profiling runs non-repeatable; see the sketch after this list)
  • Subtract time spent processing interrupts
  • Profile one thread at a time
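A small sketch of the ASLR effect, assuming Linux: under ASLR the stack address differs on every run, and personality(2) can turn the randomization off for a re-executed copy of the process (setarch --addr-no-randomize does the same from the shell):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/personality.h>

int main(void) {
    int local;
    /* Under ASLR this address changes on every run, which makes
     * address-dependent behaviour non-deterministic across runs. */
    printf("stack address: %p\n", (void *)&local);

    /* Re-exec ourselves once with randomization disabled; the second
     * print is then stable across runs. */
    if (!(personality(0xffffffff) & ADDR_NO_RANDOMIZE)) {
        personality(ADDR_NO_RANDOMIZE);
        execl("/proc/self/exe", "aslr-demo", (char *)NULL);
    }
    return 0;
}
```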

Context

Tools can lie to us as well. gprof uses two library facilities: profil, which records which instruction is running at 100 samples per second, and mcount, which records call graph edges. gprof then distributes a routine’s sampled time among its callers in proportion to call counts, so two functions that are unequal in terms of actual CPU usage can appear to take equal amounts of time (see the sketch below).
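A hypothetical demo of this pitfall: both callers invoke work() the same number of times, but cheap_caller asks for far less work per call. Because gprof splits work()’s time between callers by call count, the report charges them roughly 50/50 even though expensive_caller consumes nearly all of it:

```c
/* Build and inspect with:
 *   gcc -O0 -pg gprof_demo.c -o demo && ./demo && gprof demo gmon.out
 * (-O0 keeps the calls from being inlined away)                        */
#include <stdio.h>

static volatile unsigned long sink;

void work(unsigned long n) {
    for (unsigned long i = 0; i < n; i++) sink += i;
}

void cheap_caller(void)     { for (int i = 0; i < 1000; i++) work(10); }
void expensive_caller(void) { for (int i = 0; i < 1000; i++) work(1000000); }

int main(void) {
    cheap_caller();      /* 1000 calls, almost no time */
    expensive_caller();  /* 1000 calls, nearly all the time */
    printf("%lu\n", sink);
    return 0;
}
```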

Summary

  • Results can be exact, sampled, or interpolated. Understand your tools!