These are the components that contain cores, which in turn execute instructions.
The x86 processor is a complex instruction set computer (CISC). In other words, there are a lot of assembly instructions. This is convenient for assembly programmers, but it makes the hardware hard to implement and pipeline for everyone else.
Program Performance
Historically, runtime was dominated by page faults, since servicing one meant going to disk. Therefore, the main optimization goal was to limit the number of page faults; in an embedded system with no disk (so no page faults are possible), the goal was instead to minimize instruction count.
Eventually we got reduced instruction set computers (RISC), which scale more easily and allow more pipelining but are harder to program in assembly.
Barriers
- Instruction-level parallelism is getting close to the limit of what we can do.
- The speed of memory advances has not at all kept up with the advances of CPU technology. Runtime is now dominated by cache misses instead of page faults.
- We are approaching the universal speed limit (speed of light).
Pipelining
To complete an instruction, there are 5 basic steps:
- Fetch an instruction from memory
- Decode the instruction
- Fetch needed operands
- Perform the operation
- Write the result
Pipelining overlaps these steps across instructions: while one instruction is being decoded, the next one can be fetched within the same clock cycle, and so on down the stages.
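To see the overlap, here is a minimal C sketch (not real hardware, just a printout of the schedule) showing four instructions flowing through the five stages; a new instruction enters the fetch stage every cycle:

```c
#include <stdio.h>

/* Toy model of a 5-stage pipeline: in cycle c, instruction i sits in
   stage (c - i), so the stages of different instructions overlap. */
int main(void) {
    const char *stage[] = {"fetch", "decode", "operands", "execute", "write"};
    enum { STAGES = 5, INSTRS = 4 };

    for (int c = 0; c < INSTRS + STAGES - 1; c++) {
        printf("cycle %d:", c + 1);
        for (int i = 0; i < INSTRS; i++) {
            int s = c - i;                    /* stage of instruction i */
            if (s >= 0 && s < STAGES)
                printf("  I%d:%s", i + 1, stage[s]);
        }
        printf("\n");
    }
    return 0;
}
```

Once the pipeline is full, one instruction completes every cycle even though each individual instruction takes five cycles end to end.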
Hazards
- You might need the result of a previous instruction before you can move on (a data hazard; a sketch follows this list)
- CPU resource conflicts, where two instructions need the same hardware unit at once
- Mis-predicted branches
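As a concrete example of the first kind, here is a minimal C sketch of a data hazard; the exact stall behavior depends on the hardware's forwarding logic:

```c
/* Read-after-write data hazard: the multiply reads a result the add
   has not yet written back, so the pipeline must stall or forward it. */
int hazard(int x, int y) {
    int a = x + y;   /* instruction 1 writes a */
    int b = a * 2;   /* instruction 2 needs a immediately */
    return b;
}
```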
Performance Gains
It also takes time to get things from memory. Rather than sitting idle until a register is ready, the core can keep doing useful work with instructions that do not depend on it.
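A minimal sketch of the idea, assuming an out-of-order core: the multiply does not depend on the load, so it can execute while the load is still in flight.

```c
/* Independent work can fill the wait for a slow memory access. */
double overlap(const double *p, double a, double b) {
    double loaded = *p;         /* may miss in cache: long latency */
    double busy   = a * b + a;  /* independent: runs during the wait */
    return loaded + busy;       /* first use of the loaded value */
}
```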
We can also dual-issue instructions if they are consecutive, take the same amount of time, use unrelated registers, and don't consume two of the same resource.
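For example (a sketch; whether a given core actually pairs these depends on its issue rules):

```c
/* Dual-issue-friendly: consecutive, equally cheap, disjoint registers,
   and no shared functional unit, so both adds can start in one cycle. */
void dual(int a, int b, int c, int d, int *x, int *y) {
    *x = a + b;   /* issue slot 1 */
    *y = c + d;   /* issue slot 2: independent of slot 1 */
}
```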
There is also a group of optimizations that go together:
- Register renaming
- Registers are renamed so that later instructions do not have to wait on earlier ones just because both happen to use the same register name. The independent work can then run in parallel (a source-level analogy follows this list).
- Branch Prediction
- At a branch, we speculatively execute the predicted side, writing results into spare registers while keeping the old register values around. Once the branch resolves, we keep the correct results and discard the wrong ones. This lets us get past a cache miss and keep going: basically, we run ahead until we start the next cache miss, because the sooner it starts, the sooner it is over, and the faster the program executes (see the timing demo after this list).
- Speculation
- Out-of-order (OOO) execution

These all work synergistically, and each adds to the benefits the others bring.
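Renaming happens in hardware on physical registers, but the idea can be sketched at the source level (a rough analogy with made-up names, not what the hardware literally does):

```c
/* False dependency: reusing the one name t serializes two unrelated
   computations, because the second write must wait for earlier reads. */
double serialized(double a, double b, double c, double d) {
    double t = a + b;
    double x = t * 2.0;
    t = c + d;            /* reuses t: appears to depend on the above */
    double y = t * 2.0;
    return x + y;
}

/* After "renaming": each result gets its own destination, so the two
   chains are visibly independent and can run in parallel. */
double renamed(double a, double b, double c, double d) {
    double t1 = a + b;
    double x  = t1 * 2.0;
    double t2 = c + d;    /* fresh name: no false dependency */
    double y  = t2 * 2.0;
    return x + y;
}
```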
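The cost of mis-predicted branches can also be measured from userspace. A minimal timing sketch (the sizes and the 128 threshold are arbitrary; compile without aggressive optimization, since an optimizer may replace the branch with a branchless select): the same loop runs much faster once the data is sorted, because the branch becomes almost perfectly predictable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Counts elements >= 128. For random data the branch goes either way
   about half the time and mispredicts often; for sorted data it is
   almost perfectly predictable. */
static long count_big(const int *v, int n) {
    long c = 0;
    for (int i = 0; i < n; i++)
        if (v[i] >= 128)           /* the hard-to-predict branch */
            c++;
    return c;
}

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { N = 1 << 20, REPS = 100 };
    int *v = malloc(N * sizeof *v);
    if (!v)
        return 1;
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    clock_t t0 = clock();
    long c = 0;
    for (int r = 0; r < REPS; r++)
        c += count_big(v, N);
    double unsorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    qsort(v, N, sizeof *v, cmp);   /* sorted: branch becomes predictable */
    t0 = clock();
    for (int r = 0; r < REPS; r++)
        c += count_big(v, N);
    double sorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("unsorted %.2fs, sorted %.2fs (checksum %ld)\n",
           unsorted, sorted, c);
    free(v);
    return 0;
}
```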
A majority of the transistors on a modern x86 chip are spent on cache. SRAM is fast but expensive (6 transistors per bit), while DRAM is slower but cheap (1 transistor and 1 capacitor per bit).
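A quick way to feel why those transistors are worth it: walk the same array in a cache-friendly order and a cache-hostile order. A minimal sketch (the 4096x4096 size is arbitrary; exact ratios vary by machine):

```c
#include <stdio.h>
#include <time.h>

enum { N = 4096 };

int main(void) {
    static int m[N][N];                 /* ~64 MB, far bigger than cache */
    long sum = 0;
    clock_t t0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i + j;

    t0 = clock();
    for (int i = 0; i < N; i++)         /* row-major walk: sequential, */
        for (int j = 0; j < N; j++)     /* each cache line fully used  */
            sum += m[i][j];
    double rows = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int j = 0; j < N; j++)         /* column-major walk: strided, */
        for (int i = 0; i < N; i++)     /* misses the cache constantly */
            sum += m[i][j];
    double cols = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("row-major %.2fs, column-major %.2fs (sum=%ld)\n",
           rows, cols, sum);
    return 0;
}
```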
Multi-Core Processors
These exist because clock speeds are no longer increasing. The more cores a processor has, the more instructions it can execute at the same time. Cores may share a cache.
You can do simultaneous multi-threading (SMT), where a single core runs multiple hardware threads. Intel's Hyper-Threading is an implementation of this.
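A minimal pthreads sketch (compile with -pthread): on a multi-core or SMT machine, the two CPU-bound workers can genuinely run at the same time, roughly halving the wall-clock time versus running them back to back.

```c
#include <pthread.h>
#include <stdio.h>

/* CPU-bound worker: with two cores (or hardware threads), the OS can
   run both workers simultaneously. */
static void *worker(void *arg) {
    long id = (long)arg, sum = 0;
    for (long i = 0; i < 200000000L; i++)
        sum += i % 7;
    printf("worker %ld done (sum=%ld)\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```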
In SMP systems, all CPUs have approximately the same memory access time. In NUMA systems, access speed depends on which CPU is accessing which part of memory. Memory is typically the bottleneck.
Typically, a CPU exposes its hardware threads as virtual CPUs, onto which tasks are scheduled.
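Those virtual CPUs are what affinity masks address. A minimal Linux-specific sketch that pins the calling process to virtual CPU 0:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                      /* virtual CPU 0 */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* pid 0 = self */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU 0 of %ld online virtual CPUs\n",
           sysconf(_SC_NPROCESSORS_ONLN));
    return 0;
}
```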