These are the components that contain cores, which in turn execute instructions.
The x86 processor is a complex instruction set computer (CISC). In other words, there are a lot of assembly instructions. This is convenient for assembly programmers, but it makes the hardware hard to implement and pipeline for everyone else.
Program Performance
Historically, runtime was dominated by page faults, since servicing one meant going to disk. Therefore, the main optimization goal was to limit the number of page faults; in an embedded system with no disk (so no page faults are possible), the goal was instead to minimize instruction count.
Eventually we got reduced instruction set computers (RISC), which scale more easily and allow more pipelining but are harder to program in assembly.
Barriers
- Instruction-level parallelism is getting close to the limit of what we can do.
- The speed of memory advances has not at all kept up with the advances of CPU technology. Runtime is now dominated by cache misses instead of page faults.
- We are approaching the universal speed limit (speed of light).
Pipelining
To complete an instruction, there are 5 basic steps:
- Fetch an instruction from memory
- Decode the instruction
- Fetch needed operands
- Perform the operation
- Write the result
Pipelining overlaps these steps across instructions: while one instruction is being decoded, the next one can be fetched within the same clock cycle, and so on down the stages.
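To see the overlap, here is a minimal C sketch (not real hardware, just a printout of the schedule) showing four instructions flowing through the five stages; a new instruction enters the fetch stage every cycle:

```c
#include <stdio.h>

/* Toy model of a 5-stage pipeline: in cycle c, instruction i sits in
   stage (c - i), so the stages of different instructions overlap. */
int main(void) {
    const char *stage[] = {"fetch", "decode", "operands", "execute", "write"};
    enum { STAGES = 5, INSTRS = 4 };

    for (int c = 0; c < INSTRS + STAGES - 1; c++) {
        printf("cycle %d:", c + 1);
        for (int i = 0; i < INSTRS; i++) {
            int s = c - i;                    /* stage of instruction i */
            if (s >= 0 && s < STAGES)
                printf("  I%d:%s", i + 1, stage[s]);
        }
        printf("\n");
    }
    return 0;
}
```

Once the pipeline is full, one instruction completes every cycle even though each individual instruction takes five cycles end to end.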
Hazards
- You might need the result of a previous instruction before you can move on (a data hazard; a sketch follows this list)
- CPU resource conflicts, where two instructions need the same hardware unit at once
- Mis-predicted branches
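As a concrete example of the first kind, here is a minimal C sketch of a data hazard; the exact stall behavior depends on the hardware's forwarding logic:

```c
/* Read-after-write data hazard: the multiply reads a result the add
   has not yet written back, so the pipeline must stall or forward it. */
int hazard(int x, int y) {
    int a = x + y;   /* instruction 1 writes a */
    int b = a * 2;   /* instruction 2 needs a immediately */
    return b;
}
```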
Performance Gains
It also takes time to get things from memory. Rather than sitting idle until a register is ready, the core can keep doing useful work with instructions that do not depend on it.
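A minimal sketch of the idea, assuming an out-of-order core: the multiply does not depend on the load, so it can execute while the load is still in flight.

```c
/* Independent work can fill the wait for a slow memory access. */
double overlap(const double *p, double a, double b) {
    double loaded = *p;         /* may miss in cache: long latency */
    double busy   = a * b + a;  /* independent: runs during the wait */
    return loaded + busy;       /* first use of the loaded value */
}
```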
We can also dual-issue instructions if they are consecutive, take the same amount of time, use unrelated registers, and don't consume two of the same resource.
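For example (a sketch; whether a given core actually pairs these depends on its issue rules):

```c
/* Dual-issue-friendly: consecutive, equally cheap, disjoint registers,
   and no shared functional unit, so both adds can start in one cycle. */
void dual(int a, int b, int c, int d, int *x, int *y) {
    *x = a + b;   /* issue slot 1 */
    *y = c + d;   /* issue slot 2: independent of slot 1 */
}
```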
There is also a group of optimizations that go together:
- Register renaming
- Registers are renamed so that later instructions do not have to wait on earlier ones just because both happen to use the same register name. The independent work can then run in parallel (a source-level analogy follows this list).
- Branch Prediction
- At a branch, we speculatively execute the predicted side, writing results into spare registers while keeping the old register values around. Once the branch resolves, we keep the correct results and discard the wrong ones. This lets us get past a cache miss and keep going: basically, we run ahead until we start the next cache miss, because the sooner it starts, the sooner it is over, and the faster the program executes (see the timing demo after this list).
- Speculation
- Out-of-order (OOO) execution

These all work synergistically, and each adds to the benefits the others bring.
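Renaming happens in hardware on physical registers, but the idea can be sketched at the source level (a rough analogy with made-up names, not what the hardware literally does):

```c
/* False dependency: reusing the one name t serializes two unrelated
   computations, because the second write must wait for earlier reads. */
double serialized(double a, double b, double c, double d) {
    double t = a + b;
    double x = t * 2.0;
    t = c + d;            /* reuses t: appears to depend on the above */
    double y = t * 2.0;
    return x + y;
}

/* After "renaming": each result gets its own destination, so the two
   chains are visibly independent and can run in parallel. */
double renamed(double a, double b, double c, double d) {
    double t1 = a + b;
    double x  = t1 * 2.0;
    double t2 = c + d;    /* fresh name: no false dependency */
    double y  = t2 * 2.0;
    return x + y;
}
```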
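The cost of mis-predicted branches can also be measured from userspace. A minimal timing sketch (the sizes and the 128 threshold are arbitrary; compile without aggressive optimization, since an optimizer may replace the branch with a branchless select): the same loop runs much faster once the data is sorted, because the branch becomes almost perfectly predictable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Counts elements >= 128. For random data the branch goes either way
   about half the time and mispredicts often; for sorted data it is
   almost perfectly predictable. */
static long count_big(const int *v, int n) {
    long c = 0;
    for (int i = 0; i < n; i++)
        if (v[i] >= 128)           /* the hard-to-predict branch */
            c++;
    return c;
}

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { N = 1 << 20, REPS = 100 };
    int *v = malloc(N * sizeof *v);
    if (!v)
        return 1;
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    clock_t t0 = clock();
    long c = 0;
    for (int r = 0; r < REPS; r++)
        c += count_big(v, N);
    double unsorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    qsort(v, N, sizeof *v, cmp);   /* sorted: branch becomes predictable */
    t0 = clock();
    for (int r = 0; r < REPS; r++)
        c += count_big(v, N);
    double sorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("unsorted %.2fs, sorted %.2fs (checksum %ld)\n",
           unsorted, sorted, c);
    free(v);
    return 0;
}
```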
A majority of the transistors on a modern x86 chip are spent on cache. SRAM is fast but expensive (6 transistors per bit), while DRAM is slower but cheap (1 transistor and 1 capacitor per bit).
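A quick way to feel why those transistors are worth it: walk the same array in a cache-friendly order and a cache-hostile order. A minimal sketch (the 4096x4096 size is arbitrary; exact ratios vary by machine):

```c
#include <stdio.h>
#include <time.h>

enum { N = 4096 };

int main(void) {
    static int m[N][N];                 /* ~64 MB, far bigger than cache */
    long sum = 0;
    clock_t t0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i + j;

    t0 = clock();
    for (int i = 0; i < N; i++)         /* row-major walk: sequential, */
        for (int j = 0; j < N; j++)     /* each cache line fully used  */
            sum += m[i][j];
    double rows = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int j = 0; j < N; j++)         /* column-major walk: strided, */
        for (int i = 0; i < N; i++)     /* misses the cache constantly */
            sum += m[i][j];
    double cols = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("row-major %.2fs, column-major %.2fs (sum=%ld)\n",
           rows, cols, sum);
    return 0;
}
```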
Multi-Core Processors
These exist because clock speeds are no longer increasing. The more cores a processor has, the more instructions it can execute at the same time. Cores may share a cache.
You can do simultaneous multi-threading (SMT), where a single core runs multiple hardware threads. Intel's Hyper-Threading is an implementation of this.
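A minimal pthreads sketch (compile with -pthread): on a multi-core or SMT machine, the two CPU-bound workers can genuinely run at the same time, roughly halving the wall-clock time versus running them back to back.

```c
#include <pthread.h>
#include <stdio.h>

/* CPU-bound worker: with two cores (or hardware threads), the OS can
   run both workers simultaneously. */
static void *worker(void *arg) {
    long id = (long)arg, sum = 0;
    for (long i = 0; i < 200000000L; i++)
        sum += i % 7;
    printf("worker %ld done (sum=%ld)\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```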
In SMP systems, all CPUs have approximately the same memory access time. In NUMA systems, access speed depends on which CPU is accessing which part of memory. Memory is typically the bottleneck.
Typically, a CPU exposes its hardware threads as virtual CPUs, onto which tasks are scheduled.
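Those virtual CPUs are what affinity masks address. A minimal Linux-specific sketch that pins the calling process to virtual CPU 0:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                      /* virtual CPU 0 */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* pid 0 = self */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU 0 of %ld online virtual CPUs\n",
           sysconf(_SC_NPROCESSORS_ONLN));
    return 0;
}
```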