This is like hiring a call center to handle large volumes of support calls, all in the same way.

If we want to save space (and extract more parallelism), we should use the smallest type that actually fits our data. This helps with parallelism because smaller elements mean more values fit into each SIMD register, so a single instruction processes more of them at once.
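As a quick sketch of why element size matters, the lane count of a 128-bit register is just 128 divided by the element width in bits (the helper name `lanes_128` is made up for illustration):

```rust
/// Number of lanes a 128-bit SIMD register can hold for element type T.
/// Smaller element types mean more values processed per instruction.
fn lanes_128<T>() -> usize {
    128 / (8 * std::mem::size_of::<T>())
}

fn main() {
    assert_eq!(lanes_128::<u8>(), 16); // 16 bytes per operation
    assert_eq!(lanes_128::<u32>(), 4); // only 4 u32s
    assert_eq!(lanes_128::<u64>(), 2); // only 2 u64s
    println!("a u8 vector is {}x wider than a u64 vector",
             lanes_128::<u8>() / lanes_128::<u64>());
}
```

So storing values that fit in a `u8` as `u64` costs an 8x factor in both space and SIMD throughput.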

Pros and Cons

  • Pro: a cheaper way to get data parallelism than spawning threads, since one instruction operates on many values within a single thread.
  • Con: data generally needs to be 16-byte aligned when loading or storing to and from 128-bit registers (unaligned loads exist on most ISAs but can be slower).

Alignment

Rust will usually align primitives to their size, but the layout of composite types is unspecified by default. We can use the repr(packed(N)) or repr(align(N)) attributes to express constraints on alignment: packed lowers a type's alignment (removing padding), while align raises it.
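A small sketch of what these attributes do to layout (struct names are invented for the example; the sizes shown assume a typical platform where u32 is 4-byte aligned):

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
struct DefaultLayout {
    a: u8,
    b: u32, // u32 is 4-byte aligned, so 3 padding bytes follow `a`
}

#[allow(dead_code)]
#[repr(packed)]
struct Packed {
    a: u8,
    b: u32, // no padding; `b` may land at a misaligned address
}

#[repr(align(16))]
struct Aligned16 {
    data: [u8; 12], // whole struct aligned to 16 bytes, e.g. for SIMD loads
}

fn main() {
    assert_eq!(size_of::<DefaultLayout>(), 8); // padding included
    assert_eq!(size_of::<Packed>(), 5);        // padding removed
    assert_eq!(align_of::<Aligned16>(), 16);
    assert_eq!(size_of::<Aligned16>(), 16);    // size rounds up to alignment
}
```

The `align(16)` form is the one relevant to SIMD: it guarantees the aligned loads and stores mentioned above are safe to use on the struct's data.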

Single Threaded Performance

Usually, decreasing latency is hard and requires domain-specific knowledge and tweaks.

Case Study: Stream VByte

Stream VByte encodes each integer in a variable number of bytes (one to four), which is harder to decode but often uses fewer bytes, saving space and memory bandwidth. Separate control bytes tell us how many data bytes each integer actually occupies: two bits per integer, so one control byte describes four integers. The decoder then uses a SIMD byte-shuffle (driven by a lookup table indexed by the control byte) to scatter the variable-length data bytes into fixed, known positions in a SIMD register, producing four 32-bit integers at once.
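A scalar sketch of the control-byte logic (the real decoder replaces the inner loops with a single SIMD shuffle; the function name and the choice of putting the first integer's code in the low bits are assumptions of this sketch, not taken from the source):

```rust
/// Decode one quad of integers from a Stream VByte-style stream.
/// `control` packs four 2-bit length codes; code k means k+1 data bytes.
/// Returns the four decoded values and the number of data bytes consumed.
fn decode_quad(control: u8, data: &[u8]) -> ([u32; 4], usize) {
    let mut out = [0u32; 4];
    let mut pos = 0;
    for i in 0..4 {
        let len = ((control >> (2 * i)) & 0b11) as usize + 1; // 1..=4 bytes
        let mut value = 0u32;
        for j in 0..len {
            value |= (data[pos + j] as u32) << (8 * j); // little-endian bytes
        }
        out[i] = value;
        pos += len;
    }
    (out, pos)
}

fn main() {
    // Lengths 1, 1, 2, 1 -> 2-bit codes 0, 0, 1, 0 packed low-to-high.
    let control = 0b00_01_00_00;
    let data = [5, 200, 0x34, 0x12, 7];
    let (values, used) = decode_quad(control, &data);
    assert_eq!(values, [5, 200, 0x1234, 7]);
    assert_eq!(used, 5); // 1 + 1 + 2 + 1 data bytes
}
```

Note that the lengths never influence branching here: the loop structure is identical for every quad, which is what makes the control flow regular.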

On realistic input, this performs well. This is because:

  • Control bytes are stored sequentially, so the CPU can always prefetch the next control byte; its location is predictable.
  • Data bytes are also sequential.
  • The shuffle step maps onto a single instruction on common SIMD instruction sets (e.g. pshufb on x86).
  • Control flow is regular: the decoder never branches on the length of an individual integer.