This is like hiring a call center to handle a large volume of support calls, all handled in the same way, i.e., doing the same operation on different chunks of data simultaneously. CUDA kernels follow a similar model.
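A minimal sketch of "the same operation on different chunks of data", in plain Rust. The function name add_slices and the constant LANES are made up for illustration. The code uses no explicit SIMD instructions; the fixed-size inner loop only gives the compiler an easy shape to vectorize, which it may or may not do depending on target and optimization level.

```rust
/// Adds two slices element-wise, handling four elements per step.
/// `add_slices` and `LANES` are hypothetical names used for this sketch.
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());

    // Four f32 values (4 * 32 bits) fill one 128-bit register.
    const LANES: usize = 4;

    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);
    let mut out_chunks = out.chunks_exact_mut(LANES);

    // Main loop: the same operation applied to every lane of each chunk.
    for ((ca, cb), co) in (&mut a_chunks).zip(&mut b_chunks).zip(&mut out_chunks) {
        for i in 0..LANES {
            co[i] = ca[i] + cb[i];
        }
    }

    // Tail: elements left over when the length is not a multiple of LANES.
    let ra = a_chunks.remainder();
    let rb = b_chunks.remainder();
    let ro = out_chunks.into_remainder();
    for i in 0..ra.len() {
        ro[i] = ra[i] + rb[i];
    }
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0];
    let b = [10.0, 20.0, 30.0, 40.0, 50.0];
    let mut out = [0.0; 5];
    add_slices(&a, &b, &mut out);
    println!("{:?}", out);
}
```

Splitting into a chunked main loop plus a scalar tail is the usual shape of hand-written SIMD code as well; only the inner loop body would change.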
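If we want to save space (and get more parallelism), we should use the smallest type that fits the data in question. This helps with parallelism because a vector register has a fixed width, so smaller elements mean more elements (lanes) per register: a 128-bit register holds four 32-bit values but sixteen 8-bit values, and a single instruction operates on all of them.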
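The arithmetic behind this is just register width divided by element width. A small sketch, assuming a 128-bit register; the helper name lanes and the constant REGISTER_BITS are made up for illustration:

```rust
use std::mem::size_of;

/// Assumed vector register width for this sketch (e.g. SSE or NEON).
const REGISTER_BITS: usize = 128;

/// Number of elements of type `T` that fit in one register.
/// `lanes` is a hypothetical helper, not a standard library function.
fn lanes<T>() -> usize {
    REGISTER_BITS / (size_of::<T>() * 8)
}

fn main() {
    println!("u8  lanes: {}", lanes::<u8>());
    println!("u16 lanes: {}", lanes::<u16>());
    println!("f32 lanes: {}", lanes::<f32>());
    println!("f64 lanes: {}", lanes::<f64>());
}
```

Going from f64 to f32 doubles the number of elements processed per instruction, provided the reduced precision is acceptable for the data.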
Pros and Cons
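- Pro: often a more efficient way to parallelize than threads, since it happens within a single thread and needs no synchronization.
- Con: data needs to be 16-byte aligned for aligned loads and stores to and from 128-bit registers. Unaligned load and store instructions exist, but they can be slower on some hardware.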
Alignment
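Rust will usually align primitives to their size, but this is platform-dependent and not guaranteed. We can use the #[repr(packed(N))] or #[repr(align(N))] attributes on a type to express constraints on alignment: packed(N) caps the alignment at N bytes, and align(N) raises it to at least N bytes.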
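A short sketch of both attributes, with the results checked through std::mem::align_of and std::mem::size_of. The type names Aligned16, Packed, and Natural are made up for illustration. The layout of the plain struct is left to the compiler, so its values are only printed, not asserted.

```rust
use std::mem::{align_of, size_of};

/// Forced to 16-byte alignment, as an aligned 128-bit load would need.
#[repr(align(16))]
struct Aligned16([f32; 4]);

/// Alignment capped at 1 byte, so no padding is inserted between fields.
#[allow(dead_code)]
#[repr(packed(1))]
struct Packed {
    tag: u8,
    value: u32,
}

/// Same fields with the default layout; the compiler chooses the padding.
#[allow(dead_code)]
struct Natural {
    tag: u8,
    value: u32,
}

fn main() {
    assert_eq!(align_of::<Aligned16>(), 16);
    assert_eq!(size_of::<Aligned16>(), 16);

    assert_eq!(align_of::<Packed>(), 1);
    assert_eq!(size_of::<Packed>(), 5); // 1 + 4 bytes, no padding

    // A value of an align(16) type always sits at a multiple of 16.
    let v = Aligned16([0.0; 4]);
    let addr = &v as *const Aligned16 as usize;
    assert_eq!(addr % 16, 0);

    println!(
        "Natural: size {}, align {}",
        size_of::<Natural>(),
        align_of::<Natural>()
    );
}
```

Packed layouts save space but give up the alignment guarantee, so taking a reference to a misaligned field is rejected by the compiler; align(N) goes the other way and can add padding.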
Single Threaded Performance
Decreasing latency is usually hard and requires domain-specific knowledge and tweaks.