In November 2022, OpenAI introduced ChatGPT to the world. This spawned a field of ChatGPT wrappers riding the AI boom on VC money.
Models
Performance
Part of what makes GPT-3, GPT-4, etc. better at producing output that matches our expectations is pre-training. One factor that matters for how good a model is here is its parameter count: more parameters is typically better, but requires more computational and memory resources.
In the context of transformers, there are three main groups of optimizations that we need to do:
- Tensor contractions
- Statistical normalization
- Element-wise operations
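A minimal NumPy sketch of one operation from each group, using made-up shapes for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # a tiny "batch" of token embeddings
W = rng.standard_normal((8, 8))   # a weight matrix

# 1. Tensor contraction: the matmuls inside attention and MLP layers
h = x @ W

# 2. Statistical normalization: softmax / layer-norm style reductions
h = (h - h.mean(axis=-1, keepdims=True)) / h.std(axis=-1, keepdims=True)

# 3. Element-wise operation: activations applied independently per value
out = np.maximum(h, 0.0)          # ReLU as a simple example
```

The contractions are typically where most of the FLOPs go; the other two groups tend to be limited by memory bandwidth rather than compute.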
Optimizations
We want to optimize either for a model that gives answers quickly (inference) or for generating / training models efficiently.
In terms of making a fast model, some systems speculate: they guess likely upcoming output with a cheaper mechanism and pre-compute it, only paying the full model's cost when the guess turns out wrong.
An optimization we can make is adjusting the batch size. There is a balance here: a larger batch size uses more memory but can complete the run in less time.
Gradient Accumulation
To get around memory limitations, we can use gradient accumulation: calculate gradients over small micro-batches and sum them, stepping the optimizer only once a full batch's worth has accumulated. Note that too large an effective batch size may hurt generalization.
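A minimal sketch of why this works, using a scalar linear model y ≈ w·x with squared-error loss (the data and names are purely illustrative):

```python
# Gradient accumulation: the full-batch gradient equals the weighted
# sum of micro-batch gradients, so we never need the whole batch in
# memory at once.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

def grad(w, xs, ys):
    # d/dw mean((w*x - y)^2) = mean(2*(w*x - y)*x)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Full-batch gradient in one shot (needs all 4 samples at once).
full = grad(w, xs, ys)

# Accumulate over micro-batches of 2, weighting each by its share of
# the batch, and only "step" after the whole batch has been seen.
acc = 0.0
for i in range(0, 4, 2):
    acc += grad(w, xs[i:i+2], ys[i:i+2]) * (2 / 4)

assert abs(full - acc) < 1e-9  # same gradient, lower peak memory
```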
There is also the concept of gradient checkpointing, which trades compute for memory: instead of storing every activation for the backward pass, it stores only some and recomputes the rest, increasing compute time to save memory.
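The tradeoff can be illustrated on a toy chain of squaring "layers" (everything here is a hand-rolled sketch, not a real autograd API):

```python
# Gradient checkpointing sketch: y = f(f(...f(x))) with f(x) = x^2.
# Standard backprop stores all N activations; checkpointing stores
# every 4th one and recomputes each segment during the backward pass.

N = 8
x0 = 1.1

def f(x):  return x * x
def df(x): return 2 * x           # derivative of f

# Standard backprop: memory ~ N activations.
acts = [x0]
for _ in range(N):
    acts.append(f(acts[-1]))
g_full = 1.0
for x in reversed(acts[:-1]):     # chain rule, back to front
    g_full *= df(x)

# Checkpointing: memory ~ N/4 checkpoints + one segment at a time.
ckpts = {0: x0}
x = x0
for i in range(1, N + 1):
    x = f(x)
    if i % 4 == 0:
        ckpts[i] = x
g_ckpt = 1.0
for seg_end in range(N, 0, -4):
    xs = [ckpts[seg_end - 4]]     # recompute this segment's activations
    for _ in range(4):
        xs.append(f(xs[-1]))
    for x in reversed(xs[:-1]):
        g_ckpt *= df(x)
```

Both loops produce the same gradient; the checkpointed version just runs the forward pass for each segment a second time.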
Other decisions
- Mixed precision: use lower-precision types (e.g., 16-bit floats) for most operations, keeping numerically sensitive ones in full precision
- Data preloading: use background threads/workers to get data to the GPU faster so it never sits idle
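The data-preloading idea can be sketched with the standard library alone (a stand-in for a real framework's data loader; `load_batch` is a hypothetical placeholder for I/O and preprocessing):

```python
# Data preloading: a background thread fills a bounded queue with
# ready batches while the "training loop" consumes them, so loading
# overlaps with compute instead of stalling it.
import queue
import threading

def load_batch(i):
    # placeholder for disk/network reads + preprocessing
    return [i * 10 + j for j in range(4)]

q = queue.Queue(maxsize=2)        # small buffer of prefetched batches
SENTINEL = None

def producer(n_batches):
    for i in range(n_batches):
        q.put(load_batch(i))      # blocks if the buffer is full
    q.put(SENTINEL)               # signal that we're done

t = threading.Thread(target=producer, args=(3,), daemon=True)
t.start()

batches = []
while (batch := q.get()) is not SENTINEL:
    batches.append(batch)         # here you'd move it to the GPU and train
t.join()
```

The bounded queue is the key design choice: it caps memory use while still letting the loader stay a couple of batches ahead.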
At the end of the day, there are also tradeoffs that we need to make:
- Accuracy for time
- Memory for compute
- Overfitting vs. underfitting