In November 2022, OpenAI introduced ChatGPT to the world. This spawned a wave of "ChatGPT wrapper" startups riding the AI boom on VC money.

Models

Performance

Part of what makes GPT-3, GPT-4, and later models better at producing output that matches our expectations is pre-training. One factor that matters for how good a model is here is the parameter count: more parameters are typically better, but require more compute and memory.

In the context of transformers, there are three main groups of optimizations that we need to do:

  1. Tensor contractions
  2. Statistical normalizations
  3. Element-wise operations
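These three groups can be sketched with NumPy on toy tensors. The shapes, the softmax, and the ReLU below are illustrative choices, not tied to any particular architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # a small activation "batch"
w = rng.standard_normal((8, 8)).astype(np.float32)   # a weight matrix

# 1. Tensor contraction: the matrix multiplies inside attention and MLP layers.
y = x @ w

# 2. Statistical normalization: softmax row-normalizes scores to probabilities.
scores = y - y.max(axis=-1, keepdims=True)           # subtract max for stability
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# 3. Element-wise operation: an activation function applied per element.
activated = np.maximum(probs @ w, 0.0)               # ReLU

print(probs.sum(axis=-1))  # each row sums to 1
```

Contractions dominate FLOPs, while normalizations and element-wise ops are typically memory-bandwidth bound, which is why they are optimized differently.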

Optimizations

We want to optimize for a model that gives answers quickly, or to generate and train models efficiently.

One way to make a model feel fast is speculation: the system predicts likely continuations of your prompt and pre-computes responses before they are actually needed.
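One concrete form of this is speculative decoding: a cheap "draft" model proposes several tokens ahead, and the expensive "target" model verifies them, keeping the longest agreeing prefix. The toy sketch below uses hypothetical lookup tables in place of real models:

```python
# Both "models" here are stand-in lookup tables, not real LLMs.
DRAFT = {"the": "cat", "cat": "sat", "sat": "on", "on": "a"}      # fast, sometimes wrong
TARGET = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}   # slow, authoritative

def speculate(prompt: str, k: int = 4) -> list[str]:
    # Draft model cheaply proposes up to k tokens ahead.
    out, cur = [], prompt
    for _ in range(k):
        cur = DRAFT.get(cur)
        if cur is None:
            break
        out.append(cur)
    return out

def verify(prompt: str, proposed: list[str]) -> list[str]:
    # Target model checks each proposed token; keep the agreeing prefix,
    # then substitute its own answer at the first mismatch.
    accepted, cur = [], prompt
    for tok in proposed:
        truth = TARGET.get(cur)
        if truth != tok:
            if truth is not None:
                accepted.append(truth)
            return accepted
        accepted.append(tok)
        cur = tok
    return accepted

tokens = verify("the", speculate("the"))
print(tokens)  # ['cat', 'sat', 'on', 'the']
```

When the draft model agrees often, the target model validates several tokens per pass instead of generating one at a time, which is where the speedup comes from.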

Another optimization we can make is adjusting the batch size. Choose it to balance two constraints: using an appropriate amount of memory and completing your run in a reasonable amount of time.
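A back-of-envelope estimate can guide the choice. Every number below (sequence length, hidden size, layer count, bytes per value, memory budget) is an illustrative assumption, not a measurement of any specific model:

```python
# Rough activation-memory estimate for picking a batch size.
seq_len = 2048        # tokens per sequence (assumed)
hidden = 4096         # hidden dimension (assumed)
layers = 32           # transformer blocks (assumed)
bytes_per_val = 2     # fp16/bf16

def activation_bytes(batch_size: int) -> int:
    # Crude per-layer footprint: one (batch, seq, hidden) tensor per layer.
    # Real footprints are larger (attention scores, wider MLP activations).
    return batch_size * seq_len * hidden * layers * bytes_per_val

budget = 40 * 1024**3  # e.g. a 40 GiB accelerator (assumed)
batch = 1
while activation_bytes(batch * 2) <= budget:
    batch *= 2
print(batch, activation_bytes(batch) / 1024**3)  # largest power-of-two batch that fits
```

The point is the shape of the reasoning, not the constants: activation memory scales linearly with batch size, so doubling the batch until the budget is exhausted is a quick first pass.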

Gradient Accumulation

To get around memory limitations, we can use gradient accumulation: compute gradients over small micro-batches and sum them, rather than holding a whole batch in memory at once. Note that too large an effective batch size may still hurt generalization (overfitting).
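A minimal sketch of the idea, using a least-squares model in NumPy rather than a neural network: summing per-micro-batch gradients reproduces the full-batch gradient exactly, while only one micro-batch needs to be resident at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))   # 32 examples, 4 features
y = rng.standard_normal(32)
w = np.zeros(4)                    # model parameters

def grad(Xb, yb, w):
    # Gradient of squared error, summed (not averaged) over examples,
    # so micro-batch gradients add up exactly.
    return Xb.T @ (Xb @ w - yb)

# Full-batch gradient in one shot.
full = grad(X, y, w)

# The same gradient, accumulated over micro-batches of 8.
accum = np.zeros_like(w)
for i in range(0, len(X), 8):
    accum += grad(X[i:i+8], y[i:i+8], w)

print(np.allclose(full, accum))  # True
```

In a real training loop the optimizer steps once per accumulated batch, with the loss scaled by the number of micro-batches if it is averaged rather than summed.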

There is also the concept of gradient checkpointing, which trades extra compute time for memory savings: instead of storing every activation from the forward pass, only checkpoints are kept, and the rest are recomputed during the backward pass.
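A toy sketch of the mechanism, on a stack of scalar tanh "layers" rather than a real network: the forward pass stores activations only at checkpoint boundaries, and the backward pass recomputes each segment from its nearest checkpoint before chaining derivatives.

```python
import math

def layer(x):           # one "layer": a simple smooth function
    return math.tanh(x)

def layer_grad(x):      # its derivative, evaluated at the layer's input
    return 1.0 - math.tanh(x) ** 2

def forward_checkpointed(x0, n_layers, every=4):
    # Store inputs only at checkpoint boundaries, not every layer.
    checkpoints = {0: x0}
    x = x0
    for i in range(n_layers):
        x = layer(x)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def backward(x0, n_layers, checkpoints, every=4):
    # Recompute activations from the nearest checkpoint (extra compute),
    # then chain derivatives backwards through the whole stack.
    g = 1.0
    for i in reversed(range(n_layers)):
        start = (i // every) * every
        x = checkpoints[start]
        for _ in range(start, i):   # recompute up to layer i's input
            x = layer(x)
        g *= layer_grad(x)
    return g

out, ckpts = forward_checkpointed(0.5, 8)
g = backward(0.5, 8, ckpts, every=4)
```

With checkpoints every `k` layers, stored activations drop by roughly a factor of `k` at the cost of about one extra forward pass of recomputation.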

Other decisions

  • Mixed precision: use lower-precision types (e.g. fp16/bf16) where accuracy allows
  • Data preloading: use multiple worker threads to feed data to the GPU faster
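The mixed-precision point can be seen in a few lines of NumPy: low-precision storage halves memory and bandwidth, but a low-precision accumulator silently drops small contributions, which is why mixed-precision training keeps accumulators (and master weights) in float32.

```python
import numpy as np

# 10,000 small values stored in fp16 -- cheap in memory and bandwidth.
vals = np.full(10_000, 0.0001, dtype=np.float16)

# Accumulating in fp16: once the running sum is large enough, adding
# 0.0001 rounds away entirely and the sum stalls.
naive = np.float16(0.0)
for v in vals:
    naive = np.float16(naive + v)

# Mixed precision: keep the accumulator in fp32.
mixed = vals.astype(np.float32).sum()

print(float(naive), float(mixed))  # the fp16 sum stalls well below the true ~1.0
```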

At the end of the day, there are also tradeoffs that we need to make:

  • Accuracy for time (e.g. mixed precision)
  • Memory for compute (e.g. gradient checkpointing)
  • Overfitting vs. underfitting