In November 2022, OpenAI introduced ChatGPT to the world. This spawned a field of ChatGPT wrappers riding the AI boom on VC money.
Models
Performance
Part of what makes GPT-3, GPT-4, etc. better at producing output that matches our expectations is pre-training. One factor that matters for how good a model is here is its parameter count: more parameters is typically better, but requires more computational and memory resources.
In the context of transformers, there are three main groups of optimizations that we need to do:
- Tensor contractions
- Statistical normalization
- Element-wise operations
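A minimal NumPy sketch of one operation from each group, using made-up shapes for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # a tiny "batch" of token embeddings
W = rng.standard_normal((8, 8))   # a weight matrix

# 1. Tensor contraction: the matmuls inside attention and MLP layers
h = x @ W

# 2. Statistical normalization: softmax / layer-norm style reductions
h = (h - h.mean(axis=-1, keepdims=True)) / h.std(axis=-1, keepdims=True)

# 3. Element-wise operation: activations applied independently per value
out = np.maximum(h, 0.0)          # ReLU as a simple example
```

The contractions are typically where most of the FLOPs go; the other two groups tend to be limited by memory bandwidth rather than compute.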
Optimizations
We want to optimize either for a model that gives answers quickly (inference) or for generating / training models efficiently.
In terms of making a fast model, some systems speculate: they guess likely upcoming output with a cheaper mechanism and pre-compute it, only paying the full model's cost when the guess turns out wrong.
An optimization we can make is adjusting the batch size. There is a balance here: a larger batch size uses more memory but can complete the run in less time.
Gradient Accumulation
To get around memory limitations, we can use gradient accumulation: calculate gradients over small micro-batches and sum them, stepping the optimizer only once a full batch's worth has accumulated. Note that too large an effective batch size may hurt generalization.
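A minimal sketch of why this works, using a scalar linear model y ≈ w·x with squared-error loss (the data and names are purely illustrative):

```python
# Gradient accumulation: the full-batch gradient equals the weighted
# sum of micro-batch gradients, so we never need the whole batch in
# memory at once.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

def grad(w, xs, ys):
    # d/dw mean((w*x - y)^2) = mean(2*(w*x - y)*x)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Full-batch gradient in one shot (needs all 4 samples at once).
full = grad(w, xs, ys)

# Accumulate over micro-batches of 2, weighting each by its share of
# the batch, and only "step" after the whole batch has been seen.
acc = 0.0
for i in range(0, 4, 2):
    acc += grad(w, xs[i:i+2], ys[i:i+2]) * (2 / 4)

assert abs(full - acc) < 1e-9  # same gradient, lower peak memory
```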
There is also the concept of gradient checkpointing, which trades compute for memory: instead of storing every activation for the backward pass, it stores only some and recomputes the rest, increasing compute time to save memory.
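The tradeoff can be illustrated on a toy chain of squaring "layers" (everything here is a hand-rolled sketch, not a real autograd API):

```python
# Gradient checkpointing sketch: y = f(f(...f(x))) with f(x) = x^2.
# Standard backprop stores all N activations; checkpointing stores
# every 4th one and recomputes each segment during the backward pass.

N = 8
x0 = 1.1

def f(x):  return x * x
def df(x): return 2 * x           # derivative of f

# Standard backprop: memory ~ N activations.
acts = [x0]
for _ in range(N):
    acts.append(f(acts[-1]))
g_full = 1.0
for x in reversed(acts[:-1]):     # chain rule, back to front
    g_full *= df(x)

# Checkpointing: memory ~ N/4 checkpoints + one segment at a time.
ckpts = {0: x0}
x = x0
for i in range(1, N + 1):
    x = f(x)
    if i % 4 == 0:
        ckpts[i] = x
g_ckpt = 1.0
for seg_end in range(N, 0, -4):
    xs = [ckpts[seg_end - 4]]     # recompute this segment's activations
    for _ in range(4):
        xs.append(f(xs[-1]))
    for x in reversed(xs[:-1]):
        g_ckpt *= df(x)
```

Both loops produce the same gradient; the checkpointed version just runs the forward pass for each segment a second time.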
Other decisions
- Mixed precision: use lower-precision types (e.g., 16-bit floats) for most operations, keeping numerically sensitive ones in full precision
- Data preloading: use background threads/workers to get data to the GPU faster so it never sits idle
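The data-preloading idea can be sketched with the standard library alone (a stand-in for a real framework's data loader; `load_batch` is a hypothetical placeholder for I/O and preprocessing):

```python
# Data preloading: a background thread fills a bounded queue with
# ready batches while the "training loop" consumes them, so loading
# overlaps with compute instead of stalling it.
import queue
import threading

def load_batch(i):
    # placeholder for disk/network reads + preprocessing
    return [i * 10 + j for j in range(4)]

q = queue.Queue(maxsize=2)        # small buffer of prefetched batches
SENTINEL = None

def producer(n_batches):
    for i in range(n_batches):
        q.put(load_batch(i))      # blocks if the buffer is full
    q.put(SENTINEL)               # signal that we're done

t = threading.Thread(target=producer, args=(3,), daemon=True)
t.start()

batches = []
while (batch := q.get()) is not SENTINEL:
    batches.append(batch)         # here you'd move it to the GPU and train
t.join()
```

The bounded queue is the key design choice: it caps memory use while still letting the loader stay a couple of batches ahead.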
At the end of the day, there are also tradeoffs that we need to make:
- Accuracy for time
- Memory for compute
- Overfitting vs. underfitting