Chapter 2 · Models¶

This is the chapter where inference stops being a black box. By the end you will be able to take a single token, trace it through every matrix multiply in a transformer, and explain exactly which step is slow and why.

We cover two model families:

LLMs — autoregressive token generators. They produce one token at a time, each conditioned on every token before it.
Image and video models — iterative denoisers. They start from noise and refine a whole canvas over many steps.

They look unrelated, but at the bottom they're the same machinery — matrix multiplies and attention — arranged differently. Understanding one makes the other easy.

Learning objectives¶

By the end of this chapter you can:

Explain why a neural network needs non-linear activations (and what breaks without them)
Trace a token: text → token id → embedding → transformer blocks → logits → next token
Describe what the KV cache is, why it exists, and what it costs
State why prefill is compute-bound and decode is memory-bound — and prove it with arithmetic intensity
Read a model's config.json and predict its memory footprint and bottleneck

The sections¶

Neural Networks

Nodes, layers, matmul, and the one trick (non-linearity) that makes depth worth anything.
LLM Inference Mechanics

Tokenization, embeddings, the transformer block, attention, the KV cache, and the prefill/decode split. The core of the book.
Image & Video Generation

Diffusion, latent space, the VAE, classifier-free guidance, and why a "50-step" image is actually 100 forward passes.
Calculating Bottlenecks

Ops:byte ratio, arithmetic intensity, and the roofline model — the math that tells you whether to buy compute or bandwidth.

The one idea to hold onto

Everything in inference is a fight between two resources: how fast the GPU can do math (compute) and how fast it can move numbers in and out of memory (bandwidth). Every technique in Chapter 5 is a move in that fight. This chapter teaches you to see which one you're losing.