Chapter 2 · Models

This is the chapter where inference stops being a black box. By the end you will be able to take a single token, trace it through every matrix multiply in a transformer, and explain exactly which step is slow and why.

We cover two model families:

  • LLMs — autoregressive token generators. They produce one token at a time, each conditioned on every token before it.
  • Image and video models — iterative denoisers. They start from noise and refine a whole canvas over many steps.

They look unrelated, but underneath they are the same machinery, matrix multiplies and attention, arranged differently. Once you understand one, the other is much easier to learn.
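
To make "arranged differently" concrete, here is a minimal sketch of the two loop shapes. The functions `toy_next_token` and `toy_denoise_step` are hypothetical stand-ins for a real model's forward pass, chosen only so the example runs. The loops around them are the point.

```python
def generate_tokens(next_token, prompt, n_new):
    """Autoregressive loop: each new token is computed from all tokens so far."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))  # the model sees the full history
    return tokens


def denoise(step_fn, canvas, n_steps):
    """Iterative denoising loop: the whole canvas is refined at every step."""
    for step in range(n_steps):
        canvas = step_fn(canvas, step)  # every value is updated together
    return canvas


# Hypothetical stand-ins for a model forward pass (not real models).
def toy_next_token(tokens):
    return sum(tokens) % 10


def toy_denoise_step(canvas, step):
    return [value / 2 for value in canvas]


tokens = generate_tokens(toy_next_token, [1, 2, 3], n_new=3)
canvas = denoise(toy_denoise_step, [8.0, -4.0], n_steps=3)
print(tokens)
print(canvas)
```

The first loop grows a sequence by one element per forward pass. The second keeps the output size fixed and spends its forward passes improving all of it at once.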

Learning objectives

By the end of this chapter you will be able to:

  • Explain why a neural network needs non-linear activations (and what breaks without them)
  • Trace a token: text → token id → embedding → transformer blocks → logits → next token
  • Describe what the KV cache is, why it exists, and what it costs
  • State why prefill is compute-bound and decode is memory-bound — and prove it with arithmetic intensity
  • Read a model's config.json and predict its memory footprint and bottleneck
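
As a preview of the last objective, the sketch below estimates weight memory and KV cache cost from a config. The key names follow the Llama-style config.json used by Hugging Face. The values are illustrative numbers resembling a 7B model, not read from a real file. It assumes a gated MLP with three weight matrices, an untied output head, 16-bit values, and it ignores the small normalization weights.

```python
# Illustrative values resembling a 7B Llama-style model (an assumption,
# not loaded from an actual config.json).
config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 32,
    "intermediate_size": 11008,
    "vocab_size": 32000,
}


def param_count(cfg):
    """Approximate parameter count, ignoring normalization weights."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim
    attn = 2 * h * h + 2 * h * kv_dim        # Q and O, plus K and V projections
    mlp = 3 * h * cfg["intermediate_size"]   # gate, up, down (gated MLP assumed)
    embed = 2 * cfg["vocab_size"] * h        # input embedding + untied output head
    return cfg["num_hidden_layers"] * (attn + mlp) + embed


def kv_bytes_per_token(cfg, bytes_per_value=2):
    """Bytes of KV cache added per token: one K and one V vector per layer."""
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim
    return 2 * cfg["num_hidden_layers"] * kv_dim * bytes_per_value


weights_gb = param_count(config) * 2 / 1e9          # 2 bytes per 16-bit weight
kv_mib_per_token = kv_bytes_per_token(config) / 2**20
print(f"parameters: {param_count(config):,}")
print(f"weights at 16-bit: {weights_gb:.1f} GB")
print(f"KV cache: {kv_mib_per_token:.2f} MiB per token")
```

With these numbers the weights come to roughly 13.5 GB and each cached token costs 0.5 MiB, so a 4,096-token context adds about 2 GiB on top of the weights.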

The sections

  • Neural Networks

    Nodes, layers, matmul, and the one trick (non-linearity) that makes depth worth anything.

  • LLM Inference Mechanics

    Tokenization, embeddings, the transformer block, attention, the KV cache, and the prefill/decode split. The core of the book.

  • Image & Video Generation

    Diffusion, latent space, the VAE, classifier-free guidance, and why a "50-step" image is actually 100 forward passes (guidance runs the model twice per step).

  • Calculating Bottlenecks

    Ops:byte ratio, arithmetic intensity, and the roofline model — the math that tells you whether to buy compute or bandwidth.
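
The "one trick" named under Neural Networks can be checked by hand. The sketch below uses small made-up integer matrices and plain Python lists. It shows that two stacked linear layers equal one layer with the pre-multiplied weights, and that putting a ReLU between them breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise max(0, v), the non-linearity."""
    return [[max(0, v) for v in row] for row in m]


# Made-up example values, chosen so the ReLU zeroes one entry.
x = [[1, 1]]                 # one input as a row vector
w1 = [[1, -2], [-3, 4]]      # first layer weights
w2 = [[1, -1], [-1, 1]]      # second layer weights

two_linear = matmul(matmul(x, w1), w2)        # layer 1 then layer 2
collapsed = matmul(x, matmul(w1, w2))         # a single layer with weights w1 @ w2
with_relu = matmul(relu(matmul(x, w1)), w2)   # non-linearity between the layers

print(two_linear, collapsed, with_relu)
```

Without the activation, depth adds nothing: any stack of linear layers can be replaced by one matrix. With it, the two-layer network computes something no single matrix can reproduce for all inputs.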

The one idea to hold onto

Everything in inference is a fight between two resources: how fast the GPU can do math (compute) and how fast it can move numbers in and out of memory (bandwidth). Every technique in Chapter 5 is a move in that fight. This chapter teaches you to tell which of the two is limiting you.
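
A rough way to see that fight in numbers is to compare the arithmetic intensity of a single weight matmul against the hardware's ops:byte ratio. The GPU figures below (300 TFLOP/s, 2 TB/s) are hypothetical round numbers, not a specific product, and the model is simplified to one 4096 x 4096 weight matrix in 16-bit precision. Attention and every other layer are ignored.

```python
def arithmetic_intensity(m, k, n, bytes_per_value=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n                                     # one multiply + one add per term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_value   # read inputs and weights, write output
    return flops / bytes_moved


# Hypothetical hardware numbers, for illustration only.
PEAK_FLOPS = 300e12        # FLOP/s
PEAK_BANDWIDTH = 2e12      # bytes/s
OPS_PER_BYTE = PEAK_FLOPS / PEAK_BANDWIDTH


def bottleneck(m, k=4096, n=4096):
    """Classify a matmul with m input rows (tokens processed at once)."""
    if arithmetic_intensity(m, k, n) > OPS_PER_BYTE:
        return "compute-bound"
    return "memory-bound"


for label, m in [("decode, 1 token", 1), ("prefill, 1024 tokens", 1024)]:
    intensity = arithmetic_intensity(m, 4096, 4096)
    print(f"{label}: {intensity:.1f} FLOPs/byte, {bottleneck(m)}")
```

Under these assumptions, decoding one token does about 1 FLOP per byte moved, far below the ratio of 150, so the GPU waits on memory. Processing 1,024 prompt tokens at once reuses each weight 1,024 times and reaches roughly 683 FLOPs per byte, so the math units become the limit.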