
Chapter 5 · Techniques

Chapters 2–4 built the picture: what a forward pass does, where it's bottlenecked, the hardware that runs it, and the software that drives it. This chapter is the craft — the applied research you actually deploy to move a workload's point on the roofline.

One organizing principle runs through everything here:

The more constraints you can introduce, the more performance you can extract

A general system that tries to handle every case handles none of them optimally. Every technique in this chapter narrows the problem (fix the precision, assume a draft model is usually right, assume prompts share a prefix, pin prefill and decode to separate hardware) and trades the lost generality for speed. Inference engineering is largely the art of choosing which constraints your traffic can afford.

A second principle stacks on top: the more traffic you have, the more techniques pay off. Higher parallelism, KV-aware routing, and disaggregation only earn their complexity at volume — many GPUs, often many nodes, serving one model. At low traffic they're overhead.

The five categories

| § | Technique | What it attacks | Lossy? |
|---|---|---|---|
| 5.1 | Quantization | bytes per weight/value → both phases faster | yes (managed) |
| 5.2 | Speculative decoding | decode's idle compute → higher TPS | no |
| 5.3 | Caching | redundant prefill → lower TTFT | no |
| 5.4 | Parallelism | model/KV too big for one GPU → fit + speed | no |
| 5.5 | Disaggregation | prefill/decode fighting for one GPU → specialize | no |

Quantization is the only lossy one; every other technique is exact by design (it changes how or where the computation runs, not the result it is meant to produce). If you work in a quality-critical domain and can't risk any output change, you still have four of the five tools available.
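To make "lossy" concrete, here is a minimal sketch of a symmetric, per-tensor INT8 round trip. The weight values and the helper names (`quantize_int8`, `dequantize`) are made up for illustration; production schemes typically use finer-grained scales. The point is only that the restored values differ from the originals, which is the loss the table refers to.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (illustrative helper).

    Returns integer codes in [-127, 127] and the scale that maps them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only to show the effect.
weights = [0.021, -0.34, 0.5, 0.0071, -0.12]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Per-element round-trip error; rounding bounds it by half a quantization step.
errors = [abs(w - r) for w, r in zip(weights, restored)]
max_error = max(errors)

print("codes:    ", codes)
print("restored: ", [round(r, 5) for r in restored])
print("max error:", max_error, "(bound:", scale / 2, ")")
```

The error is small but nonzero, and it is bounded by half the step size `scale / 2`. Managing quantization means choosing formats and scales so that this error stays below what the model's quality can tolerate.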

Techniques interact — sometimes they fight

Optimizations are not independent. Some are symbiotic: quantizing the KV cache makes disaggregation cheaper (less to transfer) and caching denser (more fits in memory). Some are antagonistic: raising batch size to feed quantization's throughput starves speculative decoding of the spare compute it needs. The goal is a balanced set that delivers more than the sum of its parts — which is why these are knobs to tune, not boxes to check.
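The symbiotic case can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard KV-cache size estimate (one key and one value vector per layer, per KV head, per token). The model shape and the 16 GiB budget are hypothetical, and the estimate ignores any extra storage for quantization scale factors.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Approximate KV-cache size for `tokens` tokens of context.

    The leading 2 accounts for storing both a key and a value tensor.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def tokens_that_fit(budget_bytes, layers, kv_heads, head_dim, bytes_per_value):
    """How many tokens of KV cache fit in a given memory budget."""
    per_token = kv_cache_bytes(1, layers, kv_heads, head_dim, bytes_per_value)
    return budget_bytes // per_token


# Hypothetical model shape, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128)
tokens = 8192

fp16 = kv_cache_bytes(tokens, bytes_per_value=2, **shape)
fp8 = kv_cache_bytes(tokens, bytes_per_value=1, **shape)

budget = 16 * 2**30  # assume 16 GiB is left over for KV cache
fit_fp16 = tokens_that_fit(budget, bytes_per_value=2, **shape)
fit_fp8 = tokens_that_fit(budget, bytes_per_value=1, **shape)

print(f"KV for {tokens} tokens: {fp16 / 2**20:.0f} MiB at 16-bit, {fp8 / 2**20:.0f} MiB at 8-bit")
print(f"Tokens cached in 16 GiB: {fit_fp16} at 16-bit, {fit_fp8} at 8-bit")
```

Under these assumptions, halving the bytes per value halves what a disaggregated deployment must transfer from prefill to decode, and doubles how many tokens of reusable prefix fit in the same cache.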

Learning objectives

By the end of this chapter you can:

  • Read a number format (E4M3, MXFP8, INT4) and explain its dynamic-range/precision trade
  • Order model components by quantization sensitivity and justify the order
  • Explain why speculative decoding improves TPS but never TTFT, and what caps its benefit
  • Compute KV-cache reuse from a prompt's structure and lay out context to maximize cache hits
  • Size the minimum GPU count for a model and pick TP vs EP vs PP for the situation
  • Decide whether a workload justifies disaggregation, and read xPyD deployment notation
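As a preview of the first objective, the sketch below reads a floating-point format name such as E4M3 (4 exponent bits, 3 mantissa bits) and derives its range and relative precision. The function name and the `ieee_style` flag are illustrative. It assumes the common FP8 conventions: E5M2 reserves the top exponent code for infinities and NaN, IEEE style, while E4M3 is finite-only and gives up just one bit pattern for NaN.

```python
def fp_format_summary(exp_bits, man_bits, ieee_style=True):
    """Derive range and precision from exponent/mantissa bit counts.

    ieee_style=True : top exponent code is reserved for inf/NaN.
    ieee_style=False: finite-only variant; the top exponent code holds
                      normal values and only the all-ones mantissa is NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_style:
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2 ** -man_bits        # mantissa all ones
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2 * 2 ** -man_bits    # largest mantissa that is not NaN
    return {
        "max_value": max_frac * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "rel_step": 2.0 ** -man_bits,        # spacing relative to a power of two
    }


e4m3 = fp_format_summary(4, 3, ieee_style=False)
e5m2 = fp_format_summary(5, 2, ieee_style=True)

print("E4M3:", e4m3)
print("E5M2:", e5m2)
```

The trade is visible in the two results: E5M2 spends a bit on exponent and reaches a much larger maximum value, while E4M3 spends it on mantissa and halves the relative step between neighboring values.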