Chapter 5 · Techniques¶
Chapters 2–4 built the picture: what a forward pass does, where it's bottlenecked, the hardware that runs it, and the software that drives it. This chapter is the craft — the applied research you actually deploy to move a workload's point on the roofline.
One organizing principle runs through everything here:
The more constraints you can introduce, the more performance you can extract
A general system that handles every case optimally handles none. Every technique in this chapter narrows the problem — fix the precision, assume a draft model is usually right, assume prompts share a prefix, pin prefill and decode to separate hardware — and trades that lost generality for speed. Inference engineering is largely the art of choosing which constraints your traffic can afford.
A second principle stacks on top: the more traffic you have, the more techniques pay off. Higher parallelism, KV-aware routing, and disaggregation only earn their complexity at volume — many GPUs, often many nodes, serving one model. At low traffic they're overhead.
The five categories¶
| § | Technique | What it attacks | Lossy? |
|---|---|---|---|
| 5.1 | Quantization | bytes per weight/value → both phases faster | yes (managed) |
| 5.2 | Speculative decoding | decode's idle compute → higher TPS | no |
| 5.3 | Caching | redundant prefill → lower TTFT | no |
| 5.4 | Parallelism | model/KV too big for one GPU → fit + speed | no |
| 5.5 | Disaggregation | prefill/decode fighting for one GPU → specialize | no |
Quantization is the only lossy one — every other technique is exact. If you work in a quality-critical domain and can't risk any output change, you still have four of five tools available.
Techniques interact — sometimes they fight
Optimizations are not independent. Some are symbiotic: quantizing the KV cache makes disaggregation cheaper (less to transfer) and caching denser (more fits in memory). Some are antagonistic: raising batch size to feed quantization's throughput starves speculative decoding of the spare compute it needs. The goal is a balanced set that delivers more than the sum of its parts — which is why these are knobs to tune, not boxes to check.
Learning objectives¶
By the end of this chapter you can:
- Read a number format (
E4M3,MXFP8,INT4) and explain its dynamic-range/precision trade - Order model components by quantization sensitivity and justify the order
- Explain why speculative decoding raises TPS but never TTFT, and what caps its benefit
- Compute KV-cache reuse from a prompt's structure and lay out context to maximize cache hits
- Size the minimum GPU count for a model and pick TP vs EP vs PP for the situation
- Decide whether a workload justifies disaggregation, and read
xPyDdeployment notation