Skip to content

Inference Engineering — Deep Dive

A from-the-ground-up guide to how generative-model inference actually works — and how to make it fast.

Most inference guides tell you what the knobs do. This one explains why they exist, so that when you hit a latency wall at 2 a.m. you can reason about it from first principles instead of pattern-matching on a blog post.

It assumes you can read code and know what an array, a number, and computer memory are. Every ML-specific term is defined the first time it appears, with an analogy and usually a tiny worked example. It is long on purpose.

Why this site exists

I wanted to understand the entire LLM-inference black box — not just operate the knobs, but know why every piece works the way it does, end to end. This site is me working that out in the open, going deeper on the points other guides skip, and sharing the learnings so they're useful to anyone else chasing the same understanding. — Max

How to read this

  • Foundations first

    Chapters 0–2 build the mental furniture: what inference is, the prerequisites, and the mechanics of a forward pass. Read these top-to-bottom.

  • Then the machine

    Chapters 3–4 cover the hardware (GPUs, memory hierarchy) and the software stack (CUDA, inference engines) that run the math.

  • Then the craft

    Chapters 5–6 are the techniques (quantization, speculative decoding, caching, parallelism) and the modalities (vision, audio, embeddings) where they're applied.

  • Then production

    Chapter 7 is autoscaling, cold starts, multi-cloud capacity, and observability.

The map

# Chapter What you'll be able to do
0 Inference Frame the problem: training vs inference, latency vs throughput
1 Prerequisites Pick a model, define your latency budget, measure it honestly
2 Models Trace a token through a transformer; find the bottleneck
3 Hardware Read a GPU spec sheet and predict performance
4 Software Choose and reason about an inference engine
5 Techniques Quantize, cache, speculate, and parallelize on purpose
6 Modalities Apply the techniques to vision, audio, and embeddings
7 Production Scale, deploy, and observe a real serving system

Status

This is still work in progress chapters are being added.

Sources & acknowledgements

This site is heavily based on two excellent books, which provided the structure and much of the source material. I reorganized, went deeper on the points I wanted to understand fully, and added my own worked examples, diagrams, and hands-on guides — but the foundations are theirs, and I'm grateful for both:

  • Inference Engineering — Philip Kiely (Baseten Books, 2026). The eight-chapter arc and the framing of this site follow the book; its companion site is inferenceengineering.tech.
  • Quantization and Fast Inference: A Practitioner's Guide to Efficient AI — Vivek Kalyanarangan (Manning, 2026). The basis for the deeper quantization material (number formats, the affine mapping, scale and zero-point).

This is an independent personal learning project — not affiliated with or endorsed by either author or publisher. If you want the canonical, authoritative treatments, read the books.