Chapter 0 · Inference¶
Scaffolded — not yet written to depth
This chapter is outlined below. Chapter 2 (Models) is the depth reference; we fill the rest in iteratively.
Inference is using a trained model to produce an output — as opposed to training, which is the (vastly more expensive, one-time) process of creating the model's weights. You train once; you run inference billions of times. That asymmetry is why inference engineering exists as a discipline.
Planned sections¶
- What inference is, and how it differs from training (compute profile, who pays the cost)
- The two metrics that govern everything: latency (time to a result) and throughput (results per second), and why they trade off
- Why a model that's cheap to call can be expensive to serve
- The shape of the rest of the book