Chapter 1 · Prerequisites¶
Scaffolded — not yet written to depth
Outlined below.
Before optimizing inference you need to know what you're serving, to whom, and what "good" means. This chapter is about framing the problem so the later chapters have a target.
Planned sections¶
- Scale and specialization — when a small specialized model beats a large general one
- About your app — AI-native vs feature; online vs offline; consumer vs B2B, and how each shapes the latency budget
- Model selection — evaluation, fine-tuning for domain quality, distillation
- Measuring latency and throughput — percentiles (p50/p95/p99), TTFT, TPS, and end-to-end metrics that actually reflect user experience