Chapter 6 · Modalities

Scaffolded — not yet written to depth

The planned sections are outlined below.

Planned sections

Vision language models — video processing, omni-modal models
Embedding models — architecture and inference (and why they're throughput machines)
ASR models — single-chunk vs long-file latency, diarization
TTS models — streaming real-time speech, speech-to-speech
Image generation — kernel optimization, the "one weird trick" for faster generation
Video generation — attention optimization, quantization, context parallelism