Chapter 6 · Modalities
Scaffolded: this chapter is outlined below but not yet written to depth.
Planned sections
- Vision language models — video processing, omni-modal models
- Embedding models — architecture and inference (and why they're throughput machines)
- ASR models — single-chunk vs long-file latency, diarization
- TTS models — streaming real-time speech, speech-to-speech
- Image generation — kernel optimization, the "one weird trick" for faster generation
- Video generation — attention optimization, quantization, context parallelism