Chapter 6 · Modalities

Scaffolded: not yet written in depth

Outlined below.

Planned sections

  • Vision language models — video processing, omni-modal models
  • Embedding models — architecture and inference (and why they're throughput machines)
  • ASR models — single-chunk vs long-file latency, diarization
  • TTS models — streaming real-time speech, speech-to-speech
  • Image generation — kernel optimization, the "one weird trick" for faster generation
  • Video generation — attention optimization, quantization, context parallelism