Chapter 4 · Software

Scaffolded, not yet written in depth. The planned sections are outlined below.

Planned sections

  • CUDA — kernels for inference, kernel selection, kernel fusion to cut memory traffic
  • Frameworks and libraries — PyTorch, model file formats, ONNX Runtime, TensorRT, Transformers and Diffusers
  • Inference engines — vLLM, SGLang, TensorRT-LLM, and what differentiates them
  • NVIDIA Dynamo — disaggregated serving
  • Benchmarking and load testing — tooling, methodology, profiling