Chapter 4 · Software

Scaffolded: not yet written in depth

The planned sections are outlined below.

Planned sections

CUDA — kernels for inference, kernel selection, kernel fusion to cut memory traffic
Frameworks and libraries — PyTorch, model file formats, ONNX Runtime, TensorRT, Transformers and Diffusers
Inference engines — vLLM, SGLang, TensorRT-LLM, and what differentiates them
NVIDIA Dynamo — disaggregated serving
Benchmarking and load testing — tooling, methodology, profiling