Chapter 4 · Software
Scaffolded: this chapter is not yet written to depth. The planned sections are outlined below.
Planned sections
- CUDA — kernels for inference, kernel selection, kernel fusion to cut memory traffic
- Frameworks and libraries — PyTorch, model file formats, ONNX Runtime, TensorRT, Transformers and Diffusers
- Inference engines — vLLM, SGLang, TensorRT-LLM, and what differentiates them
- NVIDIA Dynamo — disaggregated serving
- Benchmarking and load testing — tooling, methodology, profiling