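[This article is updated on an ongoing basis.] The content is a summary of material available as of May 2025 and earlier, so please keep its timeliness in mind.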

Overview

First, let's get a sense of what Dynamo is. The official NVIDIA site introduces three Dynamo-related concepts, so we go through them one by one:

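NVIDIA Dynamo Platform: going by the keywords highlighted in the official description, it is an "inference platform" that supports models of "any framework, architecture, or deployment scale".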

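NVIDIA Dynamo: again going by the keywords, it is an "inference framework" for "distributed environments" that supports "all major inference backends" as well as disaggregated deployment.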

NVIDIA Dynamo-Triton: for now Dynamo-Triton does not seem to offer anything notable. It appears to be NVIDIA's existing Triton under a new name.

The official site also provides a diagram that shows the positioning of these three products and how they relate to one another:

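As an aside: comparing the Dynamo documentation on NVIDIA's mainland China site with the global site, the mainland China page currently introduces only the Dynamo framework and has no introduction to the Dynamo platform.
https://developer.nvidia.com/dynamo, https://developer.nvidia.cn/dynamo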

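Next, let's see what the GitHub repository says. At a high level it mainly covers the NVIDIA Dynamo framework, and the key message is that the framework is "inference engine agnostic".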

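It then lists Dynamo's main features, including disaggregated prefill and decode (PD disaggregation) and LLM-aware request routing. The key features are covered one by one later in this article.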

Why Dynamo is needed

Nsight cannot serve or orchestrate models (that is, coordinate and manage AI models, systems, and integrations).

Triton is for general AI inference serving; it is not optimized for distributed LLM workloads.

TensorRT-LLM and vLLM provide engine-level LLM inference acceleration: they optimize the inference engine itself, for example with faster kernels, better memory handling, and parallelization. They are not full distributed serving and orchestration layers.

Dynamo, on the other hand, targets distributed, multi-node, LLM-optimized inference. It is designed for new demands of scale and efficiency.

What capabilities Dynamo provides

GPU Resource Planner: A planning and scheduling engine that monitors capacity and prefill activity in multi-node deployments to adjust GPU resources and allocate them across prefill and decode.
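To make the idea of shifting GPUs between prefill and decode concrete, here is a minimal sketch. It is not Dynamo's planner: `PoolLoad`, `rebalance`, and the utilization thresholds are all hypothetical names and values chosen for illustration, and the policy (move one GPU at a time from an idle pool to a saturated one) is only an assumption about what such a planner might do.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PoolLoad:
    """Hypothetical load snapshot for one worker pool (not a Dynamo type)."""
    gpus: int            # GPUs currently assigned to this pool
    utilization: float   # observed busy fraction in [0, 1]


def rebalance(prefill: PoolLoad, decode: PoolLoad,
              high: float = 0.9, low: float = 0.5) -> Tuple[int, int]:
    """Return a new (prefill_gpus, decode_gpus) split.

    Assumed policy: if one pool is saturated (> high) while the other is
    mostly idle (< low), move a single GPU across. A pool is never shrunk
    below one GPU. Otherwise the split is left unchanged.
    """
    if prefill.utilization > high and decode.utilization < low and decode.gpus > 1:
        return prefill.gpus + 1, decode.gpus - 1
    if decode.utilization > high and prefill.utilization < low and prefill.gpus > 1:
        return prefill.gpus - 1, decode.gpus + 1
    return prefill.gpus, decode.gpus


if __name__ == "__main__":
    # Prefill is saturated and decode is mostly idle, so one GPU moves to prefill.
    print(rebalance(PoolLoad(gpus=4, utilization=0.95),
                    PoolLoad(gpus=4, utilization=0.30)))
```

A real planner would likely look at richer signals (queue depth, latency targets) and would need to account for the cost of moving a GPU between roles. The sketch only shows the feedback-loop shape of the decision.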

Smart Router: A KV-cache-aware routing engine that efficiently directs incoming traffic across large GPU fleets in multi-node deployments to minimize costly re-computations.
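The routing idea can be sketched as follows: send a request to the worker that already holds the longest cached prefix of its tokens, discounted by how busy that worker is. This is a toy illustration under assumptions, not Dynamo's router. The function names, the `BLOCK_SIZE` value, the use of raw token tuples as block keys, and the `load_weight` scoring formula are all hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

BLOCK_SIZE = 4  # tokens per KV block; an illustrative value only


def prefix_blocks(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[Tuple[int, ...]]:
    """Return one key per full block, each key being the whole prefix up to that block."""
    n_full = len(tokens) // block_size
    return [tuple(tokens[: (i + 1) * block_size]) for i in range(n_full)]


def cached_prefix_len(tokens: Sequence[int], cached: Set[Tuple[int, ...]]) -> int:
    """Count how many leading blocks of the request are already cached."""
    hits = 0
    for key in prefix_blocks(tokens):
        if key not in cached:
            break  # KV reuse requires a contiguous prefix, so stop at the first miss
        hits += 1
    return hits


def pick_worker(tokens: Sequence[int],
                worker_caches: Dict[str, Set[Tuple[int, ...]]],
                worker_load: Dict[str, int],
                load_weight: float = 0.5) -> str:
    """Pick the worker with the highest (cached blocks - load_weight * load) score."""
    def score(worker: str) -> float:
        return cached_prefix_len(tokens, worker_caches[worker]) - load_weight * worker_load[worker]
    # Iterate workers in sorted order so ties resolve deterministically.
    return max(sorted(worker_caches), key=score)


if __name__ == "__main__":
    request = list(range(8))
    caches = {"a": set(prefix_blocks(request)), "b": set()}
    print(pick_worker(request, caches, {"a": 1, "b": 0}))
```

The load term matters: without it, a worker with a popular prefix would attract all matching traffic even when it is overloaded, and recomputing the prefix elsewhere could be cheaper than waiting.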

Low Latency Communication Library: State-of-the-art inference data transfer library that accelerates the transfer of KV cache between GPUs and across heterogeneous memory and storage types.

KV Cache Manager: A cost-aware KV cache offloading engine designed to transfer KV cache across various memory hierarchies, freeing up valuable GPU memory while maintaining user experience.
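As a rough illustration of offloading, the sketch below keeps a small "GPU" tier and demotes blocks to a "host" tier instead of discarding them. It is a toy model, not Dynamo's KV Cache Manager: `TieredKVCache` is a hypothetical class, both tiers are ordinary Python containers, and plain LRU stands in for whatever cost-aware policy the real component uses.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small 'gpu' tier backed by an unbounded 'host' tier."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> payload, least recently used first
        self.host = {}            # blocks offloaded from the gpu tier

    def put(self, block_id, payload):
        """Insert a block into the gpu tier, offloading LRU blocks if it overflows."""
        self.host.pop(block_id, None)       # a block lives in exactly one tier
        self.gpu[block_id] = payload
        self.gpu.move_to_end(block_id)      # mark as most recently used
        while len(self.gpu) > self.gpu_capacity:
            old_id, old_payload = self.gpu.popitem(last=False)
            self.host[old_id] = old_payload  # offload instead of discarding

    def get(self, block_id):
        """Return the payload, promoting it from host to gpu if needed."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.host:
            payload = self.host.pop(block_id)
            self.put(block_id, payload)     # promote back; may offload another block
            return payload
        return None  # full miss: the caller would have to recompute the KV block


if __name__ == "__main__":
    cache = TieredKVCache(gpu_capacity=2)
    for name in ("a", "b", "c"):
        cache.put(name, f"kv-{name}")
    print(list(cache.gpu), list(cache.host))
```

A cost-aware policy would replace the LRU choice with one that weighs the transfer cost of each tier against the cost of recomputing the block.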