Nvidia Dynamo, 高效的LLM分布式推理框架
【本文持续更新中..】内容目前基于2025.05及之前的内容整理总结,请注意时效性。
概览
先了解一下dynamo是什么东西,到官网看一下官方介绍,可以看到三个dynamo相关的概念,我们逐个看下介绍:
- NVIDIA Dynamo Platform:关键字已经标出来了,是一个”推理平台”,支持”任何框架、架构、部署规模”的模型。
- NVIDIA Dynamo:继续看关键字,”推理框架”,”分布式环境”场景支持,支持”所有主流推理后端”,支持分离部署。
- NVIDIA Dynamo-Triton:Dynamo-Triton看起来暂时没有什么亮点,似乎只是直接把NVIDIA之前的Triton改了个名字。
然后官网也给了一张图,可以直观的看到上述三个产品的定位与关系:
题外话…对比dynamo在nvidia大陆和官方的文档,发现目前在大陆的介绍页,只有dynamo框架,没有dynamo平台介绍。
https://developer.nvidia.com/dynamo,https://developer.nvidia.cn/dynamo
再到github看下又说了什么,先看个大概,主要是NVIDIA Dynamo框架的内容:关键信息是这个”推理引擎无关”。
然后接下来是列举了Dynamo的几个主要特性:包含支持PD分离、LLM感知请求路由等,关键特性后文我们逐一展开。
为什么需要Dynamo
Nsight cannot serve or orchestrate models (coordination and mgmt of AI models, systems, and integrations).
Triton is for general AI inference serving - it’s not optimized for distributed LLM workloads.
TensorRT-LLM, vLLM - For “Engine-level LLM inference acceleration” (basically optimizing core software/hardware/architecture-like using faster kernels, better memory handling, parallelization in the inference engine itself). So it’s not a fully distributed serving/orchestration.
Dynamo on the other hand, is for Distributed, multi-node, LLM-optimized inference. It’s designed for new scale and efficiency demands.
Dynamo提供了什么能力
GPU Resource Planner: A planning and scheduling engine that monitors capacity and prefill activity in multi-node deployments to adjust GPU resources and allocate them across prefill and decode.
Smart Router: A KV-cache-aware routing engine that efficiently directs incoming traffic across large GPU fleets in multi-node deployments to minimize costly re-computations.
Low Latency Communication Library: State-of-the-art inference data transfer library that accelerates the transfer of KV cache between GPUs and across heterogeneous memory and storage types.
KV Cache Manager: A cost-aware KV cache offloading engine designed to transfer KV cache across various memory hierarchies, freeing up valuable GPU memory while maintaining user experience.
Dynamo详解
https://github.com/ai-dynamo/dynamo/tree/main/docs/architecture
架构文档
关键要素
前端(Frontend):采用 OpenAI 类似 http 请求服务,响应用户请求
路由(Router):根据一定的策略,把前端请求分配到不同 worker
处理单元(Workers): 有 prefill/decode 两种 worker 处理 LLM 计算逻辑
实践
Deploying Dynamo Inference Graphs to Kubernetes using Helm:https://github.com/ai-dynamo/dynamo/blob/release/0.2.0/docs/guides/dynamo_deploy/manual_helm_deployment.md
参考链接
https://developer.nvidia.com/dynamo
https://github.com/ai-dynamo/dynamo#
https://github.com/ramanahm1/dynamo-learnings
https://cloud.tencent.com/developer/article/2515122?policyId=20240001&frompage=seopage
https://mp.weixin.qq.com/s/t9rm_rG2NwXaZLe_SF5_hg
版权声明
本博客所有原创内容,均采用 CC BY-NC-SA 4.0 协议,转载请注明出处。