DeepSeek Group-Limited Expert Routing and Load Balancing

References

https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py

GitHub - deepseek-ai/EPLB: Expert Parallelism Load Balancer (https://github.com/deepseek-ai/EPLB)

DeepSeek-V3 Technical Report (arXiv:2412.19437)

DeepSeek's Routing Method

# Excerpt from DeepSeek-V3 inference/model.py; `linear` and `ModelArgs` are
# defined elsewhere in that file.
from typing import Tuple

import torch
from torch import nn


class Gate(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.n_groups = args.n_expert_groups
        self.topk_groups = args.n_limited_groups
        self.score_func = args.score_func
        self.route_scale = args.route_scale
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        # The gating bias is only used by the V3-sized model (dim == 7168).
        self.bias = nn.Parameter(torch.empty(args.n_routed_experts)) if self.dim == 7168 else None

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        scores = linear(x, self.weight)
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1, dtype=torch.float32)
        else:
            scores = scores.sigmoid()
        original_scores = scores
        if self.bias is not None:
            scores = scores + self.bias
        if self.n_groups > 1:  # n_groups = n_expert_groups = 8
            scores = scores.view(x.size(0), self.n_groups, -1)
            if self.bias is None:
                group_scores = scores.amax(dim=-1)
            else:
                # Group score = sum of the top-2 expert scores within the group.
                group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1)
            indices = group_scores.topk(self.topk_groups, dim=-1)[1]  # topk_groups = n_limited_groups = 4
            mask = torch.zeros_like(scores[..., 0]).scatter_(1, indices, True)
            scores = (scores * mask.unsqueeze(-1)).flatten(1)  # zero out the dropped groups
        indices = torch.topk(scores, self.topk, dim=-1)[1]  # topk = n_activated_experts = 8
        weights = original_scores.gather(1, indices)
        if self.score_func == "sigmoid":
            weights /= weights.sum(dim=-1, keepdim=True)
        weights *= self.route_scale
        return weights.type_as(x), indices

Every token uses 1 shared expert plus 256 routed experts, of which 8 are selected for the actual computation.

A conventional MoE simply takes the top-k experts by score, so the selected experts can scatter almost arbitrarily across all 256 routed experts.

DeepSeek instead splits the 256 experts into n_expert_groups = 8 contiguous groups. Each group is scored (by its single highest expert score, or by the sum of its top-2 expert scores when the gating bias is present, as in the code above), the best n_limited_groups = 4 groups are kept, and the scores of the remaining groups are zeroed. The final topk = 8 experts are therefore confined to those 4 groups, although there is no quota on how many experts are picked from each group. This bounds, to some extent, how widely the top-k selection can scatter.
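
To make the mechanism concrete, here is a minimal sketch of group-limited top-k on dummy scores. The sizes are hypothetical and small (16 experts, 4 groups, keep 2 groups, select 4 experts) rather than V3's 256/8/4/8; it mirrors the masking logic of the Gate above.

# Minimal sketch of group-limited top-k on random scores (hypothetical sizes).
import torch

n_experts, n_groups, topk_groups, topk = 16, 4, 2, 4
scores = torch.rand(1, n_experts)                       # [batch, n_experts] affinities

grouped = scores.view(1, n_groups, -1)                  # [batch, groups, experts_per_group]
group_scores = grouped.topk(2, dim=-1)[0].sum(dim=-1)   # group score: sum of its top-2 experts
keep = group_scores.topk(topk_groups, dim=-1)[1]        # ids of the groups to keep
mask = torch.zeros(1, n_groups).scatter_(1, keep, 1.0)  # 1.0 on kept groups, 0 elsewhere
masked = (grouped * mask.unsqueeze(-1)).flatten(1)      # zero dropped groups, back to [batch, n_experts]
experts = masked.topk(topk, dim=-1)[1]                  # final top-k lands only in kept groups

print(keep)      # e.g. tensor([[2, 0]])
print(experts)   # all 4 selected indices fall inside the kept groups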

Expert-Parallel Load Balancer

  • Core problem: a given MoE model has some naturally high-load ("hot") experts, so the expert computation load is unbalanced across GPUs.
  • Optimization goal: balance the expert computation on every GPU, i.e. minimize the maximum dispatch receive volume over all GPUs (see the sketch below).
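
As a minimal sketch of this objective on hypothetical numbers: the cost of a placement is the load on its heaviest GPU, because the slowest GPU bounds the latency of the whole MoE layer.

# Minimal sketch of the balancing objective on hypothetical data.
from collections import defaultdict

def max_gpu_load(expert_load, placement):
    """expert_load: expert id -> routed tokens; placement: expert id -> GPU id."""
    per_gpu = defaultdict(int)
    for expert, load in expert_load.items():
        per_gpu[placement[expert]] += load
    return max(per_gpu.values())

expert_load = {0: 900, 1: 120, 2: 80, 3: 100}           # expert 0 is naturally hot
placement = {0: 0, 1: 0, 2: 1, 3: 1}                    # experts 0,1 on GPU 0; 2,3 on GPU 1
print(max_gpu_load(expert_load, placement))             # 1020: GPU 0 is the straggler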

Moreover, thanks to the group-limited expert routing used in DeepSeek-V3, we also attempt to place the experts of the same group to the same node to reduce inter-node data traffic, whenever possible.

Prefill Stage: Hierarchical Load Balancing

Prefill: routed experts use EP32; MLA and the shared expert use DP32. One deployment unit is 4 nodes with 32 redundant routed experts; each GPU hosts 9 routed experts and 1 shared expert.

When the number of server nodes divides the number of expert groups, we use the hierarchical load balancing policy to harness the group-limited expert routing. We first pack the expert groups to nodes evenly, ensuring the loads of different nodes are balanced. Then, we replicate the experts within each node. Finally, we pack the replicated experts to individual GPUs to ensure different GPUs are load-balanced. The hierarchical load balancing policy can be used in prefilling stage with a smaller expert-parallel size.
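
The following is a simplified sketch of those three steps on hypothetical loads. Each step uses greedy "heaviest item into the lightest bin" packing; the real EPLB repo implements this in vectorized PyTorch and chooses per-expert replica counts jointly, so this only illustrates the structure.

# Simplified sketch of the hierarchical policy on hypothetical loads.

def pack(items, n_bins):
    """items: list of (id, load). Greedy LPT: heaviest item into the lightest bin."""
    bins = [[0, []] for _ in range(n_bins)]             # [total load, packed ids]
    for item_id, load in sorted(items, key=lambda t: -t[1]):
        lightest = min(bins, key=lambda b: b[0])
        lightest[0] += load
        lightest[1].append(item_id)
    return [b[1] for b in bins]

# Step 1: pack expert groups onto nodes (hypothetical: 4 groups, 2 nodes).
group_load = [("g0", 500), ("g1", 300), ("g2", 250), ("g3", 150)]
print(pack(group_load, n_bins=2))                       # [['g0', 'g3'], ['g1', 'g2']]

# Step 2 (within one node): replicate the hottest expert, splitting its load.
experts = [("e0", 400), ("e1", 100), ("e2", 100), ("e3", 50)]
hot_id, hot_load = max(experts, key=lambda t: t[1])
replicated = [e for e in experts if e[0] != hot_id]
replicated += [(hot_id + "-a", hot_load // 2), (hot_id + "-b", hot_load // 2)]

# Step 3: pack this node's (replicated) experts onto its GPUs.
print(pack(replicated, n_bins=2))                       # [['e0-a', 'e1', 'e3'], ['e0-b', 'e2']]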

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.
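
A minimal sketch of this periodic selection on hypothetical counters: after each interval, the experts with the highest observed token counts become the redundancy set (32 of them for V3's prefill), and each duplicate then absorbs part of its expert's traffic.

# Minimal sketch of periodic redundant-expert selection on hypothetical counters.
import heapq

def pick_redundant(expert_tokens, n_redundant):
    """Return the ids of the n_redundant highest-load experts."""
    return heapq.nlargest(n_redundant, range(len(expert_tokens)),
                          key=lambda e: expert_tokens[e])

stats = [120, 3000, 90, 2500, 110, 80, 95, 100]         # tokens per expert this interval
print(pick_redundant(stats, n_redundant=2))             # [1, 3]: the two hot experts
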
Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.
Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.
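
The report does not spell out how the globally optimal routing scheme is computed. The following hypothetical greedy only illustrates the degree of freedom that dynamic redundancy opens up: when an expert lives on several GPUs, each token's work can go to whichever replica's GPU currently has the least pending load.

# Hypothetical greedy replica selection; not DeepSeek's published algorithm.
from collections import defaultdict

replicas = {0: [0, 1], 1: [1], 2: [2, 3], 3: [3]}       # expert id -> GPUs hosting a copy
gpu_load = defaultdict(int)

def dispatch(token_experts):
    """token_experts: per-token list of routed expert ids. Returns (token, expert, gpu)."""
    plan = []
    for t, experts in enumerate(token_experts):
        for e in experts:
            gpu = min(replicas[e], key=lambda g: gpu_load[g])  # least-loaded replica
            gpu_load[gpu] += 1
            plan.append((t, e, gpu))
    return plan

print(dispatch([[0, 2], [0, 1], [0, 3]]))               # expert 0's tokens split over GPUs 0 and 1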

Decoding Stage: Global Load Balancing

Decode: routed experts use EP144; MLA and the shared expert use DP144. One deployment unit is 18 nodes with 32 redundant routed experts; each GPU hosts 2 routed experts and 1 shared expert. (Note that the technical-report excerpt below describes a different, larger configuration: EP320 over 40 nodes.)

In other cases, we use the global load balancing policy that replicates the experts globally regardless of expert groups, and pack the replicated experts to individual GPUs. This policy can be adopted in decoding stage with a larger expert-parallel size.
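
Here is a simplified sketch of the global policy on hypothetical loads: ignore groups, repeatedly grant an extra replica to the expert with the heaviest per-replica load until the redundancy budget is spent, then greedily pack every replica onto a GPU. The real EPLB repo solves the same two phases in vectorized form.

# Simplified sketch of the global policy on hypothetical loads.
import heapq

def global_balance(expert_load, n_redundant, n_gpus):
    # Phase 1: replica counts. Heap keys are negative per-replica loads.
    heap = [(-load, e, 1) for e, load in enumerate(expert_load)]
    heapq.heapify(heap)
    for _ in range(n_redundant):
        neg, e, r = heapq.heappop(heap)                 # heaviest per-replica load
        heapq.heappush(heap, (neg * r / (r + 1), e, r + 1))
    # Phase 2: greedy packing, heaviest replica first, onto the lightest GPU.
    replicas = sorted(((-neg, e) for neg, e, r in heap for _ in range(r)), reverse=True)
    gpus = [[0.0, []] for _ in range(n_gpus)]
    for load, e in replicas:
        lightest = min(gpus, key=lambda g: g[0])
        lightest[0] += load
        lightest[1].append(e)
    return gpus

# 8 hypothetical experts with expert 0 hot; 2 redundant slots, 4 GPUs.
print(global_balance([900, 120, 80, 100, 60, 40, 30, 20], 2, 4))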

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency. 

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

在 better-sqlite3 中&#xff0c;.exec() 方法用于执行包含多个 SQL 语句的字符串。与预编译语句相比&#xff0c;这种方法性能较差且安全性较低&#xff0c;但有时它是必要的&#xff0c;特别是当你需要从外部文件&#xff08;如 SQL 脚本&#xff09;中执行多个 SQL 语句时。…