
Flakes342 · 2025-09-02T17:27:12Z

Description

This PR fixes a bug in inference/engine.py where num_experts (moe_experts) was incorrectly passed as the expert parallel group size (ep_size) when creating expert parallel groups.

Currently:

if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.moe_experts)

This causes invalid behavior whenever num_experts > world_size, because _create_ep_parallel_group expects a group size, not the total number of experts, as pointed out by @Arnoochka.

Root Cause

num_experts = number of experts inside the MoE layer.

ep_size = how many GPUs to group together for expert parallelism.

These were mixed up in the code.
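To make the difference concrete, here is a minimal sketch in plain Python (no torch.distributed required). It assumes expert parallel groups are built from consecutive ranks, which may not match the exact layout used by _create_ep_parallel_group; the helper name ep_rank_groups is hypothetical and exists only for illustration.

```python
def ep_rank_groups(world_size: int, group_size: int) -> list[list[int]]:
    """Partition ranks 0..world_size-1 into consecutive groups of `group_size`.

    Hypothetical helper: the consecutive-rank layout is an assumption,
    not necessarily what DeepSpeed does internally.
    """
    num_groups = world_size // group_size
    return [
        list(range(g * group_size, (g + 1) * group_size))
        for g in range(num_groups)
    ]


world_size = 4    # GPUs in the job
num_experts = 8   # experts inside the MoE layer
ep_size = 2       # GPUs per expert parallel group

# Intended call: group size is ep_size.
print(ep_rank_groups(world_size, ep_size))      # [[0, 1], [2, 3]]

# Buggy call: group size is num_experts, which exceeds world_size,
# so integer division yields zero groups.
print(ep_rank_groups(world_size, num_experts))  # []
```

With num_experts > world_size the buggy argument produces no groups at all, which is the case the safety check described in the Fix section is meant to surface.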

Fix

Replaced the incorrect call with the proper ep_size argument:

if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.ep_size)

Additionally, added a safety check in _create_ep_parallel_group to catch invalid configurations:

num_ep_groups = dist.get_world_size() // moe_ep_size
if num_ep_groups == 0:
    raise ValueError(
        f"Invalid ep_size={moe_ep_size} for world_size={dist.get_world_size()}"
    )
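The check above only fires when the integer division yields zero groups. As a sketch of what a stricter variant might look like, the standalone function below also rejects a non-positive ep_size and a world size that is not a multiple of ep_size. These extra conditions and the function name validate_ep_size are assumptions for illustration and are not part of this PR.

```python
def validate_ep_size(world_size: int, ep_size: int) -> int:
    """Return the number of expert parallel groups, or raise ValueError.

    Hypothetical helper: only the "zero groups" condition comes from the PR;
    the other two checks are illustrative assumptions.
    """
    if ep_size <= 0:
        raise ValueError(f"ep_size must be positive, got ep_size={ep_size}")
    num_ep_groups = world_size // ep_size
    if num_ep_groups == 0:
        # Same condition as the PR's safety check (ep_size > world_size).
        raise ValueError(
            f"Invalid ep_size={ep_size} for world_size={world_size}"
        )
    if world_size % ep_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by ep_size={ep_size}"
        )
    return num_ep_groups


print(validate_ep_size(world_size=8, ep_size=2))  # 4
```

Checking ep_size <= 0 first avoids a ZeroDivisionError in the division that follows.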

Backward compatibility

If a user was already running with ep_size >= num_experts, the old code worked fine, and that configuration still works after this change.
The only behavior change is in the previously broken case (num_experts > world_size), which now works correctly.

tohtana

@Flakes342 Great catch, thank you for the fix!

tohtana · 2025-09-09T16:07:05Z

@Flakes342 nv-mii raised an error, but it is not related to this PR. Let's merge once you fix the formatting and DCO.

Flakes342 requested review from hwchen2017 and tohtana as code owners September 2, 2025 17:27

Flakes342 mentioned this pull request Sep 2, 2025

[BUG] InferenceEngine._create_ep_parallel_group uses num_experts instead of ep_size, causing incorrect behavior #7535

Closed

Flakes342 changed the title ~~[MoE] Fixed misuse of num_experts as expert parallel group size (ep_size)~~ [MoE] Fix misuse of num_experts as expert parallel group size (ep_size) Sep 3, 2025

tohtana approved these changes Sep 9, 2025


Flakes342 force-pushed the master branch from c3baa66 to bebccb1 Compare September 9, 2025 16:34

Flakes342 requested review from loadams and tjruwase as code owners September 9, 2025 16:34

Flakes342 force-pushed the master branch from 218fafd to 194e30f Compare September 9, 2025 17:41

Flakes342 requested review from GuanhuaWang and jomayeri as code owners September 9, 2025 17:41

Flakes342 force-pushed the master branch from 82f220b to 194e30f Compare September 9, 2025 18:31

Flakes342 closed this Sep 9, 2025

Flakes342 force-pushed the master branch from 421f632 to 533e834 Compare September 9, 2025 21:07

[MoE] Fix misuse of num_experts as expert parallel group size (ep_size) #7537

Flakes342 wants to merge 0 commits into deepspeedai:master from Flakes342:master

2 participants
