
@hanq-moreh hanq-moreh commented Nov 10, 2025

Motivation

The goal of this pull request is to enable hot-swapping of the draft model without restarting the serving server.
Currently, updating the speculative draft model requires a full server restart, which interrupts ongoing requests and complicates integration with runtime speculative model training pipelines.

By allowing the draft model to be reloaded dynamically at runtime, we can:

  • Continuously update and redeploy the draft model while the service is running.
  • Avoid service downtime and reduce operational overhead.

Modifications

  • Added an is_draft_model field to UpdateWeightFromDiskReqInput in io_struct.py to distinguish draft model updates.
  • Implemented update_weights_from_disk() in eagle_worker.py.
  • Refactored the lm_head and embedding setup for the draft model into set_embed_and_head() in eagle_worker.py.
  • Updated update_weights_from_disk() in scheduler_update_weights_mixin.py to handle is_draft_model=True for draft model weight updates.
  • Added self.pending_weight_update_queue and maybe_process_pending_weight_update() in scheduler.py to defer weight updates until no running batch is active.
  • Added test code in test/srt/rl/test_update_weights_from_disk.py.
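
To illustrate the first two modifications, here is a minimal sketch of how a request carrying the new is_draft_model field could be routed to either the target or the draft worker. The class and field names follow the PR description; the dispatch helper and worker interface are assumptions for illustration, not the actual SGLang implementation.

```python
from dataclasses import dataclass


@dataclass
class UpdateWeightFromDiskReqInput:
    """Sketch of the request shape described above (io_struct.py)."""
    model_path: str
    is_draft_model: bool = False  # new field: route the update to the draft worker


def dispatch_update(req, target_worker, draft_worker):
    """Hypothetical helper: pick the worker based on is_draft_model."""
    worker = draft_worker if req.is_draft_model else target_worker
    return worker.update_weights_from_disk(req.model_path)
```

With this shape, the scheduler mixin only needs to branch on one flag, keeping the target-model update path unchanged.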
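
The deferral described in the scheduler change can be sketched as follows. Only pending_weight_update_queue and maybe_process_pending_weight_update() come from the PR description; the surrounding Scheduler skeleton and the _apply/applied names are assumptions for illustration.

```python
from collections import deque


class Scheduler:
    """Minimal sketch: defer weight updates until no batch is running."""

    def __init__(self):
        self.running_batch = None  # non-None while a batch is in flight
        self.pending_weight_update_queue = deque()
        self.applied = []  # record of applied updates (for illustration)

    def request_weight_update(self, req):
        # Queue the update; it is applied only once the scheduler is idle.
        self.pending_weight_update_queue.append(req)
        self.maybe_process_pending_weight_update()

    def maybe_process_pending_weight_update(self):
        # Defer while a batch is active to avoid swapping weights mid-decode.
        if self.running_batch is not None:
            return
        while self.pending_weight_update_queue:
            self._apply(self.pending_weight_update_queue.popleft())

    def _apply(self, req):
        # Placeholder for the actual on-disk weight reload.
        self.applied.append(req)
```

Calling maybe_process_pending_weight_update() again after the running batch finishes drains the queue, which is what lets updates land without interrupting ongoing requests.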

Accuracy Tests

Benchmarking and Profiling

Checklist
