Thanks to visit codestin.com
Credit goes to github.com

Skip to content

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Notifications You must be signed in to change notification settings

Nanji-Huaji/DuoDecoding

Repository files navigation

DuoDecoding

This is an experiment framework based on Duodecoding.

Setup

Firstly, install the requirements with:

pip install -r requirements.txt

Then, download models in the following:

Llama Series:

Vicuna Series:

And put them to:

./<llama or vicuna>/<your-model-dir>

If you occur an path problem, you can modify model path on src/utils.py.

Model paths are defined on the zoo dict on the model_zoo function. And their vocab sizes are defined on the vocab_size dict on the same function.

Run

Run experiments via ./exp.py:

python exp.py

Args

This script is used to run given experiments automatically. Basically, the following args are needed:

Args Name Meaning Choice
eval_mode The decoding method selected for the experiment dist_spec, dist_split_spec, tridecoding, uncertainty_decoding, adaptive_decoding
draft_model The draft model for speculative decoding methods Currently, tiny-llama-1.1b, tiny-vicuna-1b, llama-68m, and vicuna-68m are available
target_model The target model for speculative decoding methods Currently, tiny-llama-1.1b, tiny-vicuna-1b, Llama-2-13b, and vicuna-13b-v1.5 are available.
small_model The smallest model for tridecoding methods Currently, llama-68m, and vicuna-68m are available
gamma The number of drafted tokens on speculative decoding.
Automatically ignored on autoregressive or tridecoding methods.
An int
gamma1 The number of drafted tokens in end-edge level of the tri-decoding method.
Automatically ignored on other methods.
An int
gamma2 The number of drafted tokens in the edge-cloud level of tri-decoding method.
Automatically ignored on other methods.
An int
edge_end_bandwidth The bandwidth between the edge and the end device.
Only available on the methods implemented for transmission simulation.
A float
edge_cloud_bandwidth The bandwidth between the edge and the cloud device.
Ibid.
A float
cloud_end_bandwidth The bandwidth between the edge and the end device.
Ibid.
A float
max_tokens Max tokens for each sample of the dataset. An int
temp Temperature for the target model. Default: 0.0 A float
exp_name The name of your experiment. A str
ntt_ms_edge_cloud [Depreciated] The non-transmission time between the edge and the cloud device. A float
ntt_ms_edge_end [Depreciated] Ibid. A float
transfer_top_k Args for top-k compression on the methods that were implemented for transmission simulation. An int

Note

Adaptive Decoding and Triadaptive Decoding require the acceptance prediction head path, whose checkpoints for tinyllama-1.1b and llama-2-13b are available on here.

Adding New Experiments

Experiments are defined based on the following Typeddict:

class ExpConfig(TypedDict):
    CUDA_VISIBLE_DEVICES: Literal['0', '1'] # Sorry for only considered only for the machines having 1 or 2 gpus.
    eval_mode: str
    edge_end_bandwidth: int
    edge_cloud_bandwidth: int
    cloud_end_bandwidth: int
    transfer_top_k: int
    exp_name: str
    use_precise: bool
    ntt_ms_edge_cloud: int | float
    ntt_ms_edge_end: int | float

ExpConfig is not the same as the table above, for currently only do the experiments for llama series. You can add args on the table above to the typeddict and pass them to the scripts if you need.

After defined your experiment configs, you can append them to the config_to_run list. And the experiment configs are to be run if the script is executed.

Reading the Results

table_generator_ver2.ipynb is used for generating the table of experiment results.

After you executed the experiment script, it automatically generates a json file named experiment_summary_<experiment_date>.json. You can change the file_path var to your json file and run the jupyter notebook:

# ... exixting codes
def main():
    # 读取实验数据
    file_path = "experiment_summary_20250920_145859.json" # change heres
# ... existing codes

Run via Bash Script

There are some experiment scirpts on cmds. Run them directly can also run the experiments.

About

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published