SCBench (SharedContextBench) is a comprehensive benchmark that evaluates efficient long-context methods in multi-turn and multi-request interactions, analyzing their performance across the full KV cache lifecycle (generation, compression, retrieval, and loading).
Note
- datasets >= 2.15.0
You can download and load the SCBench data with the Hugging Face datasets library (🤗 HF Repo):
from datasets import load_dataset
datasets = ["scbench_kv", "scbench_prefix_suffix", "scbench_vt", "scbench_repoqa", "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng", "scbench_many_shot", "scbench_summary", "scbench_mf", "scbench_summary_with_needles", "scbench_repoqa_and_kv"]
for dataset in datasets:
    data = load_dataset("microsoft/SCBench", dataset, split="test")
All data in SCBench are standardized to the following format:
{
    "id": "Random id for each piece of data.",
    "context": "The long context required for the task, such as repo-code, long-document, and many-shot.",
    "multi_turns": [{"input": "multi-turn question.", "answer": "multi-turn reference answer."}]
}
We implement the Multi-Turn and Multi-Request modes with HF and vLLM in two classes, GreedySearch and GreedySearch_vllm. Please refer to the following scripts to run the experiments.
For all methods:
cd scbench
# Single-GPU, in Multi-Turn Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 1 multi-turn
# Multi-GPU, in Multi-Turn Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 2 multi-turn
# Multi-GPU, in Multi-Request Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 2 scdq
For a single method:
cd scbench
# Single-GPU, in Multi-Turn Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 1 multi-turn vllm dense
# Multi-GPU, in Multi-Turn Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 2 multi-turn vllm dense
# Multi-GPU, in Multi-Request Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 2 scdq vllm dense
For more details about attn_type and kv_type, please refer to the section Supported Efficient Methods.
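The single-method invocations above all follow one pattern: three environment variables, then the script with five positional arguments (model, number of GPUs, mode, attn_type, kv_type). The sketch below assembles that command line as a string. `build_single_method_command` is a hypothetical helper, not part of the repository. Two things in it are assumptions based only on the examples shown here: that GPUs are numbered from 0, and that `multi-turn` and `scdq` are the only mode values.

```python
import shlex


def build_single_method_command(model, num_gpus, mode, attn_type, kv_type):
    """Assemble a run_single_method.sh command line (hypothetical helper)."""
    # Assumption: only the two mode strings that appear in the examples are valid.
    if mode not in ("multi-turn", "scdq"):
        raise ValueError(f"unknown mode: {mode!r}")
    env = {
        "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1",
        # Assumption: use GPUs 0..num_gpus-1, as in the examples.
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(num_gpus)),
        "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
    }
    env_prefix = " ".join(f"{key}={value}" for key, value in env.items())
    args = [
        "bash", "scripts/run_single_method.sh",
        model, str(num_gpus), mode, attn_type, kv_type,
    ]
    # shlex.quote leaves plain tokens untouched and quotes anything unsafe.
    return env_prefix + " " + " ".join(shlex.quote(arg) for arg in args)


command = build_single_method_command(
    "meta-llama/Llama-3.1-8B-Instruct", 2, "scdq", "vllm", "dense"
)
print(command)
```

The printed string can be pasted into a shell after `cd scbench`, which is handy when sweeping over several attn_type and kv_type combinations.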
First, build the environment (see basic environment).
Run the test:
bash scripts/test_llama.sh
Run multiple tasks in one command:
bash scripts/run_all_tasks.sh
Specify the max sequence length, max number of turns, and number of eval examples:
- --max_seq_length: The maximum sequence length for the test.
- --max_turns: The maximum number of turns for the test.
- --num_eval_examples: The number of test examples to use; all examples are used by default.
- --attn_type: The attention type to use.
- --kv_type: The KV cache type to use.
For example, run with MInference and SnapKV:
bash scripts/test_minference_with_snapkv.sh
The supported efficient long-context methods are as follows:
attn_type:
- dense: Dense attention
- minference: MInference
- a_shape: A-Shape
- tri_shape: Tri-Shape
kv_type:
- dense: Dense KV cache
- kivi: KIVI
- snapkv: SnapKV
- quest: Quest
- pyramidkv: PyramidKV
- streamingllm: StreamingLLM
You will need to build a specific environment for some attention types and KV cache types; see the Environment section for more details.
SCBench covers 12 diverse tasks that test four key long-context capabilities: string retrieval, semantic retrieval, global information processing, and multi-tasking.
String retrieval:
- scbench_kv: Tests key-value lookup in large JSON objects with random, incompressible content
- scbench_prefix_suffix: Evaluates finding strings with specific prefix and suffix patterns
- scbench_vt: Assesses multi-hop variable tracing capabilities in long inputs
Semantic retrieval:
- scbench_repoqa: Function retrieval from large codebases based on natural language descriptions
- scbench_qa_eng, scbench_qa_chn, scbench_choice_eng: Includes English QA, Chinese QA, and multi-choice questions on long texts
- Requires semantic understanding of long inputs
Global information processing:
- scbench_many_shot: Tests in-context learning with hundreds of examples
- scbench_mf: Statistical tasks on large arrays
- scbench_summary: Summarization of documents
- Requires global information processing or aggregation
Multi-tasking:
- scbench_summary_with_needles: Combines summarization with needle-in-haystack search
- scbench_repoqa_and_kv: Integrates code function retrieval with key-value lookup
- Requires multi-tasking or multi-step reasoning
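When aggregating results it can help to have the task-to-capability grouping in machine-readable form. The mapping below is a minimal sketch that restates the grouping listed above. The dictionary keys and the `capability_of` helper are illustrative names, not identifiers from the SCBench code.

```python
# Task names are the dataset configs on the Hugging Face repo; the capability
# keys are illustrative labels for the four capabilities named above.
TASKS_BY_CAPABILITY = {
    "string_retrieval": ["scbench_kv", "scbench_prefix_suffix", "scbench_vt"],
    "semantic_retrieval": [
        "scbench_repoqa", "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng",
    ],
    "global_information": ["scbench_many_shot", "scbench_mf", "scbench_summary"],
    "multi_tasking": ["scbench_summary_with_needles", "scbench_repoqa_and_kv"],
}


def capability_of(task):
    """Return the capability group a task belongs to (hypothetical helper)."""
    for capability, tasks in TASKS_BY_CAPABILITY.items():
        if task in tasks:
            return capability
    raise KeyError(task)


all_tasks = [task for tasks in TASKS_BY_CAPABILITY.values() for task in tasks]
print(len(all_tasks), capability_of("scbench_vt"))
```

Per-task scores can then be averaged within each group to report one number per capability.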
The benchmark evaluates these tasks across two shared context modes:
- Multi-turn Mode: the default mode of our SCBench
- Multi-request Mode: use --same_context_different_query to enable this mode
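To make the difference between the two modes concrete, the sketch below expands one record in the SCBench format (`context` plus a list of `multi_turns`) into the prompts each mode would issue. This illustrates the idea only and is not the repository's implementation. Plain string concatenation stands in for a chat template, `answer_fn` is a hypothetical stand-in for the model call, and feeding the model's own reply back into the history is an assumption.

```python
def multi_turn_prompts(sample, answer_fn):
    """One session: each turn sees the context plus all earlier turns."""
    history = [sample["context"]]
    prompts = []
    for turn in sample["multi_turns"]:
        history.append(turn["input"])
        prompts.append("\n".join(history))
        # Assumption: the model's reply (not the reference answer) joins the history.
        history.append(answer_fn(prompts[-1]))
    return prompts


def multi_request_prompts(sample):
    """Independent requests: every query sees only the shared context."""
    return [sample["context"] + "\n" + turn["input"] for turn in sample["multi_turns"]]


# Toy record in the SCBench format; the values are placeholders.
sample = {
    "id": "demo",
    "context": "CTX",
    "multi_turns": [
        {"input": "Q1", "answer": "A1"},
        {"input": "Q2", "answer": "A2"},
    ],
}

turn_prompts = multi_turn_prompts(sample, answer_fn=lambda prompt: "R")
request_prompts = multi_request_prompts(sample)
print(turn_prompts)
print(request_prompts)
```

In both modes every prompt starts with the same long context, which is what makes the KV cache for that prefix reusable across turns or requests.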
conda create -n scbench python=3.10 -y && conda activate scbench
pip install torch
pip install minference
pip install flash-attn --no-build-isolation
git clone https://github.com/microsoft/MInference.git && cd MInference/scbench
pip install -r requirements.txt
MInference natively supports many efficient long-context methods, but you will need to build a specific environment for the following methods:
kivi:
bash setup/setup_kivi.sh
- minference
  - best_pattern (loaded from config file)
- a_shape
  - n_local (default: 3968)
  - n_init (default: 128)
- tri_shape
  - n_local (default: 3968)
  - n_init (default: 128)
  - n_last (default: 100)
- kivi
  - bits (default: 2)
  - group_size (default: 32)
  - residual_length (default: 32)
- snapkv/pyramidkv
  - window_size (default: 32)
  - max_capacity_prompt (default: 4096)
  - kernel_size (default: 5)
  - pooling (default: "avgpool")
- quest
  - chunk_size (default: 16)
  - token_budget (default: 1024)
- streamingllm
  - n_local (default: 3968)
  - n_init (default: 128)
Note: All of these parameters can be overridden by passing custom values to --hyper_param on the command line, for example:
python run_multiturnbench.py .... --hyper_param '{"n_local": 4096}'
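One plausible way to picture the override is as a JSON object merged on top of the per-method defaults listed above. The sketch below shows that merge. `DEFAULT_HYPER_PARAMS` and `resolve_hyper_params` are hypothetical names, and the actual handling inside run_multiturnbench.py may differ.

```python
import argparse
import json

# Defaults copied from the list above; the table itself is an illustrative
# structure, not one taken from the SCBench code.
DEFAULT_HYPER_PARAMS = {
    "a_shape": {"n_local": 3968, "n_init": 128},
    "tri_shape": {"n_local": 3968, "n_init": 128, "n_last": 100},
    "kivi": {"bits": 2, "group_size": 32, "residual_length": 32},
    "snapkv": {"window_size": 32, "max_capacity_prompt": 4096,
               "kernel_size": 5, "pooling": "avgpool"},
    "pyramidkv": {"window_size": 32, "max_capacity_prompt": 4096,
                  "kernel_size": 5, "pooling": "avgpool"},
    "quest": {"chunk_size": 16, "token_budget": 1024},
    "streamingllm": {"n_local": 3968, "n_init": 128},
}


def resolve_hyper_params(method, override_json=None):
    """Merge a JSON override string onto the defaults for one method."""
    params = dict(DEFAULT_HYPER_PARAMS.get(method, {}))
    if override_json:
        params.update(json.loads(override_json))
    return params


parser = argparse.ArgumentParser()
parser.add_argument("--hyper_param", type=str, default=None)
# Parse an explicit argument list so the example runs without a real CLI call.
args = parser.parse_args(["--hyper_param", '{"n_local": 4096}'])

resolved = resolve_hyper_params("a_shape", args.hyper_param)
print(resolved)  # {'n_local': 4096, 'n_init': 128}
```

Keys absent from the override keep their defaults, so only the values being changed need to appear in the JSON string.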
SCBench is the first long-context benchmark that covers single-turn, multi-turn, and multi-request scenarios. In addition, our implementation includes KV cache reuse techniques, providing a more comprehensive analysis of the full KV cache lifecycle of efficient long-context methods.
@article{li2024scbench,
title={SCBench: A KV cache-centric analysis of long-context methods},
author={Li, Yucheng and Jiang, Huiqiang and Wu, Qianhui and Luo, Xufang and Ahn, Surin and Zhang, Chengruidong and Abdi, Amir H and Li, Dongsheng and Gao, Jianfeng and Yang, Yuqing and Qiu, Lili},
journal={arXiv preprint arXiv:2412.10319},
year={2024}
}

