SCBench

SCBench (SharedContextBench) is a comprehensive benchmark that evaluates efficient long-context methods in multi-turn and multi-request interactions, analyzing their performance across the full KV cache lifecycle (generation, compression, retrieval, and loading).

Note

Requires datasets >= 2.15.0.

Load Data

You can download and load the SCBench data with the Hugging Face datasets library (🤗 HF Repo):

from datasets import load_dataset

datasets = ["scbench_kv", "scbench_prefix_suffix", "scbench_vt", "scbench_repoqa", "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng", "scbench_many_shot", "scbench_summary", "scbench_mf", "scbench_summary_with_needles", "scbench_repoqa_and_kv"]

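# Load the test split of every task, keyed by task name,
# so that earlier tasks are not overwritten by later ones.
data = {name: load_dataset("microsoft/SCBench", name, split="test") for name in datasets}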

Data Format

All data in SCBench are standardized to the following format:

{
    "id": "Random id for each piece of data.",
    "context": "The long context required for the task, such as repo-code, long-document, and many-shot.",
    "multi_turns": [{"input": "multi-turn question.", "answer": "multi-turn reference answer."}],
}
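
As a minimal sketch of how a record in this format might be consumed, the snippet below walks the turns of a hand-written example. The record contents and the `iter_turns` helper are illustrative assumptions and are not part of SCBench.

```python
from typing import Iterator, Tuple

# Hand-written record that follows the documented schema (values are made up).
example = {
    "id": "demo-0",
    "context": "A long document would go here.",
    "multi_turns": [
        {"input": "What is the first question?", "answer": "first"},
        {"input": "What is the second question?", "answer": "second"},
    ],
}


def iter_turns(record: dict) -> Iterator[Tuple[int, str, str]]:
    """Yield (turn_index, question, reference_answer) for each turn."""
    for index, turn in enumerate(record["multi_turns"]):
        yield index, turn["input"], turn["answer"]


for index, question, answer in iter_turns(example):
    print(f"[{example['id']}] turn {index}: {question} -> {answer}")
```

Every turn of a record shares the same `context`, which is what makes KV cache reuse across turns possible.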

Experiments

We implement the Multi-Turn and Multi-Request modes with HF and vLLM in two classes, GreedySearch and GreedySearch_vllm. Please refer to the following scripts to run the experiments.

To run all methods:

cd scbench
# Single-GPU, in Multi-Turn Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 1 multi-turn
# Multi-GPU, in Multi-Turn Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 2 multi-turn
# Multi-GPU, in Multi-Request Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 2 scdq

To run a single method:

cd scbench
# Single-GPU, in Multi-Turn Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 1 multi-turn vllm dense
# Multi-GPU, in Multi-Turn Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 2 multi-turn vllm dense
# Multi-GPU, in Multi-Request Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 2 scdq vllm dense

For more details about attn_type and kv_type, please refer to the section Supported Efficient Methods.

Run the benchmark

First, build the environment (see Basic dependencies).

Run the test:

bash scripts/test_llama.sh

Run multiple tasks in one command:

bash scripts/run_all_tasks.sh

Specify the max sequence length, max number of turns, and number of eval examples:

--max_seq_length: The maximum sequence length for the test.
--max_turns: The maximum number of turns for the test.
--num_eval_examples: The number of test examples to use; all examples are used by default.

Run with efficient long-context methods:

--attn_type: The attention type to use.
--kv_type: The KV cache type to use.

For example, run with MInference and SnapKV:

bash scripts/test_minference_with_snapkv.sh

The supported efficient long-context methods are as follows:

attn_type:

dense: Dense attention
minference: MInference
a_shape: A-Shape
tri_shape: Tri-Shape

kv_type:

dense: Dense KV cache
kivi: KIVI
snapkv: SnapKV
quest: Quest
pyramidkv: PyramidKV
streamingllm: StreamingLLM

Some attention types and KV cache types require their own environment; see the section Environment for efficient long-context methods for details.

Dataset and Tasks Description

SCBench covers 12 diverse tasks that test four key long-context capabilities: string retrieval, semantic retrieval, global information processing, and multi-tasking.

String Retrieval

scbench_kv: Tests key-value lookup in large JSON objects with random, incompressible content
scbench_prefix_suffix: Evaluates finding strings with specific prefix and suffix patterns
scbench_vt: Assesses multi-hop variable tracing capabilities in long inputs
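
To illustrate the shape of a key-value lookup task, here is a toy sketch that builds a small JSON context of random pairs and one lookup question. This is not how the SCBench data was generated; the `make_kv_example` helper, the question wording, and the use of UUIDs as a stand-in for "random, incompressible content" are all assumptions made for illustration.

```python
import json
import uuid


def make_kv_example(num_pairs: int, target_index: int) -> dict:
    """Build a toy record: a JSON object of random pairs plus one lookup turn."""
    # Random UUIDs have no exploitable structure, so the value cannot be guessed.
    pairs = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(num_pairs)}
    target_key = list(pairs)[target_index % num_pairs]
    return {
        "id": "toy-kv",
        "context": json.dumps(pairs),
        "multi_turns": [
            {
                "input": f"What is the value for key {target_key}?",
                "answer": pairs[target_key],
            }
        ],
    }


toy = make_kv_example(num_pairs=5, target_index=2)
print(toy["multi_turns"][0]["input"])
```

The only way to answer is to retrieve the exact string from the context, which is why this kind of task is sensitive to lossy KV cache compression.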

Semantic Retrieval

scbench_repoqa: Function retrieval from large codebases based on natural language descriptions
scbench_qa_eng, scbench_qa_chn, scbench_choice_eng: Includes English QA, Chinese QA, and multi-choice questions on long texts
Requires semantic understanding of long inputs

Global Information Processing

scbench_many_shot: Tests in-context learning with hundreds of examples
scbench_mf: Statistical tasks on large arrays
scbench_summary: Summarization of documents
Requires global information processing or aggregation

Multi-Tasking

scbench_summary_with_needles: Combines summarization with needle-in-haystack search
scbench_repoqa_and_kv: Integrates code function retrieval with key-value lookup
Requires multi-tasking or multi-step reasoning

Two Shared Context Modes

The benchmark evaluates these tasks across two shared context modes:

Multi-turn Mode: the default mode of our SCBench
Multi-request Mode: use --same_context_different_query to enable this mode
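
The difference between the two modes can be sketched as follows. The sketch assumes that in multi-turn mode each query sees the shared context plus all earlier turns of the same session, while in multi-request mode each query is an independent request over the same context. The prompt layout and function names are hypothetical, and the reference answer stands in for the model's earlier reply only to keep the example model-free.

```python
from typing import List

# Hand-written record in the SCBench data format (values are made up).
record = {
    "id": "demo-0",
    "context": "CTX",
    "multi_turns": [
        {"input": "Q1", "answer": "A1"},
        {"input": "Q2", "answer": "A2"},
    ],
}


def multi_turn_prompts(rec: dict) -> List[List[str]]:
    """One growing session: each prompt carries the context and all prior turns."""
    prompts, history = [], [rec["context"]]
    for turn in rec["multi_turns"]:
        history.append(turn["input"])
        prompts.append(list(history))
        history.append(turn["answer"])  # stand-in for the model's reply
    return prompts


def multi_request_prompts(rec: dict) -> List[List[str]]:
    """Independent requests: each prompt is the shared context plus one query."""
    return [[rec["context"], turn["input"]] for turn in rec["multi_turns"]]


print(multi_turn_prompts(record))
print(multi_request_prompts(record))
```

In both cases the prefix `CTX` is identical across prompts, so its KV cache can in principle be computed once and reused.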

Environment for efficient long-context methods

Basic dependencies:

conda create -n scbench python=3.10 -y && conda activate scbench
pip install torch
pip install minference
pip install flash-attn --no-build-isolation
git clone https://github.com/microsoft/MInference.git && cd MInference/scbench
pip install -r requirements.txt

(Optional) Environment for efficient long-context methods

MInference natively supports many efficient long-context methods, but you will need to build a specific environment for the following methods:

kivi:

bash setup/setup_kivi.sh

Hyper-parameters of efficient long-context methods

--attn_type

minference

best_pattern (loaded from config file)

a_shape

n_local (default: 3968)
n_init (default: 128)

tri_shape

n_local (default: 3968)
n_init (default: 128)
n_last (default: 100)
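
To make n_init, n_local, and n_last concrete, here is a rough sketch of the attention masks these parameters suggest. It assumes that A-shape lets each query attend to the first n_init tokens plus a causal local window of n_local tokens, and that Tri-shape additionally gives the last n_last queries full causal attention. These semantics are an assumption not stated in this README, and the function names are hypothetical, so check the MInference implementation for the exact definitions.

```python
from typing import List


def a_shape_mask(seq_len: int, n_init: int, n_local: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q            # never attend to future tokens
            is_init = k < n_init       # initial tokens are always visible
            is_local = q - k < n_local # recent tokens inside the local window
            row.append(causal and (is_init or is_local))
        mask.append(row)
    return mask


def tri_shape_mask(seq_len: int, n_init: int, n_local: int, n_last: int) -> List[List[bool]]:
    """A-shape mask where the last n_last queries use full causal attention."""
    mask = a_shape_mask(seq_len, n_init, n_local)
    for q in range(max(0, seq_len - n_last), seq_len):
        for k in range(q + 1):
            mask[q][k] = True
    return mask


# Tiny sizes so the pattern is visible; the defaults listed above are far larger.
for row in tri_shape_mask(seq_len=8, n_init=1, n_local=2, n_last=2):
    print("".join("x" if allowed else "." for allowed in row))
```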

--kv_type

kivi

bits (default: 2)
group_size (default: 32)
residual_length (default: 32)

snapkv/pyramidkv

window_size (default: 32)
max_capacity_prompt (default: 4096)
kernel_size (default: 5)
pooling (default: "avgpool")

quest

chunk_size (default: 16)
token_budget (default: 1024)

streamingllm

n_local (default: 3968)
n_init (default: 128)

Note: All of these parameters can be overridden by passing custom values to --hyper_param on the command line, for example:

python run_scbench.py .... --hyper_param '{"n_local": 4096}'
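
The override behaviour can be pictured as a dictionary merge. The sketch below assumes the JSON string passed to --hyper_param simply replaces matching default keys; how the benchmark script actually merges the values is not specified here, and `apply_overrides` is a hypothetical helper.

```python
import json

# Defaults copied from the streamingllm entry above.
STREAMINGLLM_DEFAULTS = {"n_local": 3968, "n_init": 128}


def apply_overrides(defaults: dict, hyper_param_json: str) -> dict:
    """Return a copy of defaults with keys from the JSON string overriding them."""
    overrides = json.loads(hyper_param_json) if hyper_param_json else {}
    if not isinstance(overrides, dict):
        raise ValueError("--hyper_param must be a JSON object")
    return {**defaults, **overrides}


config = apply_overrides(STREAMINGLLM_DEFAULTS, '{"n_local": 4096}')
print(config)
```

Keys that are not overridden, such as n_init here, keep their default values.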

Compared to previous long-context benchmarks

SCBench is the first long-context benchmark that covers single-turn, multi-turn, and multi-request scenarios. In addition, our implementation incorporates KV cache reuse techniques, thereby providing a more comprehensive analysis of the full KV cache lifecycle of efficient long-context methods.

Citation

@article{li2024scbench,
    title={SCBench: A KV cache-centric analysis of long-context methods},
    author={Li, Yucheng and Jiang, Huiqiang and Wu, Qianhui and Luo, Xufang and Ahn, Surin and Zhang, Chengruidong and Abdi, Amir H and Li, Dongsheng and Gao, Jianfeng and Yang, Yuqing and Qiu, Lili},
    journal={arXiv preprint arXiv:2412.10319},
    year={2024}
}
