🤗 Dataset on HuggingFace
📄 Official Paper
NeedleChain is a benchmark designed to evaluate LLMs' intact long-context understanding. Every provided context consists entirely of query-relevant information, so answering the query requires comprehending the context as a whole.
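For intuition (our own illustration inferred from the generation flags below, not an excerpt from the dataset): a forward chain of k needles states each person's salary relative to the previous person's, so a query about the last person can only be answered by traversing all k needles.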
The sample data used in our experiments is provided in the ./data folder.
We also release the full generation code so the benchmark can be flexibly configured; anyone can generate custom data with the command below.
```bash
# --k: number of needles per chain
# --n: number of chains per dataset
# --val: standard salary value for each needle
# --results_dir: directory to save the generated data
python make_data.py \
  --k 5 \
  --n 200 \
  --val 1600 \
  --results_dir "./data"
```
Our codebase is built on the vLLM framework. We ran our experiments on eight RTX A6000 GPUs with the following requirements:
- torch==2.6.0+cu124
- vllm==0.8.5
- transformers==4.53.1
- openai==1.97.0
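A minimal install sketch for these pinned versions (the cu124 PyTorch index URL is our assumption based on the `+cu124` build tag; adjust it for your CUDA setup):

```bash
# Pinned versions from the list above; the index URL for the +cu124 torch
# build is an assumption -- change it to match your CUDA installation.
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5 transformers==4.53.1 openai==1.97.0
```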
First, serve a model in the background:

```bash
nohup python model_serve.py --model_name QwQ > logs/model &
```
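If model_serve.py starts vLLM's OpenAI-compatible server on the default port 8000 (an assumption; check the script for the actual endpoint), you can confirm the model is loaded with:

```bash
# Assumes the default vLLM OpenAI-compatible endpoint on port 8000.
curl http://localhost:8000/v1/models
```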
For the available `model_name` options, see `model_arg_dict` in `utils.py`:
```python
model_arg_dict = {
    'qwen2.5-32B': 'Qwen/Qwen2.5-32B-Instruct',
    'qwen3-32B': 'Qwen/Qwen3-32B',
    'llama3.3-70B': 'meta-llama/Llama-3.3-70B-Instruct',
    'llama3.1-DS': 'deepseek-ai/DeepSeek-R1-Distill-Llama-70B',
    'qwen2.5-DS': 'deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
    'qwen_long': 'Qwen/QwenLong-L1-32B',
    'QwQ': 'Qwen/QwQ-32B',
}
```
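Any key in this dict can be passed to model_serve.py; for example, to serve the Llama 3.3 70B checkpoint instead:

```bash
# Serve a different model from model_arg_dict in the background.
nohup python model_serve.py --model_name llama3.3-70B > logs/llama &
```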
Then run inference against the served model:

```bash
# --model_name:    a key of model_arg_dict above
# --output_name:   any desired name for the output file
# --chain_type:    one of [forward, backward, chaotic (mixed), parallel (needlestack)]
# --question_type: one of [single, total]
# --k:             one of [5, 10, 20, 50, 100, 200]
# --results_dir:   any desired results directory
python inference_call.py \
  --model_name QwQ \
  --output_name QwQ_output \
  --chain_type forward \
  --question_type single \
  --k 10 \
  --results_dir ./results
```
For OpenAI API models such as gpt-4o, run the same script (no local serving is needed):

```bash
# --model_name: an API model name (refer to inference_call.py)
# The remaining flags are the same as above.
python inference_call.py \
  --model_name gpt-4o \
  --output_name gpt_output \
  --chain_type forward \
  --question_type single \
  --k 10 \
  --results_dir ./results
```
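Calls to gpt-4o go through the openai client, which by default reads the API key from the environment; make sure it is set before running (how inference_call.py handles credentials is an assumption, so check the script):

```bash
# The openai Python client reads OPENAI_API_KEY from the environment by default.
export OPENAI_API_KEY="your-key-here"
```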
Note that `results_dir` in `evaluate.py` must match the results directory used above.

```bash
python evaluate.py
```