NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models

🤗 Dataset on HuggingFace
📄 Official Paper

NeedleChain is a benchmark designed to evaluate LLMs' intact long-context understanding. Every provided context consists entirely of query-relevant information, so answering the given query requires comprehending the context as a whole.


Data Generation

We place the sample data used in our experiments in the ./data folder. For those interested in building on our work, we provide all the code needed to flexibly configure the benchmark. Anyone can generate custom data by executing the command below.

python make_data.py \
--k 5 \                # number of needles in each chain
--n 200 \              # number of chains in each dataset
--val 1600 \           # standard salary value for each needle
--results_dir "./data" # directory to save the data
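
To make the chain structure concrete, below is a rough sketch of how a forward chain could be assembled. The sentence templates, field names, and relative-increment scheme here are illustrative assumptions; make_data.py defines the actual format.

import json
import random

# Hypothetical sketch: the first needle states an absolute salary, and each
# later needle is relative to the previous person, so recovering the final
# salary requires following the entire chain.
def build_forward_chain(k: int, val: int) -> dict:
    names = [f"Person_{i}" for i in range(k)]
    increments = [random.randint(1, 9) * 100 for _ in range(k - 1)]

    needles = [f"{names[0]}'s annual salary is {val} dollars."]
    salary = val
    for prev, cur, inc in zip(names, names[1:], increments):
        needles.append(f"{cur}'s annual salary is {inc} dollars more than {prev}'s.")
        salary += inc

    return {
        "context": " ".join(needles),  # every needle is query-relevant
        "question": f"What is {names[-1]}'s annual salary?",
        "answer": salary,
    }

print(json.dumps(build_forward_chain(k=5, val=1600), indent=2))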

Inference

Our codebase builds on the vLLM framework. We conducted our experiments on eight RTX A6000 GPUs, with the following package versions:

  • torch==2.6.0+cu124
  • vllm==0.8.5
  • transformers==4.53.1
  • openai==1.97.0
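
To reproduce this environment, an install along the following lines should work; the CUDA 12.4 PyTorch wheel index is our assumption, and vllm may pin its own torch build.

pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5 transformers==4.53.1 openai==1.97.0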

- Model serving (OpenAI-compatible): for HF models only

nohup python model_serve.py --model_name QwQ > logs/model &

For the available model_name options, refer to model_arg_dict in utils.py:

model_arg_dict = {
    'qwen2.5-32B':   'Qwen/Qwen2.5-32B-Instruct',
    'qwen3-32B':     'Qwen/Qwen3-32B',
    'llama3.3-70B':  'meta-llama/Llama-3.3-70B-Instruct',
    'llama3.1-DS':   'deepseek-ai/DeepSeek-R1-Distill-Llama-70B',
    'qwen2.5-DS':    'deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
    'qwen_long':     'Qwen/QwenLong-L1-32B',
    'QwQ':           'Qwen/QwQ-32B',
}
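
Once the server is running, it can be smoke-tested with the openai client. The port and dummy api_key below follow vLLM's OpenAI-server defaults and are assumptions; check model_serve.py for the actual values.

from openai import OpenAI

# Assumption: vLLM's OpenAI-compatible server listens on localhost:8000
# and accepts any api_key string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",  # the served HF model id from model_arg_dict
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.0,
)
print(response.choices[0].message.content)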

- Inference (for HF models, after model serving)

python inference_call.py \
--model_name QwQ \          # see model_arg_dict
--output_name QwQ_output \  # any desired output name
--chain_type forward \      # choice: [forward, backward, chaotic (mixed), parallel (needlestack)]
--question_type single \    # choice: [single, total]
--k 10 \                    # choice: [5, 10, 20, 50, 100, 200]
--results_dir ./results     # any desired results directory

- Inference (for API models such as gpt-4o)

python inference_call.py \
--model_name gpt-4o \       # see inference_call.py for supported API models
--output_name gpt_output \  # any desired output name
--chain_type forward \      # choice: [forward, backward, chaotic (mixed), parallel (needlestack)]
--question_type single \    # choice: [single, total]
--k 10 \                    # choice: [5, 10, 20, 50, 100, 200]
--results_dir ./results     # any desired results directory
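
To run the full grid in one go, a small driver script along these lines may help; the flag values mirror the choices documented above, while the output naming scheme is our own convention, not part of the repo.

import subprocess

# Hypothetical sweep over all chain types and needle counts for one model.
for chain_type in ["forward", "backward", "chaotic", "parallel"]:
    for k in [5, 10, 20, 50, 100, 200]:
        subprocess.run(
            [
                "python", "inference_call.py",
                "--model_name", "QwQ",
                "--output_name", f"QwQ_{chain_type}_k{k}",
                "--chain_type", chain_type,
                "--question_type", "single",
                "--k", str(k),
                "--results_dir", "./results",
            ],
            check=True,
        )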

Evaluation

Note that results_dir in evaluate.py must be set to the directory containing your inference results.

python evaluate.py
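
For reference, a minimal scorer could look like the sketch below; this is not the repo's evaluate.py, and the result-file layout and field names are assumptions.

import json
from pathlib import Path

# Illustrative exact-match scorer; the "prediction" / "answer" field names
# are assumptions, and evaluate.py defines the real metric and format.
results_dir = Path("./results")
correct = total = 0
for path in results_dir.glob("*.json"):
    for record in json.loads(path.read_text()):
        total += 1
        correct += str(record["answer"]) in str(record["prediction"])
print(f"accuracy: {correct / max(total, 1):.3f} over {total} examples")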
