Official Repository for the Research Paper
This repository contains the official implementation and experimental code for the paper "Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability" published in Transactions on Machine Learning Research (TMLR) 2025.
🔍 Literature Analysis: Analysis of 4,886 papers citing Chain of Thought reveals:
- 7,635 different benchmarks used across papers
- No single benchmark used by more than 25% of evaluations
- Significant fragmentation in evaluation approaches
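The fragmentation figures above come down to counting how often each benchmark appears. The following is a minimal sketch of that kind of tally, not the code in literature_analysis/. The function name and the toy input are hypothetical, and the share is computed per paper, which may differ from the per-evaluation share reported above.

```python
from collections import Counter


def benchmark_shares(papers):
    """Return {benchmark: fraction of papers that use it}."""
    total = len(papers)
    if total == 0:
        return {}
    counts = Counter()
    for benchmarks in papers:
        # Count each benchmark once per paper, even if it is listed twice.
        counts.update(set(benchmarks))
    return {name: n / total for name, n in counts.items()}


# Hypothetical toy input: one list of benchmark names per paper.
toy_papers = [
    ["GSM8K", "MMLU"],
    ["GSM8K"],
    ["SVAMP"],
    ["TruthfulQA", "MedQA"],
]

shares = benchmark_shares(toy_papers)
top_name, top_share = max(shares.items(), key=lambda kv: kv[1])
print(f"{top_name}: {top_share:.0%} of papers, {len(shares)} distinct benchmarks")
```

A low maximum share combined with a large number of distinct benchmarks is the pattern described as fragmentation.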
📊 Experimental Results: Comprehensive evaluation across 6 inference-time methods, 5 models, and 10 benchmarks shows:
- Methods effective on weaker models (e.g., Llama-3.1-8B) often fail to improve stronger models (e.g., GPT-4o, Claude-3.5-Sonnet)
- Performance varies significantly across different benchmark domains
- High computational costs (3.5x to 54x API calls) limit practical deployment
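The cost multiplier in the last point is the number of API calls a method makes divided by the calls made when querying the model once per question. The sketch below illustrates that ratio under simplifying assumptions: one call per sampled reasoning path for Self-Consistency, and one call per agent per round for Multi-Agent Debate. The function names and the settings are hypothetical and do not reproduce the paper's measurements.

```python
def call_multiplier(method_calls, baseline_calls):
    """Ratio of API calls used by a method to calls used by the unaided baseline."""
    if baseline_calls <= 0:
        raise ValueError("baseline_calls must be positive")
    return method_calls / baseline_calls


def self_consistency_calls(num_questions, num_samples):
    # Assumption: each sampled reasoning path costs one API call.
    return num_questions * num_samples


def debate_calls(num_questions, num_agents, num_rounds):
    # Assumption: every agent makes one API call in every round.
    return num_questions * num_agents * num_rounds


# Hypothetical settings, chosen only to illustrate the arithmetic.
questions = 150
baseline = questions  # one call per question
sc_ratio = call_multiplier(self_consistency_calls(questions, 5), baseline)
debate_ratio = call_multiplier(debate_calls(questions, 3, 2), baseline)
print(f"self-consistency: {sc_ratio:.1f}x, debate: {debate_ratio:.1f}x")
```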
├── evaluated_methods/ # Implementations of inference-time methods
├── literature_analysis/ # Automated analysis of 4,886 papers
├── data/ # Benchmark datasets (see Data Directory section)
├── utils/ # Utility functions and helpers
├── assets/ # Figures and supplementary materials
└── README.md # This file
Our framework includes implementations of six prominent inference-time methods:
- Chain of Thought (Wei et al., 2022) - Step-by-step reasoning prompts
- Self-Consistency (Wang et al., 2023) - Multiple reasoning paths with majority voting
- ReAct (Yao et al., 2023) - Reasoning and acting with language models
- Tree of Thoughts (Yao et al., 2024) - Tree-structured problem exploration
- Graph of Thoughts (Besta et al., 2024) - Graph-based reasoning networks
- LLM Multi-Agent Debate (Du et al., 2024) - Collaborative multi-agent reasoning
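As an illustration of the majority-voting step in Self-Consistency, here is a minimal sketch. It assumes the final answers have already been extracted from each sampled reasoning path as strings. The function name and the normalization are hypothetical and are not taken from evaluated_methods/.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent normalized answer, or None if there are none."""
    # Normalize lightly so that " 18 " and "18" count as the same answer.
    votes = Counter(a.strip().lower() for a in answers if a and a.strip())
    if not votes:
        return None
    # Ties are broken in favor of the answer that was seen first.
    return votes.most_common(1)[0][0]


# Hypothetical final answers from five sampled reasoning paths.
sampled_answers = ["18", "18", "20", " 18 ", "17"]
print(majority_vote(sampled_answers))
```

Each extra sampled path adds one model call, which is why this method multiplies API usage.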
- Advanced: GPT-4o, Claude-3.5-Sonnet
- Widely-used: GPT-3.5-turbo
- Open-weights: Llama-3.1-8B-Instruct, Mixtral-8x22B
- Mathematical Reasoning: GSM8K, GSM-Symbolic, AQuA, SVAMP
- General Knowledge: MMLU, TruthfulQA
- Domain-Specific: MedQA, LegalBench
- Specialized Tasks: Sorting, Document Merging
For convenience, we have included portions of the benchmark datasets in the data/ directory. While the complete datasets are available from their original sources (Hugging Face or open source repositories), the included data allows for quick experimentation and testing.
The data/ directory contains the following benchmark datasets:
- AQuA
- Document Merging
- GSM8K
- GSM-Symbolic
- LegalBench
- MMLU
- MedQA
- SVAMP
- Sorting 032
- TruthfulQA
Note: These are partial datasets included for convenience. For complete datasets and the most up-to-date versions, please refer to:
- Hugging Face Datasets
- Original dataset repositories cited in our paper
- The benchmark-specific documentation in each method's evaluation scripts
# Clone the repository
git clone https://github.com/mmjerge/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
# Create conda environment
conda env create -f environment.yaml
conda activate llm-eval
For open-source models, we use vLLM to serve models locally.
Different models have varying resource requirements:
- Mixtral-8x22B: Requires 4 A100 GPUs, 8 cores, ~700 GB memory
- Llama-3.1-8B-Instruct: Can run on a single A100 GPU with appropriate memory
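A rough way to see why the GPU counts differ is to estimate the memory needed for the weights alone. The sketch below assumes bfloat16 weights (2 bytes per parameter), 80 GB A100 cards, and approximate parameter counts of 141B for Mixtral-8x22B and 8B for Llama-3.1-8B. It ignores the KV cache, activations, and framework overhead, so it gives a lower bound rather than a sizing guide.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, gpu_memory_gb=80, bytes_per_param=2):
    """Smallest number of GPUs whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Approximate parameter counts (assumptions, not taken from this repository).
MIXTRAL_8X22B_PARAMS = 141e9
LLAMA_3_1_8B_PARAMS = 8e9

for name, params in [("Mixtral-8x22B", MIXTRAL_8X22B_PARAMS),
                     ("Llama-3.1-8B", LLAMA_3_1_8B_PARAMS)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights, "
          f"at least {min_gpus_for_weights(params)} GPU(s)")
```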
# Serve Mixtral 8x22B with tensor parallelism across 4 GPUs
vllm serve <path-to-mixtral-8x22b> \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 4000 \
--enforce-eager \
--max-parallel-loading-workers 1
# Serve Llama-3.1-8B on a single GPU
vllm serve "meta-llama/Llama-3.1-8B-Instruct" \
--max-model-len 8192 \
--gpu-memory-utilization 0.95
Once the vLLM server is running (default endpoint: http://localhost:8000/v1), you can run various inference methods:
python evaluated_methods/ReAct/evaluations/scripts/langchain_gsm8k_react_api_count.py \
--model vllm \
--vllm_url http://localhost:8000/v1 \
--model_name <model-path> \
--num_questions 150
python evaluated_methods/unaided/scripts/huggingface_evaluate_all.py \
--vllm-models <model-path> \
--datasets "legal-bench-privacy_policy_qa" "medqa" \
--samples 100 \
--vllm-endpoint "http://localhost:8000/v1"
python evaluated_methods/chain-of-thought/scripts/medqa_cot.py \
--providers huggingface \
--huggingface-models <model-path> \
--num-samples 150 \
--debug \
--output-dir medqa_results
python evaluated_methods/tree-of-thought-llm/run.py \
--backend meta-llama/Llama-3.1-8B-Instruct \
--temperature 0.7 \
--task medqa \
--method_generate sample \
--method_evaluate value \
--method_select greedy \
--prompt_sample standard \
--n_generate_sample 5 \
--n_evaluate_sample 1 \
--n_select_sample 1
Our paper includes a comprehensive analysis of 4,886 papers citing Chain of Thought (Wei et al., 2022). You can reproduce this analysis using the tools in the literature_analysis/ directory.
To run the literature analysis, you'll need:
- OpenAI API Token (required) - Used for GPT-4o automated analysis of papers
- Semantic Scholar API Token (optional but recommended) - For enhanced paper retrieval and metadata
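A script that needs these tokens can read them from the environment and fail early if the required one is missing. The sketch below assumes the environment variable names used in this README and shows one way to build a citations request against the public Semantic Scholar Graph API. The helper names, the requested fields, and the use of the arXiv identifier for the Chain of Thought paper are illustrative assumptions, not a description of the scripts in literature_analysis/.

```python
import os
import urllib.parse
import urllib.request

# Public Semantic Scholar Graph API route for the papers that cite a given paper.
S2_CITATIONS_URL = "https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations"


def load_api_keys(env=os.environ):
    """Return (openai_key, semantic_scholar_key); the second may be None."""
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise RuntimeError("OPENAI_API_KEY is required for the literature analysis")
    return openai_key, env.get("SEMANTIC_SCHOLAR_API_KEY")


def build_citations_request(paper_id, s2_key=None, offset=0, limit=100):
    """Build (but do not send) a request for one page of citing papers."""
    query = urllib.parse.urlencode(
        {"fields": "title,year", "offset": offset, "limit": limit}
    )
    path_id = urllib.parse.quote(paper_id, safe=":")
    url = S2_CITATIONS_URL.format(paper_id=path_id) + "?" + query
    # The API key is optional; it is sent in the x-api-key header when present.
    headers = {"x-api-key": s2_key} if s2_key else {}
    return urllib.request.Request(url, headers=headers)


# Example usage (not executed here because it needs real keys and network access):
#   openai_key, s2_key = load_api_keys()
#   request = build_citations_request("ARXIV:2201.11903", s2_key=s2_key)
#   with urllib.request.urlopen(request) as response:
#       page = response.read()
```

Separating request construction from sending keeps the key handling easy to check without making network calls.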
# Set your OpenAI API key
export OPENAI_API_KEY="your_openai_api_key_here"
# Set your Semantic Scholar API key (optional)
export SEMANTIC_SCHOLAR_API_KEY="your_semantic_scholar_api_key_here"
- Automated analysis of 4,886 papers using GPT-4o
- Systematic categorization of evaluation practices
- Identification of evaluation fragmentation issues
- First comprehensive comparison across multiple state-of-the-art models
- Evaluation on diverse benchmark suite including novel domains
- Cost analysis revealing practical deployment challenges
- Direct comparison with original paper results
- Documentation of reproducibility challenges
- Recommendations for standardized evaluation protocols
If you use this work in your research, please cite:
@article{
jerge2025pitfalls,
title={Pitfalls in Evaluating Inference-time Methods for Improving {LLM} Reliability},
author={Michael M. Jerge and David Evans},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=xeGWsmqFS8},
note={Reproducibility Certification, Survey Certification}
}
This project is licensed under the MIT License - see the LICENSE file for details.
For questions about the paper or code, please contact:
- Michael Jerge: [email protected]