Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability

Official Repository for the Research Paper

This repository contains the official implementation and experimental code for the paper "Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability" published in Transactions on Machine Learning Research (TMLR) 2025.

Key Findings

🔍 Literature Analysis: Analysis of 4,886 papers citing Chain of Thought reveals:

7,635 different benchmarks used across papers
No single benchmark used by more than 25% of evaluations
Significant fragmentation in evaluation approaches

📊 Experimental Results: Comprehensive evaluation across 6 inference-time methods, 5 models, and 10 benchmarks shows:

Methods effective on weaker models (e.g., Llama-3.1-8B) often fail to improve stronger models (e.g., GPT-4o, Claude-3.5-Sonnet)
Performance varies significantly across different benchmark domains
High computational costs (3.5x to 54x API calls) limit practical deployment

Repository Structure

├── evaluated_methods/          # Implementations of inference-time methods
├── literature_analysis/        # Automated analysis of 4,886 papers
├── data/                      # Benchmark datasets (see Data Directory section)
├── utils/                     # Utility functions and helpers
├── assets/                    # Figures and supplementary materials
└── README.md                  # This file

Methods Evaluated

Our framework includes implementations of six prominent inference-time methods:

Chain of Thought (Wei et al., 2022) - Step-by-step reasoning prompts
Self-Consistency (Wang et al., 2023) - Multiple reasoning paths with majority voting
ReAct (Yao et al., 2023) - Reasoning and acting with language models
Tree of Thoughts (Yao et al., 2024) - Tree-structured problem exploration
Graph of Thoughts (Besta et al., 2024) - Graph-based reasoning networks
LLM Multi-Agent Debate (Du et al., 2024) - Collaborative multi-agent reasoning

Models and Benchmarks

Models Tested

Advanced: GPT-4o, Claude-3.5-Sonnet
Widely-used: GPT-3.5-turbo
Open-weights: Llama-3.1-8B-Instruct, Mixtral-8x22B

Benchmarks

Mathematical Reasoning: GSM8K, GSM-Symbolic, AQuA, SVAMP
General Knowledge: MMLU, TruthfulQA
Domain-Specific: MedQA, LegalBench
Specialized Tasks: Sorting, Document Merging

Data Directory

For convenience, we have included portions of the benchmark datasets in the data/ directory. While the complete datasets are available from their original sources (Hugging Face or open source repositories), the included data allows for quick experimentation and testing.

The data/ directory contains the following benchmark datasets:

AQuA
Document Merging
GSM8K
GSM-Symbolic
LegalBench
MMLU
MedQA
SVAMP
Sorting 032
TruthfulQA

Note: These are partial datasets included for convenience. For complete datasets and the most up-to-date versions, please refer to:

Hugging Face Datasets
Original dataset repositories cited in our paper
The benchmark-specific documentation in each method's evaluation scripts

Installation

# Clone the repository
git clone https://github.com/mmjerge/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework

# Create conda environment
conda env create -f environment.yaml
conda activate llm-eval

Running Open Source Models with vLLM

For open source models, we use vLLM to serve models locally.

Resource Requirements

Different models have varying resource requirements:

Mixtral-8x22B: Requires 4 A100 GPUs, 8 cores, ~700 GB memory
Llama-3.1-8B-Instruct: Can run on a single A100 GPU with appropriate memory

Setting Up vLLM Server

For Mixtral-8x22B (Large Model)

# Serve Mixtral 8x22B with tensor parallelism across 4 GPUs
vllm serve <path-to-mixtral-8x22b> \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --max-model-len 4000 \
  --enforce-eager \
  --max-parallel-loading-workers 1

For Llama-3.1-8B-Instruct (Smaller Model)

# Serve Llama-3.1-8B on single GPU
vllm serve "meta-llama/Llama-3.1-8B-Instruct" \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95

Running Evaluations with vLLM

Once the vLLM server is running (default endpoint: http://localhost:8000/v1), you can run various inference methods:

ReAct Evaluation

python evaluated_methods/ReAct/evaluations/scripts/langchain_gsm8k_react_api_count.py \
  --model vllm \
  --vllm_url http://localhost:8000/v1 \
  --model_name <model-path> \
  --num_questions 150

Unaided Baseline

python evaluated_methods/unaided/scripts/huggingface_evaluate_all.py \
  --vllm-models <model-path> \
  --datasets "legal-bench-privacy_policy_qa" "medqa" \
  --samples 100 \
  --vllm-endpoint "http://localhost:8000/v1"

Chain of Thought

python evaluated_methods/chain-of-thought/scripts/medqa_cot.py \
  --providers huggingface \
  --huggingface-models <model-path> \
  --num-samples 150 \
  --debug \
  --output-dir medqa_results

Tree of Thoughts

python evaluated_methods/tree-of-thought-llm/run.py \
  --backend meta-llama/Llama-3.1-8B-Instruct \
  --temperature 0.7 \
  --task medqa \
  --method_generate sample \
  --method_evaluate value \
  --method_select greedy \
  --prompt_sample standard \
  --n_generate_sample 5 \
  --n_evaluate_sample 1 \
  --n_select_sample 1

Reproducing the Literature Analysis

Our paper includes a comprehensive analysis of 4,886 papers citing Chain of Thought (Wei et al., 2022). You can reproduce this analysis using the tools in the literature_analysis/ directory.

Prerequisites

To run the literature analysis, you'll need:

OpenAI API Token (required) - Used for GPT-4o automated analysis of papers
Semantic Scholar API Token (optional but recommended) - For enhanced paper retrieval and metadata

Setup API Keys

# Set your OpenAI API key
export OPENAI_API_KEY="your_openai_api_key_here"

# Set your Semantic Scholar API key (optional)
export SEMANTIC_SCHOLAR_API_KEY="your_semantic_scholar_api_key_here"

Research Contributions

1. Comprehensive Literature Survey

Automated analysis of 4,886 papers using GPT-4o
Systematic categorization of evaluation practices
Identification of evaluation fragmentation issues

2. Systematic Experimental Evaluation

First comprehensive comparison across multiple state-of-the-art models
Evaluation on diverse benchmark suite including novel domains
Cost analysis revealing practical deployment challenges

3. Reproducibility Assessment

Direct comparison with original paper results
Documentation of reproducibility challenges
Recommendations for standardized evaluation protocols

Citation

If you use this work in your research, please cite:

@article{
  jerge2025pitfalls,
  title={Pitfalls in Evaluating Inference-time Methods for Improving {LLM} Reliability},
  author={Michael M. Jerge and David Evans},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=xeGWsmqFS8},
  note={Reproducibility Certification, Survey Certification}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions about the paper or code, please contact:

Michael Jerge: [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 324 Commits
__pycache__		__pycache__
assets		assets
data		data
evaluated_methods		evaluated_methods
literature_analysis		literature_analysis
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yaml		environment.yaml

Folders and files

Latest commit

History

Repository files navigation

Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability

Key Findings

Repository Structure

Methods Evaluated

Models and Benchmarks

Models Tested

Benchmarks

Data Directory

Installation

Running Open Source Models with vLLM

Resource Requirements

Setting Up vLLM Server

For Mixtral-8x22B (Large Model)

For Llama-3.1-8B-Instruct (Smaller Model)

Running Evaluations with vLLM

ReAct Evaluation

Unaided Baseline

Chain of Thought

Tree of Thoughts

Reproducing the Literature Analysis

Prerequisites

Setup API Keys

Research Contributions

1. Comprehensive Literature Survey

2. Systematic Experimental Evaluation

3. Reproducibility Assessment

Citation

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages