
This repository contains the GSM-DC benchmark and evaluation framework for analyzing how Large Language Models (LLMs) are distracted by irrelevant context in math reasoning tasks. GSM-DC constructs symbolic DAG-based math problems with fine-grained distractor injections, enabling stepwise evaluation of reasoning robustness.


How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark (EMNLP 2025 Main)


GSM-DC is a synthetic data generator and evaluator developed to study the reasoning robustness of LLMs under irrelevant context.

Pipeline

Overview

This repository provides tools to:

  • Generate symbolic math problems as dependency graphs (DAGs)
  • Inject irrelevant context (IC) in a controlled manner
  • Render problems into natural language using GSM8K-style templates
  • Evaluate LLM responses at each reasoning step using a symbolic parser
  • Visualize graphs and reasoning pipelines

Example usage: see example_ic.ipynb


Dataset Structure and Pipeline

Each problem is represented as a tuple: (G', M, P, S)

  • G': Augmented graph with distractors
  • M: Natural language problem
  • P: Reasoning path
  • S: Ground-truth solution
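
For orientation, this tuple can be pictured as a simple container like the sketch below. The repository does not necessarily define such a class, and the field names and types here are assumptions; the actual fields live on the Problem object described next.

```python
from dataclasses import dataclass
import networkx as nx

@dataclass
class GSMDCExample:
    """Illustrative view of one GSM-DC problem (G', M, P, S)."""
    G_aug: nx.DiGraph    # G': dependency graph augmented with distractor nodes
    question: str        # M : natural-language problem text (GSM8K-style)
    path: list[str]      # P : the gold reasoning path, one entry per step
    solution: str        # S : ground-truth solution
```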

Problem and Graph

When calling id_gen.gen_prob(), a Problem instance is initialized:

  • Graph Class: Manages DAG construction
  • Problem Class: Adds parameters, values, entity names, and text rendering
  • Graph stored as id_gen.problem.G using NumPy boolean matrices
  • Nodes: tuples (i, j, k, l) with meanings:
    • RNG: (-1, 0, 0, 0)
    • Instance Parameter: (0, j, k, l)
    • Abstract Parameter: (1, j, k, l)
  • Dependency graph: id_gen.problem.template, a networkx.DiGraph
  • Value map: id_gen.problem.lookup
  • Entity names: id_gen.problem.N[i][j]
  • Drawing tools: id_gen.problem.draw()
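
A minimal sketch of inspecting these attributes, assuming `id_gen` has already been constructed as in example_ic.ipynb (constructor arguments are omitted, and the indices into `N` are illustrative):

```python
# Assumes `id_gen` was built as in example_ic.ipynb; constructor args omitted.
id_gen.gen_prob()                 # initializes the Problem instance
prob = id_gen.problem

print(prob.G)                     # DAG stored as NumPy boolean matrices
print(type(prob.template))        # networkx.DiGraph of parameter dependencies
print(prob.lookup)                # node -> value map (ground-truth quantities)
print(prob.N[0][0])               # entity name for indices i=0, j=0 (illustrative)
prob.draw()                       # visualize the graph
```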

Sample


Benchmarking LLM Robustness

We systematically benchmark six models to study the impact of irrelevant context (IC) on multi-step reasoning tasks:

  • Closed-source models: Grok-3-Beta, GPT-4.1, GPT-4o-mini
  • Open-source models: LLaMA-3.3-70B, LLaMA-3.1-8B, LLaMA-3.2-1B

Each model is evaluated using a five-shot prompting strategy, enhanced with a structured Background section that highlights necessary dependencies.

We vary the number of injected irrelevant nodes (m = 1–15) across four reasoning depths (rs = 2, 3, 4, 5) and compute:

  • Step Accuracy (SAcc): Are all reasoning steps correct?
  • Path Accuracy (PAcc): Is the full reasoning path valid?
  • Extraction Accuracy (EAcc): Is the final answer correct?

Each point is averaged over 100 generated problems per configuration.
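
A sketch of how these per-problem judgments could be averaged into the three metrics for one (m, rs) configuration; `step_correct`, `path_valid`, and `answer_correct` are placeholders for the symbolic parser's outputs, not names from the repository.

```python
def aggregate(results):
    """results: one dict per generated problem (100 per configuration)."""
    n = len(results)
    sacc = sum(all(r["step_correct"]) for r in results) / n    # every step correct
    pacc = sum(r["path_valid"] for r in results) / n           # full reasoning path valid
    eacc = sum(r["answer_correct"] for r in results) / n       # final answer correct
    return {"SAcc": sacc, "PAcc": pacc, "EAcc": eacc}
```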


Closed- and Open-Source Results

Figure: Step accuracy of six LLMs under increasing irrelevant context. Left: Grok-3-Beta, GPT-4.1, GPT-4o-mini (closed-source models); Right: LLaMA-3.3-70B, LLaMA-3.1-8B, LLaMA-3.2-1B (open-source models). Each curve represents a reasoning depth (rs ∈ {2, 3, 4, 5}).


Training the Process Reward Model (PRM)

The PRM is trained on a dataset labeled by true_correct() in tools/irr_tools_test.py. To create this dataset:

  • Run generate_dataset.py
  • For each generated problem:
    • Model generates stepwise CoT output
    • Each step is scored: steps after the first mistake are labeled as incorrect

Use prm_train.py to train the PRM on this dataset (the labeling rule is sketched below).
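
A sketch of the step-labeling rule described above: once the first incorrect step appears, every later step is labeled incorrect as well. `true_correct()` in tools/irr_tools_test.py is the actual checker; `step_is_correct` below stands in for its per-step verdicts.

```python
def label_steps(step_is_correct):
    """step_is_correct: list[bool], one entry per CoT step of the model output."""
    labels, seen_mistake = [], False
    for ok in step_is_correct:
        seen_mistake = seen_mistake or not ok
        labels.append(0 if seen_mistake else 1)   # 1 = correct, 0 = incorrect
    return labels

# label_steps([True, True, False, True]) -> [1, 1, 0, 0]
```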


Tree-of-Thoughts (ToT) Search with PRM

During evaluation, we use a PRM-guided Tree-of-Thought search (prm_tree.py).

  • N: Initial number of root paths
  • M: Beam width per path
  • K = N / M: Top-K continuations explored per step

This guided search improves robustness, especially under high IC.
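
The following is a minimal sketch of one PRM-guided expansion step under one plausible reading of the N / M / K scheme above; `propose_next_steps` (the LLM sampler) and `prm_score` (the trained PRM) are stand-ins, not functions from prm_tree.py.

```python
def tot_step(frontier, M, prm_score, propose_next_steps):
    """One PRM-guided expansion step.

    frontier: N partial reasoning paths (lists of step strings).
    The PRM keeps the top K = N // M paths; each proposes M continuations,
    so the frontier returns to roughly N candidates.
    """
    N = len(frontier)
    K = max(1, N // M)
    best = sorted(frontier, key=prm_score, reverse=True)[:K]   # PRM reranking
    return [path + [step]
            for path in best
            for step in propose_next_steps(path, M)]           # M proposals per path
```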

Tree Search


Evaluation Pipeline

To test your own LLM using the GSM-DC dataset:

  • Run test_batch_saved.py
  • Set MODEL_PATH and DATASET_PATH to your HuggingFace repo or checkpoint (to be released after paper submission and review)
  • Optionally enable tree search and PRM reranking
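
A hypothetical configuration sketch: the constant names below follow the description above, but their exact handling, and any flag for tree search, lives in test_batch_saved.py itself.

```python
# Hypothetical values; see test_batch_saved.py for the actual configuration.
MODEL_PATH = "your-org/your-model"          # HuggingFace repo id or local checkpoint
DATASET_PATH = "path/to/gsm_dc_problems"    # saved GSM-DC problems
USE_TREE_SEARCH = True                      # optional: PRM-guided ToT + reranking
```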

Acknowledgements

This project is inspired by and builds upon:

  • GSM8K by OpenAI
  • iGSM by Facebook Research
  • PRM by Stephen Diehl

(This repo is forked from iGSM and reuses its hierarchical entity vocabulary and graph-based QA construction.)


Citation

@misc{yang2025llmreasoningdistractedirrelevant,
      title={How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark}, 
      author={Minglai Yang and Ethan Huang and Liang Zhang and Mihai Surdeanu and William Wang and Liangming Pan},
      year={2025},
      eprint={2505.18761},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18761}, 
}
