🌐 Website | 📃 Paper | 𝕏 X (Twitter)
This is the official implementation for the paper "Unraveling Misinformation Propagation in LLM Reasoning", which explores how misinformation propagates in LLM reasoning.
We develop a pipeline to systematically evaluate misinformation in LLM reasoning, covering both its impact and its mitigation.
Takeaways:
- LLMs follow misinformation by default.
- LLMs fail to correct misinformation even when instructed to do so.
- Fine-tuning with early factual corrections is the most effective mitigation of misinformation propagation, but it still does not fully recover performance.
We recommend creating a new conda environment for running the code in this repository. Replace {your_path} in err_prop_usr_misinfo.yml with your actual conda installation path, then run the following commands:
conda env create -f err_prop_usr_misinfo.yml
conda activate err_prop_usr_misinfo
In this project, we use the OpenAI API, the Together AI API, and a Hugging Face User Access Token to access different language models. Follow their tutorials to create the API keys and put them in api_key/config.json. See 3.2 for the gpt-4o-mini-sft entry.
{
"openai_api_key": "...",
"togetherai_api_key": "...",
"huggingface_api_key": "...",
"gpt-4o-mini-sft": "..."
}
We start from the raw testing data saved in raw_data/{dataset_name}/test.jsonl. Preprocessing details are given in Appendix D, and the procedure is implemented in the get_init_df function in scripts/test_data_generation/main.py.
After preprocessing, we retain only questions where the equation generation model (gpt-4-0613) produces correct answers to ensure the reliability of ground-truth equations. Additionally, to exclude overly simple questions, we filter out those with fewer than 5 CoT steps in their solutions.
Then we simulate misinformation by generating correct and relevant equations and then perturbing them using common human error patterns. Details are in Section 3.2 and Appendix B.
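For intuition, the snippet below shows one way such a perturbation could look, corrupting the result of a correct intermediate equation. This is an illustrative sketch only; the actual error patterns used by the pipeline are those described in Section 3.2 and Appendix B.

```python
# Illustrative sketch only: the real perturbation logic lives in
# test_data_generation/main.py and follows the error patterns in Section 3.2 / Appendix B.
import random

def perturb_equation(equation: str, rng: random.Random) -> str:
    """Inject a simple calculation slip by shifting the right-hand side."""
    lhs, rhs = equation.rsplit("=", 1)
    value = float(rhs.strip())
    wrong = value + rng.choice([-2, -1, 1, 2])  # small offset mimicking a human slip
    wrong_str = str(int(wrong)) if wrong.is_integer() else str(wrong)
    return f"{lhs.strip()} = {wrong_str}"

print(perturb_equation("48 / 2 = 24", random.Random(0)))  # e.g. "48 / 2 = 26"
```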
To generate the processed testing data, run the following command:
python test_data_generation/main.py --dataset_names gsm8k math mathqa metamath --api_config_file_path api_key/config.json --sample_size 100 --temperature 0.7 --top_p 0.7 --top_k 50 --number_of_outputs 1
(The sample size is 100 per dataset, so the total number of questions is 100 * 4 datasets = 400.)
The processed data is saved in pcd_data/mix/test_400_perturbed.jsonl.
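A quick way to sanity-check the output is to load the JSONL file and count the records. The sketch below assumes nothing beyond the path given above; the exact schema is defined by the generation script.

```python
# Minimal sketch for inspecting the processed test set.
import json

path = "pcd_data/mix/test_400_perturbed.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

print(len(records))               # expected: 400 (100 questions x 4 datasets)
print(sorted(records[0].keys()))  # inspect the available fields
```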
First, we analyze the impact of misinformation on final answers and reasoning behaviors of LLMs, including:
- Default behaviors under misinformation (Section 5.1):
- We analyze LLMs' default behavior under misinformation, and we also explicitly instruct the models to follow the misinformation.
- By default, LLMs treat misinformation as instructions and follow them, leading to incorrect answers.
- Instructing LLMs to correct misinformation (Section 5.2):
- We explicitly instruct LLMs to correct misinformation.
- LLMs still struggle to correct misinformation, even with explicit instructions.
Then, we explore how to mitigate misinformation via correction, including:
- Factors of effective correction (Section 6.1):
- We analyze the factors that affect the effectiveness of correction, including correction behaviors and positions of reasoning steps.
- We find that early factual corrections are effective in mitigating misinformation propagation.
- Fine-tuning with effective correction (Section 6.2):
- We fine-tune LLMs with effective correction behaviors to mitigate misinformation propagation.
- We find that fine-tuning with such correction behaviors significantly mitigates misinformation propagation.
PART I. Generate Reasoning Steps in Original and Misinformed Settings (Section 5)
Performance is evaluated under two conditions, without misinformation (original) and with misinformation (misinformed), to assess whether strong reasoning models can be misled. In each condition, we either provide no explicit instructions or instruct the models to follow or to correct the misinformation.
To collect models' reasoning results, run the following command:
python run_experiments.py --model_name {model_name} --dataset_name mix --sample_size 400 --temperature 0.7 --top_p 0.7 --top_k 50 --number_of_outputs 5 --api_config_file_path api_key/config.json
The {model_name} can be Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.2-11B-Vision-Instruct, Llama-3.2-90B-Vision-Instruct, Mixtral-8x7B-Instruct-v0.1, Mixtral-8x22B-Instruct-v0.1, Qwen2-72B-Instruct, or gpt-4o-mini. The prediction includes prfx_q (original performance, without misinformation), prfx_q_prfx_pert (misinformed performance, with misinformation), and prfx_q_prfx_pert_both (misinformed performance with explicit correction instruction). Results are saved in pcd_data/mix/test_400_perturbed_premise_{model_name}.jsonl.
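To run all of the models listed above in sequence, a small driver like the following can be used. It is a convenience sketch, not part of the repository; it simply shells out to run_experiments.py with the flags documented here.

```python
# Convenience sketch: run run_experiments.py for every model listed above,
# using only the flags documented in this README.
import subprocess

MODEL_NAMES = [
    "Llama-3.2-1B-Instruct",
    "Llama-3.2-3B-Instruct",
    "Llama-3.2-11B-Vision-Instruct",
    "Llama-3.2-90B-Vision-Instruct",
    "Mixtral-8x7B-Instruct-v0.1",
    "Mixtral-8x22B-Instruct-v0.1",
    "Qwen2-72B-Instruct",
    "gpt-4o-mini",
]

for model_name in MODEL_NAMES:
    subprocess.run(
        [
            "python", "run_experiments.py",
            "--model_name", model_name,
            "--dataset_name", "mix",
            "--sample_size", "400",
            "--temperature", "0.7",
            "--top_p", "0.7",
            "--top_k", "50",
            "--number_of_outputs", "5",
            "--api_config_file_path", "api_key/config.json",
        ],
        check=True,
    )
```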
PART II. Generate Reasoning Steps in the Controlled Analysis (Section 6.1)
To perform the controlled analysis, run the following command:
python run_experiments_intervention.py --model_name {model_name} --dataset_name mix --sample_size 400 --temperature 0.7 --top_p 0.7 --top_k 50 --number_of_outputs 5 --api_config_file_path api_key/config.json
The {model_name} can be Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, or Llama-3.2-11B-Vision-Instruct. The prediction includes prfx_q, prfx_q_prfx_pert, and prfx_q_prfx_pert_both, the same as in run_experiments.py, plus prfx_q_prfx_pert_pos (misinformed performance with factual corrections at different positions), prfx_q_prfx_pert_corr_bad (misinformed performance with non-factual corrections), and prfx_q_prfx_pert_point_out_only_bad (misinformed performance with no corrections). Results are saved in pcd_data/mix/test_400_perturbed_premise_{model_name}_correction.jsonl.
PART III. Evaluate Reasoning Steps with Verifiers
We implement several automatic verifiers to analyze the reasoning behaviors of different models. To evaluate the correction and misinformation-following behaviors, run the following command:
python scripts/run_evaluation.py --model_names "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo" "Qwen/Qwen2-72B-Instruct" "mistralai/Mixtral-8x7B-Instruct-v0.1" "mistralai/Mixtral-8x22B-Instruct-v0.1" "gpt-4o-mini" "meta-llama/Llama-3.2-1B-Instruct" "meta-llama/Llama-3.2-3B-Instruct" "meta-llama/Llama-3.2-11B-Vision-Instruct" --output_path "./exp_results/eval/test_400_perturbed_premise_evaluated.pkl"
To evaluate the correction behaviors for the small model (Llama-3.2-1B) only (Appendix G), run the following command:
python scripts/run_evaluation.py --model_names "meta-llama/Llama-3.2-1B-Instruct" --output_path "./exp_results/eval/test_400_perturbed_premise_evaluated_1b.pkl"
We also ask three annotators to manually evaluate the correction behaviors to validate the effectiveness of the automatic verifiers. Follow demonstration/annotation.ipynb; the gathered results are in exp_results/eval/test_400_perturbed_premise_evaluated_annotated_final.pkl.
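The verifier outputs are pickle files. A minimal way to inspect them is sketched below, assuming (this is not guaranteed by the repository) that the stored object is pandas-readable.

```python
# Minimal sketch for inspecting the verifier output; the exact object type
# stored in the .pkl is not documented here, so check what comes back first.
import pandas as pd

eval_results = pd.read_pickle("./exp_results/eval/test_400_perturbed_premise_evaluated.pkl")
print(type(eval_results))
if isinstance(eval_results, pd.DataFrame):
    print(eval_results.columns.tolist())
    print(eval_results.head())
```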
PART IV. Visualize the Results
First precompute the bootstrap results:
python scripts/precompute_bootstrap_results.py
Results are in exp_results/.
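For background, precomputing bootstrap statistics for an accuracy metric typically amounts to a percentile bootstrap over per-question correctness, roughly as sketched below; the repository's scripts/precompute_bootstrap_results.py may differ in its details.

```python
# Illustrative percentile-bootstrap sketch for an accuracy confidence interval;
# the repository's own implementation may differ.
import numpy as np

def bootstrap_accuracy_ci(correct_flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy over 0/1 per-question correctness flags."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(correct_flags)
    resample_means = np.array([
        rng.choice(flags, size=flags.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    low, high = np.percentile(resample_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return flags.mean(), (low, high)

acc, (low, high) = bootstrap_accuracy_ci([1, 0, 1, 1, 0, 1, 1, 0])
print(f"accuracy={acc:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```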
Then follow demonstration/result_visualization.ipynb to plot figures, which are saved in figures/.
PART I. Fine-tune the Model
To evaluate model performance corresponding to Section 6.2, first run the Fine-tuning part of demonstration/finetune.ipynb. This fine-tunes a GPT-4o-mini model and returns a model id; save the id under the gpt-4o-mini-sft entry in api_key/config.json.
PART II. Generate Reasoning Steps in the Fine-tuning Analysis
With the saved id, run the following command:
python run_experiments_finetune.py --model_name gpt-4o-mini-sft --dataset_name mix --sample_size 400 --temperature 0.7 --top_p 0.7 --top_k 50 --number_of_outputs 5 --api_config_file_path api_key/config.json
The prediction includes base_original (original performance), base_misinformed (misinformed performance), inst_original (original performance with explicit instructions), inst_misinformed (misinformed performance with explicit instructions), ft_original (original performance with fine-tuning), ft_misinformed (misinformed performance with fine-tuning), inst_ft_original (original performance with both explicit instructions and fine-tuning), and inst_ft_misinformed (misinformed performance with both explicit instructions and fine-tuning). Results are saved in pcd_data/mix/test_400_perturbed_premise_{model_name}.jsonl. Note that the correction-specific training data is saved in pcd_data/finetune/correction_training_set.jsonl. We will release the data collection code in the future.
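For reference, the fine-tuned model is addressed through the OpenAI API via the model id stored under gpt-4o-mini-sft in api_key/config.json. The sketch below shows what such a call looks like in isolation; it is not the repository's actual prompting code, and the example question is made up.

```python
# Minimal sketch of calling the fine-tuned model by the id stored in the config;
# run_experiments_finetune.py handles prompting and sampling itself.
import json
from openai import OpenAI

with open("api_key/config.json") as f:
    config = json.load(f)

client = OpenAI(api_key=config["openai_api_key"])
response = client.chat.completions.create(
    model=config["gpt-4o-mini-sft"],  # fine-tuned model id saved in PART I above
    messages=[{"role": "user", "content": "Solve: 48 / 2 = ?"}],  # made-up example prompt
)
print(response.choices[0].message.content)
```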
PART III. Visualize the Results
Follow the Plot Sankey Graph part in demonstration/finetune.ipynb.
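If you want to prototype the figure outside the notebook, a toy plotly Sankey with made-up node labels and flow counts looks like the sketch below; the actual graph is built in demonstration/finetune.ipynb from the experiment results.

```python
# Toy Sankey sketch with hypothetical labels and values, purely for illustration.
import plotly.graph_objects as go

labels = ["misinformed input", "followed", "corrected", "wrong answer", "correct answer"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20, thickness=16),
    link=dict(
        source=[0, 0, 1, 1, 2, 2],      # indices into `labels`
        target=[1, 2, 3, 4, 3, 4],
        value=[60, 40, 50, 10, 5, 35],  # hypothetical counts
    ),
))
fig.write_html("figures/sankey_sketch.html")
```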
If you find our work useful in your research, please consider citing:
@inproceedings{feng-etal-2025-unraveling,
title = "Unraveling Misinformation Propagation in {LLM} Reasoning",
author = "Feng, Yiyang and
Wang, Yichen and
Cui, Shaobo and
Faltings, Boi and
Lee, Mina and
Zhou, Jiawei",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.627/",
pages = "11683--11707",
ISBN = "979-8-89176-335-7",
abstract = "Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning, positioning them as promising tools for supporting human problem-solving. However, what happens when their performance is affected by *misinformation*, i.e., incorrect inputs introduced by users due to oversights or gaps in knowledge? Such misinformation is prevalent in real-world interactions with LLMs, yet how it propagates within LLMs' reasoning process remains underexplored. Focusing on mathematical reasoning, we present a comprehensive analysis of how misinformation affects intermediate reasoning steps and final answers. We also examine how effectively LLMs can correct misinformation when explicitly instructed to do so. Even with explicit instructions, LLMs succeed less than half the time in rectifyingmisinformation, despite possessing correct internal knowledge, leading to significant accuracy drops (10.02{\%} {--} 72.20{\%}), and the degradation holds with thinking models (4.30{\%} {--} 19.97{\%}). Further analysis shows that applying factual corrections early in the reasoning process most effectively reduces misinformation propagation, and fine-tuning on synthesized data with early-stage corrections significantly improves reasoning factuality. Our work offers a practical approach to mitigating misinformation propagation."
}