Keyu Wang*, Jin Li*, Shu Yang, Zhuoran Zhang, Di Wang
(*Contribute equally)
LLMs often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs through:
- Behavioral Analysis: Simple opinion statements reliably induce sycophancy, whereas user expertise framing has negligible impact
- Mechanistic Analysis: Two-stage emergence via (1) late-layer output preference shift and (2) deeper representational divergence
- Perspective Analysis: First-person prompts ("I believe...") induce higher sycophancy than third-person framings ("They believe...")
Our findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers.
.
βββ experiments/
β βββ behavioral_analysis/ # Section: "User Opinion Induces Sycophancy"
β β βββ run_syco.py # Main behavioral experiments
β β βββ run_syco.slurm
β βββ mechanistic_analysis/ # Section: "Mechanistic Analysis"
β β βββ run_syco_logit_cot.py # Logit-lens & activation patching
β β βββ run_syco_logit_cot.slurm
β βββ data_generation/ # Prefix & prompt generation
β βββ generate_prefixes.py
β βββ full_question_builder.py
β βββ apply_prefixes.py
βββ utils/
β βββ DataFrameAligner.py
β βββ SycophancyAnalysis.py
β βββ EarlyDecodingAnalysis.py
βββ lib/ # Dataset library
β βββ plain/ # Plain baseline questions
β βββ opinion_only/ # Opinion-only questions
β β βββ prefix/ # With prefix
β β βββ suffix/ # With suffix
β βββ pov/ # Perspective (1st/3rd person)
β βββ prefix/
β β βββ first_pov/ # First-person prompts
β β βββ third_pov/ # Third-person prompts
β βββ suffix/
β βββ first_pov/
β βββ three_pov/
βββ csv/
βββ requirements.txt
βββ README.md
βββ DATA_STRUCTURE.md # Detailed data organization guide
Note: For a detailed explanation of the
lib/data structure and file naming conventions, see DATA_STRUCTURE.md
# Clone the repository
git clone https://github.com/kaustpradalab/LLM-sycophancy.git
cd LLM-sycophancy
# Install dependencies
pip install -r requirements.txtThe experiments are organized by paper sections. Each section has standalone scripts that can be run independently.
Objective: Measure sycophancy rates across different prompt conditions (Plain, Opinion-only, Opinion + Expertise)
cd experiments/behavioral_analysis
# Plain questions (baseline)
python run_syco.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--question_type plain \
--input_filename ../../lib/plain/mmlu_plain.pkl
# Opinion-only questions
python run_syco.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--question_type opinion_only \
--input_filename ../../lib/opinion_only/prefix/mmlu_opinion_only.pkl
# Opinion + Expertise (First-person, Advanced)
python run_syco.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--question_type prefix_and_opinion \
--prefix_type academic \
--academic_level advanced \
--prefix_subtype original \
--input_filename ../../lib/pov/prefix/first_pov/mmlu_academic_opinion_advanced.pklKey Parameters:
--question_type:plain,opinion_only, orprefix_and_opinion--academic_level:beginner,intermediate, oradvanced--prefix_subtype:original(first-person) orthird_pov(third-person)
# Submit job to SLURM cluster
sbatch run_syco.slurm "meta-llama/Llama-3.1-8B-Instruct"
# Monitor job
squeue -u $USERPaper Findings (Figure 2, Figure 3):
- Opinion-only prompts induce ~63.7% average sycophancy rate
- Expertise level (Beginner/Intermediate/Advanced) has <4.4% impact
Objective: Understand how and when sycophancy emerges through layer-wise analysis
cd experiments/mechanistic_analysis
# Run with logit-lens analysis (all layers)
python run_syco_logit_cot.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--question_type opinion_only \
--inference_mode logit_only \
--inference_layer all \
--input_filename ../../lib/opinion_only/prefix/mmlu_opinion_only.pkl
# Run with specific layers (e.g., odd layers)
python run_syco_logit_cot.py \
--model_name Qwen/Qwen2.5-7B-Instruct \
--question_type plain \
--inference_layer odd \
--input_filename ../../lib/plain/mmlu_plain.pklKey Parameters:
--inference_mode:logit_only(just logits) orlogit_and_cot(+ chain-of-thought)--inference_layer:all,odd,even, orlast
Paper Findings (Figure 4, Figure 5, Figure 6):
- Decision score shift occurs at layers 16-19 (Llama 8B)
- KL divergence peaks at layer 23 (representational shift)
- Activation patching at critical layers reduces sycophancy by 36%
# Layer-wise decision tracking
python run_early_decoding.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--input_file ../../lib/plain/mmlu_plain.pklObjective: Compare first-person ("I believe") vs. third-person ("They believe") prompts
cd experiments/behavioral_analysis
# First-person (1st POV)
python run_syco.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--question_type prefix_and_opinion \
--prefix_type academic \
--academic_level advanced \
--prefix_subtype original \
--input_filename ../../lib/pov/prefix/first_pov/mmlu_academic_opinion_advanced.pkl
# Third-person (3rd POV)
python run_syco.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--question_type prefix_and_opinion \
--prefix_type academic \
--academic_level advanced \
--prefix_subtype third_pov \
--input_filename ../../lib/pov/prefix/third_pov/mmlu_academic_opinion_advanced.pklPaper Findings (Figure 8, Figure 9, Figure 10):
- First-person prompts induce 13.6% higher sycophancy on average
- KL divergence shows first-person creates stronger representational shifts
- Cosine similarity reveals orthogonal encoding of 1st vs. 3rd person perspectives
We tested 7 model families across similar parameter sizes:
| Model | Parameters | Family |
|---|---|---|
| Llama-3.1-8B-Instruct | 8B | Meta |
| Llama-3.2-1B, 3B | 1B, 3B | Meta |
| Qwen2.5-1.5B, 3B, 7B, 14B-Instruct | 1.5B-14B | Alibaba |
| Mistral-7B-Instruct-v0.3 | 7B | Mistral AI |
| Falcon-7B | 7B | TII |
| OLMoE-1B-7B-Instruct | 1B-7B | AI2 |
| OPT-6.7B | 6.7B | Meta |
| Pythia-6.9B | 6.9B | EleutherAI |
If you use this code or build upon our work, please cite:
@inproceedings{wang2026sycophancy,
title={When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models},
author={Wang, Keyu and Li, Jin and Yang, Shu and Zhang, Zhuoran and Wang, Di},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2026}
}Sycophantic behavior in LLMs is primarily triggered by the presence of a user opinion, regardless of the user's claimed expertise or authority.
Sycophancy emerges in two stages: (1) late-layer output preference shift compared to Plain, then (2) deep representational divergence, confirming opinion framing overrides learned knowledge both behaviorally and internally.
Expertise-level framing fails to influence behavior because models do not encode it internally: opinion prompts form distinct clusters while level prompts overlap, indicating expertise cues are ignored representationally.
Grammatical person is a key driver of sycophancy in LLMs. Changing prompts from first- to third-person framing substantially reduces sycophantic behavior, with this effect encoded deep within the model's representations.
This work was supported by KAUST funding BAS/1/1689-01-01, KAUST Center of Excellence for Generative AI (award 5940), and a gift from Google.
Special thanks to the AAAI 2026 reviewers for their valuable feedback.