This repository is the official implementation of the NeurIPS 2025 paper SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations, by Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, and René Vidal.
Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
Figure: Illustration of a factuality hallucination induced by a SECA adversarial prompt. The top two green boxes show the full prompt based on the original MMLU question in elementary mathematics, followed by the faithful and factual response from the target LLM. The bottom two blue boxes present a SECA-generated adversarial variant of the original prompt, with edits highlighted in red, and the corresponding target LLM explanation, which includes red-highlighted hallucinated content. In this example, the model selects the incorrect choice ('B') and generates a hallucinated explanation, illustrating a factuality hallucination.
To install requirements:
```bash
pip install -r requirements.txt
```
LLM Name | Source / API Version |
---|---|
Llama-3-3B | https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct |
Llama-3-8B | https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct |
Llama-2-13B | https://huggingface.co/meta-llama/Llama-2-13b-chat-hf |
Qwen-2.5-7B | https://huggingface.co/Qwen/Qwen2.5-7B-Instruct |
Qwen-2.5-14B | https://huggingface.co/Qwen/Qwen2.5-14B-Instruct |
GPT-4o-mini | gpt-4o-mini-2024-07-18 (API) |
GPT-4.1-nano | gpt-4.1-nano-2025-04-14 (API) |
GPT-4.1-mini | gpt-4.1-mini-2025-04-14 (API) |
GPT-4.1 | gpt-4.1-2025-04-14 (API) |
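The open-source models above can be loaded locally from Hugging Face, while the GPT models are accessed through the OpenAI API using the version strings listed. Below is a minimal sketch of loading one open-source target model with `transformers` (assuming `torch` and `transformers` are installed and the gated Llama license has been accepted on Hugging Face); it is illustrative only and not the repository's loading code.

```python
# Minimal sketch (not the repository's actual loading code): load one of the
# open-source target LLMs from the table above via Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # "Llama-3-8B" in the table

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit on a single GPU
    device_map="auto",
)

# The GPT-4o-mini / GPT-4.1 rows are commercial models reached through the
# OpenAI API (e.g., model="gpt-4o-mini-2024-07-18") rather than loaded locally.
```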
Please download the MMLU dataset from https://huggingface.co/datasets/cais/mmlu. As described in the paper, we create a filtered subset of MMLU in which a prompt is included if and only if all target LLMs assign the highest confidence to the correct answer token. Please refer to `data/filtered_MMLU_index.json` for the exact questions used in our paper. Note that we merge the questions from `abstract_algebra`, `college_mathematics`, and `formal_logic` into the mathematics (MAT) category in our paper.
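As an illustration, the filtered subset can be assembled with the `datasets` library. The sketch below assumes `data/filtered_MMLU_index.json` maps each subject name to a list of kept question indices in the chosen split; adjust it if the file's actual structure differs.

```python
# Sketch of assembling the filtered MMLU subset. Assumption: the JSON file maps
# each subject name to a list of question indices kept in the given split.
import json
from datasets import load_dataset

with open("data/filtered_MMLU_index.json") as f:
    filtered_index = json.load(f)

subject = "college_mathematics"  # merged into the mathematics (MAT) category
mmlu_test = load_dataset("cais/mmlu", subject, split="test")
filtered_questions = mmlu_test.select(filtered_index[subject])

print(filtered_questions[0]["question"])
print(filtered_questions[0]["choices"], filtered_questions[0]["answer"])
```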
This repository includes a demo notebook, `demo.ipynb`, that demonstrates how SECA attacks Llama-3-8B. The notebook provides a minimal implementation to help users understand the method. To reproduce our full results, please see the details in Section 4 (Experiments) of the paper.
Core Task Settings
Argument | Default | Description |
---|---|---|
`--mmlu_subject` | `'machine_learning'` | Subject domain from the MMLU benchmark (e.g., `philosophy`, `clinical_knowledge`). |
`--mmlu_question_idx` | `0` | Index of the MMLU question to attack. |
`--mmlu_dataset_split_type` | `'test'` | Dataset split to use (`train`, `dev`, or `test`). |
`--target_llm` | `'llama3_8b'` | Target language model to attack (choose from `llama3_8b`, `llama3_3b`, `llama2_13b`, `qwen2_5_7b`, `qwen2_5_14b`, `gpt_4_1_nano`, `gpt_4o_mini`). |
SECA Configuration
Argument | Default | Description |
---|---|---|
`--max_iteration` | `30` | Maximum number of optimization steps for finding adversarial prompts. |
`--candidate_size_M` | `3` | Number of candidate rephrasings proposed per iteration (see Line 6 in Algorithm 1). |
`--top_N_most_adversarial` | `3` | Number of top adversarial candidates selected per iteration (see Line 10 in Algorithm 1). |
`--num_attack_trials_K` | `1` | Number of trials for the best-of-K attack (used to compute ASR@K). |
`--termination_confidence_threshold` | `1.0` | Stop the optimization once the attack confidence exceeds this threshold. |
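Together, these options drive the constraint-preserving search of Algorithm 1 in the paper. The sketch below is a simplified illustration of how they interact; `propose_rephrasings`, `satisfies_constraints`, and `attack_confidence` are hypothetical placeholders (the proposer that rewrites a prompt, the semantic-equivalence and coherence checks, and the target LLM's confidence in an incorrect answer), not functions from this repository.

```python
# Simplified illustration of the SECA search loop (not the repository's code).
# propose_rephrasings, satisfies_constraints, and attack_confidence are
# hypothetical placeholders described in the lead-in above.

def seca_search(original_prompt, max_iteration=30, candidate_size_M=3,
                top_N_most_adversarial=3, termination_confidence_threshold=1.0):
    pool = [original_prompt]
    best_prompt, best_conf = original_prompt, 0.0

    for _ in range(max_iteration):
        # Algorithm 1, Line 6: propose M candidate rephrasings per prompt in the pool.
        candidates = []
        for prompt in pool:
            candidates.extend(propose_rephrasings(prompt, num=candidate_size_M))

        # Keep only candidates that remain semantically equivalent and coherent.
        feasible = [c for c in candidates if satisfies_constraints(c, original_prompt)]
        if not feasible:
            break

        # Algorithm 1, Line 10: keep the top-N most adversarial feasible candidates.
        pool = sorted(feasible, key=attack_confidence, reverse=True)[:top_N_most_adversarial]

        top_conf = attack_confidence(pool[0])
        if top_conf > best_conf:
            best_prompt, best_conf = pool[0], top_conf
        if best_conf >= termination_confidence_threshold:
            break  # stop early once the attack confidence reaches the threshold

    return best_prompt, best_conf
```

The `--num_attack_trials_K` option presumably wraps this search in K independent trials, counting the attack as successful if any trial succeeds (ASR@K).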
Other Settings
Argument | Default | Description |
---|---|---|
`--rng_seed` | `42` | Random seed for reproducibility. |
`--verbose` | `False` | Enables verbose logging if specified. |
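Putting the arguments together, a typical invocation looks something like the following. The entry-point name `main.py` is a placeholder here; use the actual script (or the `demo.ipynb` notebook) provided in this repository.

```bash
# Hypothetical invocation; substitute the repository's actual entry-point script.
python main.py \
    --mmlu_subject machine_learning \
    --mmlu_question_idx 0 \
    --mmlu_dataset_split_type test \
    --target_llm llama3_8b \
    --max_iteration 30 \
    --candidate_size_M 3 \
    --top_N_most_adversarial 3 \
    --num_attack_trials_K 1 \
    --termination_confidence_threshold 1.0 \
    --rng_seed 42 \
    --verbose
```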
If you find our work useful, please consider citing our paper:
```bibtex
@inproceedings{liang2025seca,
  title={SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations},
  author={Liang, Buyun and Peng, Liangzu and Luo, Jinqi and Thaker, Darshan and Chan, Kwan Ho Ryan and Vidal, Ren{\'e}},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```
For convenience, you may also cite the arXiv version:
```bibtex
@article{liang2025seca,
  title={SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations},
  author={Liang, Buyun and Peng, Liangzu and Luo, Jinqi and Thaker, Darshan and Chan, Kwan Ho Ryan and Vidal, Ren{\'e}},
  journal={arXiv preprint arXiv:2510.04398},
  year={2025}
}
```
For questions or bug reports, please either:
- Open an issue in this GitHub repository, or
- Email Buyun Liang at:
byliang [at] seas [dot] upenn [dot] edu
The code is licensed under the MIT License. See LICENSE for details.
The code and documentation in this repository are made available for research and educational purposes only, with no warranties or guarantees. Users are fully responsible for ensuring their work complies with applicable laws, regulations, and ethical standards. The authors disclaim any liability for misuse, damage, or harm resulting from the use of this material.