Semantically Equivalent and Coherent Attacks (SECA)

This repository is the official implementation of the NeurIPS 2025 paper SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations, by Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, and René Vidal.


✨ Abstract

⚠️ Warning: This method may be misused for malicious purposes.

Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.

SECA Attack Example

Illustration of a factuality hallucination induced by a SECA adversarial prompt. The top two green boxes show the full attack prompt, based on an original MMLU question in elementary mathematics, followed by the target LLM's faithful and factual response. The bottom two blue boxes show a SECA-generated adversarial variant of the original prompt, with edits highlighted in red, and the corresponding target LLM explanation, whose hallucinated content is also highlighted in red. In this example, the model selects the incorrect choice ('B') and generates a hallucinated explanation.

📦 Requirements

To install requirements:

pip install -r requirements.txt

🤖 LLM Info

| LLM Name | Source / API Version |
| --- | --- |
| Llama-3-3B | https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct |
| Llama-3-8B | https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct |
| Llama-2-13B | https://huggingface.co/meta-llama/Llama-2-13b-chat-hf |
| Qwen-2.5-7B | https://huggingface.co/Qwen/Qwen2.5-7B-Instruct |
| Qwen-2.5-14B | https://huggingface.co/Qwen/Qwen2.5-14B-Instruct |
| GPT-4o-mini | gpt-4o-mini-2024-07-18 (API) |
| GPT-4.1-nano | gpt-4.1-nano-2025-04-14 (API) |
| GPT-4.1-mini | gpt-4.1-mini-2025-04-14 (API) |
| GPT-4.1 | gpt-4.1-2025-04-14 (API) |
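
The open-source targets above are standard Hugging Face checkpoints. The snippet below is a minimal sketch (not taken from this repository) of how one of them can be loaded and queried with the transformers library; the exact loading code used by SECA may differ.

```python
# Minimal sketch, not part of this repo: load one of the open-source target
# LLMs listed above (access to the gated meta-llama checkpoints is required).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # "Llama-3-3B" in the table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Query the model with a simple multiple-choice style prompt.
messages = [{"role": "user", "content": "Answer with a single letter. 2 + 2 = ?  A) 3  B) 4"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```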

📊 Dataset

Please download the MMLU dataset from https://huggingface.co/datasets/cais/mmlu. As described in the paper, we use a filtered subset of MMLU in which a prompt is included if and only if every target LLM assigns its highest confidence to the correct answer token. The exact questions used in our paper are listed in data/filtered_MMLU_index.json. Note that the questions from 'abstract_algebra', 'college_mathematics', and 'formal_logic' are merged into a single mathematics (MAT) category in the paper.
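
As an illustration, the sketch below shows one way to load an MMLU subject with the Hugging Face datasets library and restrict it to the filtered question indices. The assumed structure of data/filtered_MMLU_index.json (a mapping from subject name to a list of question indices) is a guess for illustration only; consult the file in the repository for its actual format.

```python
# Minimal sketch, not the repo's data pipeline. Assumption: the index file
# maps each MMLU subject to a list of kept question indices; check
# data/filtered_MMLU_index.json for the actual format.
import json
from datasets import load_dataset

with open("data/filtered_MMLU_index.json") as f:
    filtered_index = json.load(f)

# 'abstract_algebra', 'college_mathematics', and 'formal_logic' are merged
# into the mathematics (MAT) category in the paper.
subject = "abstract_algebra"
mmlu = load_dataset("cais/mmlu", subject, split="test")
filtered = mmlu.select(filtered_index[subject])

example = filtered[0]
print(example["question"], example["choices"], example["answer"])
```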

🚀 Demo

This repository includes a demo notebook, demo.ipynb, showing how SECA attacks Llama-3-8B. The notebook provides a minimal implementation intended to help users understand the method. To reproduce our full results, please refer to Section 4 (Experiments) of the paper.

⚙️ Hyperparameters

Core Task Settings

| Argument | Default | Description |
| --- | --- | --- |
| --mmlu_subject | 'machine_learning' | Subject domain from the MMLU benchmark (e.g., philosophy, clinical_knowledge). |
| --mmlu_question_idx | 0 | Index of the MMLU question to attack. |
| --mmlu_dataset_split_type | 'test' | Dataset split to use (train, dev, or test). |
| --target_llm | 'llama3_8b' | Target language model to attack (choose from ['llama3_8b', 'llama3_3b', 'llama2_13b', 'qwen2_5_7b', 'qwen2_5_14b', 'gpt_4_1_nano', 'gpt_4o_mini']). |

SECA Configuration

| Argument | Default | Description |
| --- | --- | --- |
| --max_iteration | 30 | Maximum number of optimization steps for finding adversarial prompts. |
| --candidate_size_M | 3 | Number of candidate rephrasings proposed per iteration (see Line 6 in Algorithm 1). |
| --top_N_most_adversarial | 3 | Number of top adversarial candidates selected per iteration (see Line 10 in Algorithm 1). |
| --num_attack_trials_K | 1 | Number of trials for best-of-K attack (used to compute ASR@K). |
| --termination_confidence_threshold | 1.0 | Threshold to stop optimization if attack confidence exceeds this value. |

Other Settings

| Argument | Default | Description |
| --- | --- | --- |
| --rng_seed | 42 | Random seed for reproducibility. |
| --verbose | False | Enables verbose logging if specified. |
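
For reference, the sketch below mirrors the argument interface described in the tables above using argparse. The entry-point name (run_seca.py) and the exact parser layout are hypothetical; the tables above are the authoritative list of arguments and defaults.

```python
# Hypothetical sketch of the CLI described above; the real script in this
# repository may define its arguments differently.
import argparse

parser = argparse.ArgumentParser(description="SECA attack configuration (sketch)")

# Core task settings
parser.add_argument("--mmlu_subject", default="machine_learning")
parser.add_argument("--mmlu_question_idx", type=int, default=0)
parser.add_argument("--mmlu_dataset_split_type", default="test",
                    choices=["train", "dev", "test"])
parser.add_argument("--target_llm", default="llama3_8b",
                    choices=["llama3_8b", "llama3_3b", "llama2_13b",
                             "qwen2_5_7b", "qwen2_5_14b",
                             "gpt_4_1_nano", "gpt_4o_mini"])

# SECA configuration
parser.add_argument("--max_iteration", type=int, default=30)
parser.add_argument("--candidate_size_M", type=int, default=3)
parser.add_argument("--top_N_most_adversarial", type=int, default=3)
parser.add_argument("--num_attack_trials_K", type=int, default=1)
parser.add_argument("--termination_confidence_threshold", type=float, default=1.0)

# Other settings
parser.add_argument("--rng_seed", type=int, default=42)
parser.add_argument("--verbose", action="store_true")

args = parser.parse_args()
print(vars(args))
```

Under these assumptions, a typical invocation would look like python run_seca.py --mmlu_subject philosophy --target_llm qwen2_5_7b --verbose.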

📖 Citation

If you find our work useful, please consider citing our paper:

@inproceedings{liang2025seca,
  title={SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations},
  author={Liang, Buyun and Peng, Liangzu and Luo, Jinqi and Thaker, Darshan and Chan, Kwan Ho Ryan and Vidal, Ren{\'e}},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}

For convenience, you may also cite the arXiv version:

@article{liang2025seca,
  title={SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations},
  author={Liang, Buyun and Peng, Liangzu and Luo, Jinqi and Thaker, Darshan and Chan, Kwan Ho Ryan and Vidal, Ren{\'e}},
  journal={arXiv preprint arXiv:2510.04398},
  year={2025}
}

📬 Contact

For questions or bug reports, please open an issue on this repository or contact the authors.

📄 License

The code is licensed under the MIT License. See LICENSE for details.

🛡️ Disclaimer

The code and documentation in this repository are made available for research and educational purposes only, with no warranties or guarantees. Users are fully responsible for ensuring their work complies with applicable laws, regulations, and ethical standards. The authors disclaim any liability for misuse, damage, or harm resulting from the use of this material.
