A curated list of papers on safety in reasoning for Large Reasoning Models (LRMs). Research on reasoning safety for LRMs is still in its early stages; this repository tracks and documents advances in the field. Stay tuned for updates!
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran. [pdf] [repo] [data], 2025.2.17
- Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, Ling Liu. [pdf] [repo], 2025.3.1
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, Xin Eric Wang. [pdf], 2025.2.27(v3)
- BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack
Zihao Zhu, Hongbao Zhang, Mingda Zhang, Ruotong Wang, Guanzong Wu, Ke Xu, Baoyuan Wu. [pdf] [repo], 2025.2.16
- Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha. [pdf] [repo], 2025.2.18
- H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking
Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, Yiran Chen. [pdf] [repo], 2025.2.27(v2)
- Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, Nazneen Rajani. [pdf] [data], 2025.3.3
- GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi. [pdf] [data], 2025.1.31
- Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning
Ang Li, Yichuan Mo, Mingjie Li, Yifei Wang, Yisen Wang. [pdf], 2025.2.21(v2)
- OverThink: Slowdown Attacks on Reasoning LLMs
Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, Eugene Bagdasarian. [pdf], 2025.2.5(v2)
- o3-mini vs DeepSeek-R1: Which One is Safer?
Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura. [pdf], 2025.1.31(v2)
- Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies
Manojkumar Parmar, Yuvaraj Govindarajulu. [pdf], 2025.1.28
- MetaSC: Test-Time Safety Specification Optimization for Language Models
Víctor Gallego. [pdf], 2025.2.11
- Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment
Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Minhao Cheng, Dacheng Tao. [pdf], 2025.2.6
- Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models
Rubing Li, João Sedoc, Arun Sundararajan. [pdf], 2025.2.6
- Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models
Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, Evgeny Burnaev. [pdf], 2025.2.18
- Output Length Effect on DeepSeek-R1's Safety in Forced Thinking
Xuying Li, Zhuo Li, Yuji Kosuga, Victor Bian. [pdf], 2025.3.2
Reading lists related to LLM/LRM safety:
- LightChen233/Awesome-Long-Chain-of-Thought-Reasoning
- tjunlp-lab/Awesome-LLM-Safety-Papers
- ianitow/awesome-guardrails-llm-papers
- We acknowledge that some important works in this field may be missing from this list, and we warmly welcome contributions to help improve it!
- If you would like to promote your work or suggest other relevant papers, please feel free to open an issue or submit a pull request (PR). Your contributions are greatly appreciated, and we thank you in advance for helping enhance this resource!
- Special thanks to Awesome-Efficient-Reasoning, which inspired the structure and template of this project.