🔥 Must-read papers for harmful fine-tuning attacks/defenses for LLMs.
💫 Continuously updated on a weekly basis. (last update: 2025/06/07)
🔥 Good news: 7 harmful fine-tuning related papers were accepted by NeurIPS2024.
💫 We have updated our survey to include a discussion of the 17 new ICLR2025 submissions.
🔥 We have updated our slides introducing harmful fine-tuning attacks/defenses. Check out the slides here.
🔥 Good news: 12 harmful fine-tuning related papers were accepted by ICLR2025. PS: For those not selected this time, I know how it feels to look at this accepted list, but please stay strong, because no one can really take you down if you believe in your own research.
- Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
- [2023/10/4] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models arXiv [paper] [code]
- [2023/10/5] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR 2024 [paper] [code]
- [2023/10/5] On the Vulnerability of Safety Alignment in Open-Access LLMs ACL2024 (Findings) [paper]
- [2023/10/22] Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases arXiv [paper]
- [2023/10/31] LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B SeT LLM workshop @ ICLR 2024 [paper]
- [2023/11/9] Removing RLHF Protections in GPT-4 via Fine-Tuning NAACL2024 [paper]
- [2024/4/1] What's in your "safe" data?: Identifying benign data that breaks safety COLM2024 [paper] [code]
- [2024/6/28] Covert malicious finetuning: Challenges in safeguarding LLM adaptation ICML2024 [paper]
- [2024/7/29] Can Editing LLMs Inject Harm? NeurIPS2024 [paper] [code]
- [2024/10/01] Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models arXiv [paper] [code]
- [2024/10/21] The effect of fine-tuning on language model toxicity NeurIPS2024 Safe GenAI workshop [paper]
- [2024/10/23] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks arXiv [paper]
- [2025/01/29] Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation arXiv [paper] [code]
- [2025/02/03] The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models arXiv [paper]
- [2025/02/20] Fundamental Limitations in Defending LLM Finetuning APIs arXiv [paper]
- [2025/02/26] No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data arXiv [paper]
- [2025/03/05] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs arXiv [paper]
- [2025/05/1] Tongue-Tied: Breaking LLMs Safety Through New Language Learning CALCS arXiv [paper]
- [2025/05/11] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety ICML2025 [paper] [code]
- [2024/2/2] Vaccine: Perturbation-aware alignment for large language model against harmful fine-tuning NeurIPS2024 [paper] [code]
- [2024/5/23] Representation noising effectively prevents harmful fine-tuning on LLMs NeurIPS2024 [paper] [code]
- [2024/5/24] Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation arXiv [paper] [code] [Openreview]
- [2024/8/1] Tamper-Resistant Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [paper] [code]
- [2024/9/3] Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation ICLR2025 [paper] [code] [Openreview]
- [2024/9/26] Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning NeurIPS2024 (for diffusion model) [paper]
- [2024/10/05] Identifying and Tuning Safety Neurons in Large Language Models ICLR2025 [Openreview]
- [2024/10/13] Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation arXiv [paper] [code]
- [2024/10/13] Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy NeurIPS2024 workshop SafeGenAi [paper]
- [2025/01/19] On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment arXiv [paper] [code]
- [2025/02/07] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond arXiv [paper]
- [2025/05/07] Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization arXiv [paper]
- [2025/05/18] Self-Destructive Language Model arXiv [paper]
- [2025/05/22] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning arXiv [paper] [code]
- [2025/05/22] Model Immunization from a Condition Number Perspective ICML2025 [paper] [code]
- [2025/06/04] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning ICML2025 [paper] [code]
- [2023/8/25] Fine-tuning can cripple your foundation model; preserving features may be the solution TMLR [paper] [code]
- [2023/9/14] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ICLR2024 [paper] [code]
- [2024/2/3] Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML2024 [paper] [code]
- [2024/2/7] Assessing the brittleness of safety alignment via pruning and low-rank modifications ME-FoMo@ICLR2024 [paper] [code]
- [2024/2/22] Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment NeurIPS2024 [paper] [code]
- [2024/2/28] Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates NeurIPS2024 [paper] [code]
- [2024/5/28] Lazy safety alignment for large language models against harmful fine-tuning NeurIPS2024 [paper] [code]
- [2024/6/10] Safety alignment should be made more than just a few tokens deep ICLR2025 [paper] [code] [Openreview]
- [2024/6/12] Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models ICLR2025 [paper] [Openreview]
- [2024/8/27] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models ICLR2025 [Openreview] [paper]
- [2024/8/30] Safety Layers in Aligned Large Language Models: The Key to LLM Security ICLR2025 [Openreview] [paper]
- [2024/10/05] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection ICLR2025 [Openreview]
- [2024/10/05] Safety Alignment Shouldn't Be Complicated preprint [Openreview]
- [2024/10/05] SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation ICLR2025 [paper] [Openreview]
- [2024/10/05] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning ICLR2025 [paper] [Openreview]
- [2024/10/13] Safety-Aware Fine-Tuning of Large Language Models NeurIPS 2024 Workshop on Safe Generative AI [paper]
- [2024/12/19] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response arXiv [paper]
- [2025/02/28] Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs arXiv [paper]
- [2025/03/03] Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness arXiv [paper]
- [2025/03/24] LookAhead Tuning: Safer Language Models via Partial Answer Previews arXiv [paper] [code]
- [2025/04/12] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function arXiv [paper] [code]
- [2025/04/14] Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? arXiv [paper]
- [2025/05/22] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization arXiv [paper] [code]
- [2025/05/22] Shape it Up! Restoring LLM Safety during Finetuning arXiv [paper]
- [2025/05/29] SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA arXiv [paper]
- [2024/2/19] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic ACL2024 [paper] [code]
- [2024/3/8] Defending Against Unforeseen Failure Modes with Latent Adversarial Training arXiv [paper] [code]
- [2024/5/15] A safety realignment framework via subspace-oriented model fusion for large language models KBS [paper] [code]
- [2024/5/23] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability NeurIPS2024 [paper] [code]
- [2024/5/27] Safe LoRA: the silver lining of reducing safety risks when fine-tuning large language models NeurIPS2024 [paper]
- [2024/8/18] Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning ICML2025 [paper]
- [2024/10/05] Locking Down the Finetuned LLMs Safety preprint [Openreview]
- [2024/10/05] Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models ICLR2025 [Openreview] [code]
- [2024/10/05] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models preprint [Openreview]
- [2024/12/15] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models arXiv [paper]
- [2024/12/17] NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI2025 [paper] [code]
- [2024/12/30] Enhancing AI Safety Through the Fusion of Low Rank Adapters arXiv [paper]
- [2025/02/01] Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation arXiv [paper] [repo]
- [2025/02/24] Safety Misalignment Against Large Language Models NDSS2025 [paper] [repo]
- [2025/03/06] SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging ICLR2025 (short paper) [paper] [repo]
- [2025/04/13] Alleviating the Fear of Losing Alignment in LLM Fine-tuning S&P2025 [paper] [repo]
- [2025/05/17] Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets ICML2025 [paper] [repo]
- [2024/5/25] No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks arXiv [paper]
- [2024/5/27] Navigating the safety landscape: Measuring risks in finetuning large language models NeurIPS2024 [paper]
- [2024/10/05] Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs preprint [Openreview]
- [2024/10/05] On Evaluating the Durability of Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [Code]
- [2024/11/13] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense arXiv [paper]
- [2025/2/3] Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities arXiv [paper]
- [2025/3/24] Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models arXiv [paper]
- [2024/6/15] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models ICLR2025 [paper] [Openreview]
- [2024/11/28] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning arXiv [paper]
If you find this repository useful, please cite our paper:
@article{huang2024harmful,
  title={Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey},
  author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
  journal={arXiv preprint arXiv:2409.18169},
  year={2024}
}
If you discover any papers that are suitable but not included, please contact Tiansheng Huang ([email protected]).
Please 🌟star🌟 our repository if you find it helpful!