With the rapid evolution of Large Language Models (LLMs), LLM-based agents and Multi-agent Systems (MAS) have significantly expanded the capabilities of AI ecosystems. However, this advancement has introduced more complex trustworthiness issues. This repository provides an overview of our survey paper "A Survey on Trustworthy LLM Agents: Threats and Countermeasures" and offers insights into threats, defenses, and evaluation techniques for LLM agents.
Title: A Survey on Trustworthy LLM Agents: Threats and Countermeasures
Authors: Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, et al.
Institutions: Squirrel AI Learning, Salesforce, The University of North Carolina, Nanyang Technological University, Rutgers University
Link: arXiv:2503.09648 (https://arxiv.org/abs/2503.09648)
- 📌 Introduction
- 📄 Survey
- 💎 Table of Contents
- ✨ Paper Supplement
- 🚀 Highlights
- 🏗️ TrustAgent Framework
- 📌 Key Topics
- 📖 Papers
- 🔍 Comparison with Previous Surveys
- 📥 Citation
- 📢 Contributing
- 📧 Contact
If you have additional articles, please:
👉 Click here to submit your article
We will continuously update the survey and appreciate your support and contribution!
- Introduces TrustAgent, a modular framework for analyzing the trustworthiness of LLM-based agents.
- Categorizes trust issues into intrinsic (brain, memory, tools) and extrinsic (user, agent, environment) aspects.
- Surveys attacks, defenses, and evaluation techniques in multi-agent systems.
- Provides a taxonomy of threats, including adversarial hijacking, unsafe action chains, privacy leakage, hallucinations, and fairness biases.
- Brain (LLM Reasoning Module): Jailbreak, Prompt Injection, Backdoor Attacks.
- Memory: Memory Poisoning, Privacy Leakage, Short-Term Memory Misuse.
- Tools: Tool Manipulation, Tool Abuse, Malicious API Calls.
- Agent-to-Agent: Cooperative Attacks, Infectious Attacks, MAS Security.
- Agent-to-User: Personalized Attacks, Transparency Issues, Trust Calibration.
- Agent-to-Environment: Safety in Robotics, Autonomous Driving, Digital Threats. (The module-to-threat mapping above is summarized in the illustrative sketch below.)
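To make the taxonomy above easier to navigate programmatically, here is a minimal, illustrative Python sketch of the module-to-threat mapping. The dictionary layout and the `threats_for` helper are assumptions made purely for illustration; they are not an API provided by the survey or this repository.

```python
# Illustrative sketch only: a plain-Python summary of the TrustAgent taxonomy
# described above. The dict layout and helper function are assumptions for
# illustration, not code shipped with the survey or its repository.

TRUST_TAXONOMY = {
    "intrinsic": {
        "brain": ["jailbreak", "prompt injection", "backdoor attacks"],
        "memory": ["memory poisoning", "privacy leakage", "short-term memory misuse"],
        "tools": ["tool manipulation", "tool abuse", "malicious API calls"],
    },
    "extrinsic": {
        "agent-to-agent": ["cooperative attacks", "infectious attacks", "MAS security"],
        "agent-to-user": ["personalized attacks", "transparency issues", "trust calibration"],
        "agent-to-environment": ["robotics safety", "autonomous driving", "digital threats"],
    },
}


def threats_for(module: str) -> list[str]:
    """Return the threat categories listed for a given module, e.g. 'memory'."""
    for scope in TRUST_TAXONOMY.values():
        if module in scope:
            return scope[module]
    raise KeyError(f"unknown module: {module}")


if __name__ == "__main__":
    # Prints: ['memory poisoning', 'privacy leakage', 'short-term memory misuse']
    print(threats_for("memory"))
```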
- "Describe, explain, plan and select: interactive planning with llms enables open-world multi-task agents" (2023), Zihao Wang et al. Paper
- "Certifying llm safety against adversarial prompting" (arXiv 2023), Kumar et al. Paper
- "Universal and transferable adversarial attacks on aligned language models" (arXiv 2023), Zou et al. Paper
- "Improved techniques for optimization-based jailbreaking on large language models" (arXiv 2024), Jia et al. Paper
- "AttnGCG: Enhancing jailbreaking attacks on LLMs with attention manipulation" (arXiv 2024), Wang et al. Paper
- "Mrj-agent: An effective jailbreak agent for multi-round dialogue" (arXiv 2024), Wang et al. Paper
- "PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage" (arXiv 2024), Nie et al. Paper
- "Evil geniuses: Delving into the safety of llm-based agents" (arXiv 2023), Tian et al. Paper
- "Pandora: Detailed llm jailbreaking via collaborated phishing agents with decomposed reasoning" (ICLR 2024), Chen et al. Paper
- "Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast" (ICML 2024), Gu et al. Paper
- "The wolf within: Covert injection of malice into mllm societies via an mllm operative" (arXiv 2024), Tan et al. Paper
- "Prompt Injection attack against LLM-integrated Applications" (arXiv 2023), Liu et al. Paper
- "Ignore previous prompt: Attack techniques for language models" (arXiv 2022), Perez et al. Paper
- "Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection" (ACM Workshop on Artificial Intelligence and Security 2023), Greshake et al. Paper
- "Automatic and universal prompt injection attacks against large language models" (arXiv 2024), Liu et al. Paper
- "Optimization-based prompt injection attack to llm-as-a-judge" (ACM SIGSAC 2024), Shi et al. Paper
- "Abusing images and sounds for indirect instruction injection in multi-modal LLMs" (arXiv 2023), Bagdasaryan et al. Paper
- "Breaking agents: Compromising autonomous llm agents through malfunction amplification" (arXiv 2024), Zhang et al. Paper
- "A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluations" (arXiv 2025), Zhou et al. Paper
- "Watch out for your agents! investigating backdoor threats to llm-based agents" (arXiv 2024), Yang et al. Paper
- "DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent" (arXiv 2025), Zhu et al. Paper
- "BLAST: A Stealthy Backdoor Leverage Attack against Cooperative Multi-Agent Deep Reinforcement Learning based Systems" (arXiv 2025), Yu et al. Paper

- "Moral Alignment for LLM Agents" (arXiv 2024), Tennant et al. Paper
- "LLM agents in interaction: Measuring personality consistency and linguistic alignment in interacting populations of large language models" (arXiv 2024), Frisch et al. Paper
- "Self-alignment of large language models via multi-agent social simulation" (ICLR 2024), Pang et al. Paper
- "Aligning llm agents by learning latent preference from user edits" (arXiv 2024), Gao et al. Paper
- "Large Language Model Assisted Multi-Agent Dialogue for Ontology Alignment" (AAMAS 2024), Zhang et al. Paper
- "Embedding-based classifiers can detect prompt injection attacks" (arXiv 2024), Ayub et al. Paper
- "SLM as Guardian: Pioneering AI Safety with Small Language Models" (arXiv 2024), Kwon et al. Paper
- "Struq: Defending against prompt injection with structured queries" (arXiv 2024), Chen et al. Paper
- "Shieldlm: Empowering llms as aligned, customizable and explainable safety detectors" (arXiv 2024), Zhang et al. Paper
- "Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning" (arXiv 2024), Xiang et al. Paper
- "AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration" (arXiv 2025), Chen et al. Paper
- "Improving factuality and reasoning in language models through multiagent debate" (ICML 2024), Du et al. Paper
- "Good Parenting is all you need--Multi-agentic LLM Hallucination Mitigation" (arXiv 2024), Kwartler et al. Paper
- "Autodefense: Multi-agent llm defense against jailbreak attacks" (arXiv 2024), Zeng et al. Paper

- "Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents" (arXiv 2024), Zhan et al. Paper
- "Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents" (NeurIPS 2024), Debenedetti et al. Paper
- "DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent" (arXiv 2025), Zhu et al. Paper
- "Redagent: Red teaming large language models with context-aware autonomous language agent" (arXiv 2024), Xu et al. Paper
- "Riskawarebench: Towards evaluating physical risk awareness for high-level planning of llm-based embodied agents" (arXiv 2024), Zhu et al. Paper
- "RedCode: Risky Code Execution and Generation Benchmark for Code Agents" (NeurIPS 2024), Guo et al. Paper
- "S-eval: Automatic and adaptive test generation for benchmarking safety evaluation of large language models" (arXiv 2024), Yuan et al. Paper
- "Bells: A framework towards future proof benchmarks for the evaluation of llm safeguards" (arXiv 2024), Dorn et al. Paper
- "Agent-SafetyBench: Evaluating the Safety of LLM Agents" (arXiv 2024), Zhang et al. Paper
- "Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents" (arXiv 2024), Zhang et al. Paper
- "Agentharm: A benchmark for measuring harmfulness of llm agents" (arXiv 2024), Andriushchenko et al. Paper
- "R-judge: Benchmarking safety risk awareness for llm agents" (arXiv 2024), Yuan et al. Paper

- "Certifiably Robust RAG against Retrieval Corruption" (arXiv 2024), Chong Xiang et al. Paper
- "AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases" (NeurIPS 2024), Zhaorun Chen et al. Paper
- "PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models" (arXiv 2024), Wei Zou et al. Paper
- "Poisoning Retrieval Corpora by Injecting Adversarial Passages" (arXiv 2023), Zexuan Zhong et al. Paper
- "Agent-SafetyBench: Evaluating the Safety of LLM Agents" (arXiv 2024), Zhexin Zhang et al. Paper
- "Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast" (ICML 2024), Xiangming Gu et al. Paper
- "Typos that Broke the RAG's Back: Genetic Attack on RAG Pipeline by Simulating Documents in the Wild via Low-level Perturbations" (arXiv 2024), Sukmin Cho et al. Paper
- "Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks" (arXiv 2025), Ang Li et al. Paper
- "The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)" (arXiv 2024), Shenglai Zeng et al. Paper
- "Is My Data in Your Retrieval Database? Membership Inference Attacks Against Retrieval Augmented Generation" (arXiv 2024), Maya Anderson et al. Paper
- "RAG-Thief: Scalable Extraction of Private Data from Retrieval-Augmented Generation Applications with Agent-based Attacks" (arXiv 2024), Changyue Jiang et al. Paper
- "Text Embeddings Reveal (Almost) As Much As Text" (arXiv 2023), John X. Morris et al. Paper
- "Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence" (arXiv 2023), Haoran Li et al. Paper
- "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack" (arXiv 2024), Mark Russinovich et al. Paper
- "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet" (arXiv 2024), Nathaniel Li et al. Paper
- "Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks" (arXiv 2024), Yixin Cheng et al. Paper
- "FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)" (arXiv 2024), Aman Priyanshu et al. Paper
- "Prompt Leakage effect and defense strategies for multi-turn LLM interactions" (arXiv 2024), Divyansh Agarwal et al. Paper
- "Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers" (arXiv 2024), Terry Tong et al. Paper

- "TrustRAG: Enhancing Robustness and Trustworthiness in RAG" (arXiv 2025), Huichi Zhou et al. Paper
- "On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains" (arXiv 2024), Xun Xian et al. Paper
- "Prompt Leakage effect and defense strategies for multi-turn LLM interactions" (arXiv 2024), Divyansh Agarwal et al. Paper
- "Agent-SafetyBench: Evaluating the Safety of LLM Agents" (arXiv 2024), Zhexin Zhang et al. Paper
- "'Ghost of the past': Identifying and Resolving Privacy Leakage of LLM's Memory Through Proactive User Interaction" (arXiv 2024), Shuning Zhang et al. Paper
- "Is My Data in Your Retrieval Database? Membership Inference Attacks Against Retrieval Augmented Generation" (arXiv 2024), Maya Anderson et al. Paper
- "Certifiably Robust RAG against Retrieval Corruption" (arXiv 2024), Chong Xiang et al. Paper
- "Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots" (RAID 2023), Bocheng Chen et al. Paper

- "PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models" (arXiv 2024), Wei Zou et al. Paper
- "AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases" (NeurIPS 2024), Zhaorun Chen et al. Paper
- "The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)" (arXiv 2024), Shenglai Zeng et al. Paper
- "RAG-Thief: Scalable Extraction of Private Data from Retrieval-Augmented Generation Applications with Agent-based Attacks" (arXiv 2024), Changyue Jiang et al. Paper

- "An Evaluation Mechanism of LLM-based Agents on Manipulating APIs" (EMNLP 2024), Liu et al. Paper
- "AI- and LLM-driven search tools: A paradigm shift in information access for education and research" (Journal of Information Science 2024), Chowdhury et al. Paper
- "Ufo: A UI-focused agent for Windows OS interaction" (arXiv 2024), Zhang et al. Paper
- "Easytool: Enhancing LLM-based agents with concise tool instruction" (arXiv 2024), Yuan et al. Paper
- "LLM with tools: A survey" (arXiv 2024), Shen et al. Paper
- "ToolQA: A dataset for LLM question answering with external tools" (NeurIPS 2023), Zhuang et al. Paper
- "Agent-SafetyBench: Evaluating the Safety of LLM Agents" (arXiv 2024), Zhang et al. Paper
- "Security Attacks on LLM-based Code Completion Tools" (arXiv 2024), Cheng et al. Paper
- "Imprompter: Tricking LLM Agents into Improper Tool Use" (arXiv 2024), Fu et al. Paper
- "Misusing tools in large language models with visual adversarial examples" (arXiv 2023), Fu et al. Paper
- "Breaking agents: Compromising autonomous LLM agents through malfunction amplification" (arXiv 2024), Zhang et al. Paper
- "From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection" (arXiv 2024), Wang et al. Paper
- "Mimicking the Familiar: Dynamic Command Generation for Information Theft Attacks in LLM Tool-Learning System" (arXiv 2025), Jiang et al. Paper
- "Attacks on third-party APIs of large language models" (arXiv 2024), Zhao et al. Paper
- "LLM agents can autonomously exploit one-day vulnerabilities" (arXiv 2024), Fang et al. Paper
- "BadAgent: Inserting and activating backdoor attacks in LLM agents" (arXiv 2024), Wang et al. Paper
- "Refusal-trained LLMs are easily jailbroken as browser agents" (arXiv 2024), Kumar et al. Paper

- "GuardAgent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning" (arXiv 2024), Xiang et al. Paper
- "AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration" (arXiv 2025), Chen et al. Paper

- "Toolsword: Unveiling safety issues of large language models in tool learning across three stages" (arXiv 2024), Ye et al. Paper
- "InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents" (arXiv 2024), Zhan et al. Paper
- "AgentHarm: A benchmark for measuring harmfulness of LLM agents" (arXiv 2024), Andriushchenko et al. Paper
- "PrivacyLens: Evaluating privacy norm awareness of language models in action" (NeurIPS 2024), Shao et al. Paper
- "Identifying the risks of LM agents with an LM-emulated sandbox" (arXiv 2023), Ruan et al. Paper
- "Haicosystem: An ecosystem for sandboxing safety risks in human-AI interactions" (arXiv 2024), Zhou et al. Paper

- "Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities" (arXiv 2024), Tianjie Ju et al. Paper
- "Red-Teaming LLM Multi-Agent Systems via Communication Attacks" (arXiv 2025), Pengfei He et al. Paper
- "MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate" (arXiv 2024), Alfonso Amayuelas et al. Paper

- "Evil Geniuses: Delving into the Safety of LLM-based Agents" (arXiv 2023), Yu Tian et al. Paper
- "Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems" (arXiv 2024), Donghyun Lee et al. Paper
- "CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models" (arXiv 2024), Zhenhong Zhou et al. Paper
- "Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast" (ICML 2024), Xiangming Gu et al. Paper
- "The Wolf Within: Covert Injection of Malice into MLLM Societies via An MLLM Operative" (arXiv 2024), Zhen Tan et al. Paper
- "NetSafe: Exploring the Topological Safety of Multi-agent Network" (arXiv 2024), Miao Yu et al. Paper

- "BlockAgents: Towards Byzantine-Robust LLM-Based Multi-Agent Coordination via Blockchain" (TURC 2024), Bei Chen et al. Paper
- "Audit-LLM: Multi-Agent Collaboration for Log-based Insider Threat Detection" (arXiv 2024), Chengyu Song et al. Paper
- "Combating Adversarial Attacks with Multi-Agent Debate" (arXiv 2024), Steffi Chern et al. Paper
- "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks" (arXiv 2024), Yifan Zeng et al. Paper
- "Large Language Model Sentinel: LLM Agent for Adversarial Purification" (arXiv 2024), Guang Lin et al. Paper
- "PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety" (arXiv 2024), Zaibin Zhang et al. Paper
- "GPTSwarm: Language Agents as Optimizable Graphs" (ICML 2024), Mingchen Zhuge et al. Paper
- "G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems" (arXiv 2025), Shilong Wang et al. Paper

- "SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents" (arXiv 2024), Sheng Yin et al. Paper
- "R-Judge: Benchmarking Safety Risk Awareness for LLM Agents" (arXiv 2024), Tongxin Yuan et al. Paper
- "JailJudge: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework" (arXiv 2024), Fan Liu et al. Paper

- "Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents" (ICRA 2024), Ziyi Yang et al. Paper
- "SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models" (arXiv 2024), Yi Wu et al. Paper
- "Enhancing LLM-based Autonomous Driving Agents to Mitigate Perception Attacks" (arXiv 2024), Ruoyu Song et al. Paper
- "ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles" (CVPR 2024), Jiawei Zhang et al. Paper
- "Autonomous Industrial Control using an Agentic Framework with Large Language Models" (arXiv 2024), Javal Vyas et al. Paper
- "Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents" (arXiv 2024), Zihan Liu et al. Paper

- "LLM Agents can Autonomously Hack Websites" (arXiv 2024), Richard Fang et al. Paper
- "AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents" (NeurIPS 2024), Edoardo Debenedetti et al. Paper
- "GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning" (arXiv 2024), Zhen Xiang et al. Paper
- "Polaris: A Safety-focused LLM Constellation Architecture for Healthcare" (arXiv 2024), Subhabrata Mukherjee et al. Paper
- "Position: Standard Benchmarks Fail – LLM Agents Present Overlooked Risks for Financial Applications" (arXiv 2025), Zichen Chen et al. Paper
- "Enhancing Anomaly Detection in Financial Markets with an LLM-based Multi-Agent Framework" (arXiv 2024), Taejin Park. Paper
- "A Hybrid Attention Framework for Fake News Detection with Large Language Models" (NLPCC 2024), Korir Nancy Jeptoo et al. Paper
- "Safeguarding Decentralized Social Media: LLM Agents for Automating Community Rule Compliance" (arXiv 2024), Lucio La Cava et al. Paper

- "The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies" (arXiv 2024), Feng He et al. Paper
- "Privacy Leakage Overshadowed by Views of AI: A Study on Human Oversight of Privacy in Language Model Agent" (arXiv 2024), Zhiping Zhang et al. Paper
- "Empowering Users in Digital Privacy Management through Interactive LLM-based Agents" (arXiv 2024), Bolun Sun et al. Paper
| Survey | Object | Multi-Dimension | Modular | Technique | MAS |
|---|---|---|---|---|---|
| Liu et al. | LLM | ✅ | ❌ | Atk/Eval | ❌ |
| Huang et al. | LLM | ✅ | ❌ | Eval | ❌ |
| He et al. | Agent | ❌ | ❌ | Atk/Def | ❌ |
| Li et al. | Agent | ✅ | ❌ | Atk | ❌ |
| Wang et al. | Agent | ❌ | ❌ | Atk | ❌ |
| Deng et al. | Agent | ❌ | ✅ | Atk/Def | ✅ |
| Gan et al. | Agent | ✅ | ❌ | Atk/Def/Eval | ❌ |
| TrustAgent (Ours) | LLM + Agent | ✅ | ✅ | Atk/Def/Eval | ✅ |

*Atk = Attack, Def = Defense, Eval = Evaluation.*
If you find this survey useful for your research, please cite us:
@article{yu2025survey,
title={A Survey on Trustworthy LLM Agents: Threats and Countermeasures},
author={Yu, Miao and Meng, Fanci and Zhou, Xinyun and Wang, Shilong and Mao, Junyuan and Pang, Linsey and Chen, Tianlong and Wang, Kun and Li, Xinfeng and Zhang, Yongfeng and others},
journal={arXiv preprint arXiv:2503.09648},
year={2025}
}

We welcome contributions! Feel free to submit issues or pull requests to improve the repository.
For any questions or discussions, please reach out to the authors:
- Miao Yu: [email protected]
- Xinfeng Li: [email protected]
- Qingsong Wen: [email protected]