
A Self Improving + Autonomous Alignment Agent Correcting Other AI Agents when Showing Rogue Behaviors Performing ML Workflows. | Hackathon submission for SF Agent Hackathon


glo26/agentshield


AgentShield

A Self-Improving + Autonomous Agent Made to Align ML Workflows.

Inspiration

In my work as an AI Alignment Fellow, I came across AI agents performing ML workflows that showed several critical rogue behaviors. The most common were data_leakage, access_dev_set, and agent_impatience. For this hackathon, I am presenting AgentShield.

What is AgentShield?

AgentShield is a self-improving, fully autonomous AI agent that helps other AI agents, in this case those conducting ML workflows, stay aligned with golden trajectories, that is, reference runs free of rogue behaviors.

Tech Stack:

Claude Agent SDK [Opus]: The agent performs autonomous threat analysis by comparing agent trajectories against learned baseline patterns and a taxonomy of 11 rogue behaviors. It decides independently whether an agent is malicious, with 95%+ confidence and no human intervention required.
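To illustrate what checking a trajectory against a behavior taxonomy can look like, here is a minimal sketch. It does not call the Claude Agent SDK: simple hand-written rules stand in for the model-based analysis. The `Step` shape, the rule definitions, and `scanTrajectory` are all hypothetical, and the meaning assigned to each behavior name is a guess for illustration only.

```typescript
// Hypothetical trajectory format; the project does not document its real one.
type Step = { action: string; target: string };

// The three behaviors named in this README (the full taxonomy has 11).
type RogueBehavior = "data_leakage" | "access_dev_set" | "agent_impatience";

interface Finding {
  stepIndex: number;
  behavior: RogueBehavior;
}

// Toy rules standing in for the model-based analysis. The conditions are
// assumptions about what each behavior might mean, not the author's definitions.
const rules: Record<RogueBehavior, (step: Step) => boolean> = {
  data_leakage: (s) => s.action === "fit" && s.target.includes("test"),
  access_dev_set: (s) => s.action === "read" && s.target.includes("dev"),
  agent_impatience: (s) => s.action === "stop_training_early",
};

// Check every step against every rule and record each match.
function scanTrajectory(steps: Step[]): Finding[] {
  const findings: Finding[] = [];
  steps.forEach((step, stepIndex) => {
    for (const behavior of Object.keys(rules) as RogueBehavior[]) {
      if (rules[behavior](step)) {
        findings.push({ stepIndex, behavior });
      }
    }
  });
  return findings;
}

const trajectory: Step[] = [
  { action: "read", target: "data/train.csv" },
  { action: "read", target: "data/dev.csv" },
  { action: "fit", target: "data/train.csv" },
];

const findings = scanTrajectory(trajectory);
console.log(findings);
```

Reporting findings per step rather than per trajectory keeps the result usable for step-level correction, since the flagged step index says where the run diverged.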

Redis MCP: We used Redis to store clean workflow patterns during the training phase and to cache known violations for instant recognition. This speeds up repeated detections from 8 seconds to 0.8 seconds (10x faster) and makes AgentShield self-improving through continuous learning.
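The caching idea can be sketched as follows. This is an assumption-laden illustration, not the project's code: an in-memory `Map` stands in for Redis, and `createCachedChecker`, `Verdict`, and the stubbed analyzer are hypothetical names. The point is only that a trajectory seen before returns its stored verdict without re-running the slow analysis.

```typescript
// Hypothetical verdict shape for a single trajectory.
type Verdict = { rogue: boolean; behavior?: string };

// Wrap a slow analyzer with a lookup table. A Map stands in for Redis here.
function createCachedChecker(analyze: (trajectory: string[]) => Verdict) {
  const store = new Map<string, Verdict>();
  const stats = { hits: 0, misses: 0 };

  function check(trajectory: string[]): Verdict {
    // Serialize the steps into a cache key; identical runs map to the same key.
    const key = JSON.stringify(trajectory);
    const cached = store.get(key);
    if (cached !== undefined) {
      stats.hits += 1;
      return cached;
    }
    stats.misses += 1;
    const verdict = analyze(trajectory);
    store.set(key, verdict);
    return verdict;
  }

  return { check, stats };
}

// Stub analyzer that counts its calls; the real one would be the model call.
let analyzerCalls = 0;
const checker = createCachedChecker((trajectory) => {
  analyzerCalls += 1;
  const rogue = trajectory.includes("read:dev");
  return rogue ? { rogue, behavior: "access_dev_set" } : { rogue };
});

const firstVerdict = checker.check(["read:train", "read:dev"]);
const secondVerdict = checker.check(["read:train", "read:dev"]);
console.log(firstVerdict, secondVerdict, checker.stats, analyzerCalls);
```

An exact-match key only helps for identical trajectories. Recognizing near-duplicates would need a normalized or fuzzy key, which this sketch does not attempt.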

Skyflow: We used Skyflow to tokenize sensitive data (API keys, passwords, secrets) before Claude analyzes it, reducing token count by 60% and ensuring PCI/HIPAA compliance. This protects credentials from leaking during security analysis while cutting inference costs.
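As a rough picture of tokenizing secrets before analysis, the sketch below swaps secret values for placeholder tokens and keeps the mapping locally. It does not use the Skyflow SDK: the regular expression, the `<TOKEN_n>` format, and the `tokenizeSecrets` and `detokenize` helpers are hypothetical stand-ins for a real vault.

```typescript
// Replace secret values with opaque tokens before text is sent for analysis.
// The local Map plays the role of the vault in this illustration.
function tokenizeSecrets(text: string): { redacted: string; vault: Map<string, string> } {
  const vault = new Map<string, string>();
  // Assumed pattern: key=value pairs whose key is api_key, password, or secret.
  const pattern = /\b(api_key|password|secret)\s*=\s*(\S+)/gi;
  let counter = 0;
  const redacted = text.replace(pattern, (_match: string, name: string, value: string) => {
    counter += 1;
    const token = `<TOKEN_${counter}>`;
    vault.set(token, value);
    return `${name}=${token}`;
  });
  return { redacted, vault };
}

// Restore the original values, for example when displaying results locally.
function detokenize(text: string, vault: Map<string, string>): string {
  let restored = text;
  vault.forEach((value, token) => {
    restored = restored.split(token).join(value);
  });
  return restored;
}

const logLine = "export API_KEY=sk-12345 && python train.py --password=hunter2";
const { redacted, vault } = tokenizeSecrets(logLine);
console.log(redacted);
console.log(detokenize(redacted, vault) === logLine);
```

Only the redacted string would be passed to the model. The mapping never leaves the caller, which is the property a tokenization vault is meant to provide.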

This is a Next.js project bootstrapped with create-next-app.

Getting Started

First, run the development server:

npm run dev
# or
yarn dev
# or
pnpm dev
# or
bun dev

Open http://localhost:3000 with your browser to see the result.

You can start editing the page by modifying app/page.tsx. The page auto-updates as you edit the file.

This project uses next/font to automatically optimize and load Geist, a new font family for Vercel.

Learn More

To learn more about Next.js, take a look at the following resources:

You can check out the Next.js GitHub repository - your feedback and contributions are welcome!

Deploy on Vercel

Challenges While Building:

Integration was challenging, but we overcame it just fine.

Accomplishments that we're proud of

Deploying something that actually works and appears able to self-correct trajectories that show rogue behaviors until none remain. If it continues to self-correct this way, it could eventually produce its own guardrails and keep track of other AI agents performing ML workflows.

Learning Takeaway:

We are at the tip of the iceberg with AI agent security, and we are just beginning to shield AI agents in a way that keeps them containable.

What's next for AgentShield

To turn this into a policy that other AI agents performing ML workflows can implement, and to publish a paper and code on a step-level security benchmark for AI agents in production, following STeCa: https://arxiv.org/abs/2502.14276 | https://github.com/WangHanLinHenry/STeCa

Contact:

E. [email protected] LinkedIn: https://linkedin.com/in/gloriafelicia

The easiest way to deploy your Next.js app is to use the Vercel Platform from the creators of Next.js.

Check out our Next.js deployment documentation for more details.
