University of Illinois Urbana-Champaign
- https://alphapav.github.io/
Stars
image scaling attacks for multi-modal prompt injection
AndroidWorld is an environment and benchmark for autonomous agents
An Illusion of Progress? Assessing the Current State of Web Agents
Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models"
[NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Repo for the paper "Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks".
Official release of code for the paper "RL is a hammer and LLMs are nails: A simple RL approach to stronger prompt injection attacks"
Open-source implementation of AlphaEvolve
Get your documents ready for gen AI
[NeurIPS 2025] Latent Zoning Networks
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >74% on SWE-bench verified!
An open-source AI agent that brings the power of Gemini directly into your terminal.
🔮Reasoning for Safer Code Generation; 🥇Winner Solution of Amazon Nova AI Challenge 2025
An open-source AI agent that lives in your terminal.
👩‍⚖️ Agent-as-a-Judge: The Magic for Open-Endedness
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.
Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
🌐 Make websites accessible for AI agents. Automate tasks online with ease.
Benchmark for automated failure attributions in agentic systems (🏆 ICML 2025 Spotlight)
Fair-code workflow automation platform with native AI capabilities. Combine visual building with custom code, self-host or cloud, 400+ integrations.
A benchmark for LLMs on complicated tasks in the terminal
An AI agent system for solving International Mathematical Olympiad (IMO) problems using Google's Gemini, OpenAI, and XAI APIs.
BigOBench assesses the capacity of Large Language Models (LLMs) to comprehend time-space computational complexity of input or generated code.
🚀 The fast, Pythonic way to build MCP servers and clients
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Pocket Flow: 100-line LLM framework. Let Agents build Agents!
AgentCoder: multi-agent code generation framework.