UIUC - Illinois - hanningzhang.github.io
Stars
A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.
The official implementation of "Self-play LLM Theorem Provers with Iterative Conjecturing and Proving"
Post-training with Tinker
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Build resilient language agents as graphs.
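LangGraph models an agent as a graph of nodes that read and update a shared state. A minimal sketch of that pattern, assuming the langgraph package is installed (the node body is a placeholder, not this library's own logic):

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    question: str
    answer: str


def answer_node(state: State) -> dict:
    # Placeholder node: a real agent would call an LLM or a tool here.
    return {"answer": f"echo: {state['question']}"}


builder = StateGraph(State)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)

graph = builder.compile()
print(graph.invoke({"question": "What is a state graph?"}))
```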
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows.
[ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
AIRA-dojo: a framework for developing and evaluating AI research agents
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
Build effective agents using Model Context Protocol and simple workflow patterns
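One of the "simple workflow patterns" referenced above is prompt chaining, where each LLM call's output becomes the next call's input. A minimal sketch in plain Python; call_llm is a hypothetical stub standing in for any LLM client, not this repo's API:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub: replace with a real LLM client call.
    return f"[llm output for: {prompt[:40]}...]"


def chain(document: str) -> str:
    # Prompt chaining: each step's output feeds the next step's prompt.
    outline = call_llm(f"Outline the key points of:\n{document}")
    draft = call_llm(f"Write a summary following this outline:\n{outline}")
    return call_llm(f"Tighten this summary to three sentences:\n{draft}")


print(chain("Example document text."))
```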
SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challenges. [NeurIPS 2024]
SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
AIDE: AI-Driven Exploration in the Space of Code. The machine learning engineering agent that automates AI R&D.
An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
A toolkit for developing and comparing reinforcement learning algorithms.
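The two entries above cover the same interface lineage: Gymnasium standardizes the reset/step loop that Gym introduced. A minimal sketch of that loop, assuming gymnasium and its bundled CartPole-v1 environment are installed:

```python
import gymnasium as gym

# Create a reference environment and run one episode with random actions.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += float(reward)
    done = terminated or truncated

env.close()
print(f"episode return: {total_reward}")
```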
[ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data Science Experts?
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
The official implementation of "ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering"
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
Fully open reproduction of DeepSeek-R1
Fully open data curation for reasoning models
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.
Google Research
WIP - Automated Question Answering for ArXiv Papers with Large Language Models (https://arxiv.taesiri.xyz/)