Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.
Mathematical benchmark quantifying the performance gap between real agents and thin LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% confidence intervals, Cohen's h effect sizes) and a reproducible methodology. Separates architectural theater from real systems through stress testing, network-resilience testing, and failure analysis.
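The statistics named above are standard and easy to sanity-check. A minimal sketch (the success rates and sample size below are illustrative, not results from this benchmark):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size between two proportions via the arcsine transform."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def ci_95(successes: int, trials: int) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a success rate."""
    p = successes / trials
    half = 1.96 * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

# Hypothetical example: a real agent solves 83/100 tasks, a thin wrapper 41/100.
print(cohens_h(0.83, 0.41))  # ~0.90 -> "large" by Cohen's conventions (>= 0.8)
print(ci_95(83, 100))        # ~(0.756, 0.904)
```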
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
Intelligent Context Engineering Assistant for Multi-Agent Systems. Analyze, optimize, and enhance your AI agent configurations with AI-powered insights
End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spins up an isolated Docker container per task, extracts patches, scores them with the official harness, and aggregates success rates. As an example, we'll look at Junie and Google Gemini CLI.
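The scoring step hands extracted patches to the official SWE-bench harness. A hedged sketch of that hand-off (instance id, run label, and patch are placeholders; check the harness flags against the SWE-bench docs for your version):

```python
import json
import subprocess

# Hypothetical patch extracted from an agent run inside its task container.
predictions = [{
    "instance_id": "sympy__sympy-20590",           # a SWE-Bench Lite task id (placeholder)
    "model_name_or_path": "junie",                 # free-form label for the run
    "model_patch": "diff --git a/... b/...\n...",  # unified diff produced by the agent
}]

with open("preds.json", "w") as f:
    json.dump(predictions, f)

# Score with the official harness; flags per SWE-bench's docs, verify locally.
subprocess.run([
    "python", "-m", "swebench.harness.run_evaluation",
    "--dataset_name", "princeton-nlp/SWE-bench_Lite",
    "--predictions_path", "preds.json",
    "--max_workers", "4",
    "--run_id", "teamcity-demo",
], check=True)
```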
Neurosim is a Python framework for building, running, and evaluating AI agent systems. It provides core primitives for agent evaluation, cloud storage integration, and an LLM-as-a-judge system for automated scoring.
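Neurosim's own judge API isn't reproduced here, but the LLM-as-a-judge pattern it describes boils down to a rubric plus a grader model. A minimal sketch using the OpenAI chat-completions client (the grader model and `judge` helper are illustrative):

```python
from openai import OpenAI  # any chat-completions client works here

client = OpenAI()

RUBRIC = (
    "Score the agent's answer from 1 (useless) to 5 (fully correct and complete). "
    "Reply with the integer only."
)

def judge(task: str, agent_answer: str) -> int:
    """Minimal LLM-as-a-judge: ask a grader model to score an agent's output."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task:\n{task}\n\nAgent answer:\n{agent_answer}"},
        ],
        temperature=0,
    )
    # Assumes the grader obeys the rubric; production code should parse defensively.
    return int(resp.choices[0].message.content.strip())
```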
Experiments and analysis on reflection timing in reinforcement learning agents — exploring self-evaluation, meta-learning, and adaptive reflection intervals.
Benchmark framework for evaluating LLM agent continual learning in stateful environments. Features production-realistic CRM workflows with multi-turn conversations, state mutations, and cross-entity relationships. Extensible to additional domains
Browser automation agent for the Bunnings website, built with the browser-use library, orchestrated via the laminar framework, managed with uv for Python environments, and run in Brave Browser for stealth and CAPTCHA bypass.
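browser-use's API has changed across releases; this sketch follows the shape of its earlier quickstart (the task text and model are placeholders, and the laminar/Brave wiring is omitted):

```python
import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI  # any supported chat-model wrapper

async def main():
    # Illustrative task; the real project points the agent at bunnings.com.au.
    agent = Agent(
        task="Open bunnings.com.au and find the price of a cordless drill",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()

asyncio.run(main())
```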
Train a reinforcement learning agent with PPO to balance a pole on a cart in the CartPole-v0 environment, using Gymnasium and Stable-Baselines3. Includes model training, evaluation, and rendering in Python and Jupyter Notebook.
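The whole train-and-evaluate pipeline fits in a few lines of Stable-Baselines3 (the timestep budget below is a reasonable guess, not the repo's setting):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v0")        # v0 per the repo; CartPole-v1 is the current default
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=50_000)  # PPO converges quickly on CartPole

mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=20)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```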
Visual dashboard to evaluate multi-agent & RAG-based AI apps. Compare models on accuracy, latency, token usage, and trust metrics - powered by NVIDIA AgentIQ
🛠️ Discover and explore over 50 benchmarks for AI agents across key categories, enhancing evaluation of function calling, reasoning, coding, and interactions.