evals

Here are 93 public repositories matching this topic...

mastra-ai / mastra

The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.

nodejs javascript typescript ai reactjs mcp nextjs tts chatbots workflows agents llm evals

Updated Nov 16, 2025
TypeScript

Arize-ai / phoenix

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

Updated Nov 14, 2025
Jupyter Notebook

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI.

agent ai openai evaluation-metrics mistral cost-estimation autogen groq agentops llm langchain anthropic evals ollama crewai agents-sdk openai-agents

Updated Oct 30, 2025
Python

Kiln-AI / Kiln

The easiest tool for fine-tuning LLMs, synthetic data generation, and collaborating on datasets.

Updated Nov 16, 2025
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Nov 14, 2025
Python

lmnr-ai / lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

Updated Nov 15, 2025
TypeScript

mattpocock / evalite

Evaluate your LLM-powered apps with TypeScript

typescript ai evals

Updated Nov 15, 2025
TypeScript

superlinear-ai / raglite

🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL

markdown pdf postgres sqlite postgresql reranking rag vector-search duckdb colbert llm pgvector chainlit retrieval-augmented-generation evals late-interaction late-chunking query-adapter

Updated Nov 5, 2025
Python

keshik6 / HourVideo

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

navigation perception summarization reasoning visual-reasoning egocentric-videos gpt-4 multiple-choice-questions benchmark-dataset video-language-understanding multimodal-large-language-models evals gemini-pro spatial-intelligence neurips-2024 1-hour-video-language-understanding long-form-video-language-understanding long-context-understanding

Updated Jul 12, 2025
Jupyter Notebook

microsoft / promptpex

Test Generation for Prompts

testing evaluations prompt-engineering llms chatgpt evals gpt-4o

Updated Nov 16, 2025
TeX

METR / vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

ai elicitation ai-evaluation evals

Updated Nov 11, 2025
TypeScript

mclenhard / mcp-evals

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

ai mcp evals