Stars
TrustJudge is a probabilistic evaluation framework that reduces score-comparison and pairwise transitivity inconsistencies in LLM-as-a-judge systems.
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
AudioBench: A Universal Benchmark for Audio Large Language Models
A toolkit for processing speech data and creating speech datasets
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
Accessibility engine for automated Web UI testing
Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
Inference-time scaling for LLMs-as-a-judge.
Langtrace 🔍 is an open-source, OpenTelemetry-based end-to-end observability tool for LLM applications, providing real-time tracing, evaluations and metrics for popular LLMs, LLM frameworks, vector…
Generate an infinite and diversified dataset using an agentic system!
Collection of awesome LLM apps with AI Agents and RAG using OpenAI, Anthropic, Gemini and open-source models.
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthr…
A collection of sample agents built with Agent Development Kit (ADK)
A Medical / Clinical Note Taking Demo Application using Deepgram Voice Agent API
Evaluation and Tracking for LLM Experiments and AI Agents
Extendable toolkit for comprehensive evaluation of ASR systems. Currently supports benchmarking 29 system-model combinations for Polish using BIGOS datasets.