Notebooks for evaluating LLM outputs using various metrics, covering scenarios with and without known ground truth. Includes criteria such as correctness, coherence, relevance, and more, providing a comprehensive approach to assess LLM performance accurately and efficiently.
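A minimal sketch of the two evaluation modes, assuming the OpenAI Python client; the function names and the 1-5 judging prompt are illustrative, not taken from the notebooks:

```python
# Illustrative sketch of reference-based vs. reference-free evaluation.
# Function names and the judging prompt are assumptions, not from the notebooks.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def correctness_with_ground_truth(answer: str, reference: str) -> float:
    """Reference-based check: crude token-overlap F1 against a known answer."""
    pred, ref = set(answer.lower().split()), set(reference.lower().split())
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def coherence_without_ground_truth(answer: str) -> int:
    """Reference-free check: ask a judge model to rate coherence 1-5."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rate the coherence of this text from 1 (incoherent) "
                       f"to 5 (fully coherent). Reply with one digit only.\n\n{answer}",
        }],
    )
    return int(resp.choices[0].message.content.strip())
```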
Agentic Workflow Evaluation: Text Summarization Agent. This project implements an AI-agent evaluation workflow around a text summarization model, built with the OpenAI API and the Transformers library. It follows an iterative approach: generate summaries, analyze metrics, adjust parameters, and retest to refine the agent for accuracy, readability, and performance.
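The loop below is a minimal sketch of that generate-analyze-adjust-retest cycle, assuming the OpenAI Python client and the `evaluate` library for ROUGE scoring; the temperature schedule and the 0.4 ROUGE-L target are invented for illustration:

```python
# Illustrative generate-score-adjust loop; thresholds and the parameter being
# tuned (temperature) are assumptions, not the project's actual settings.
import evaluate
from openai import OpenAI

client = OpenAI()
rouge = evaluate.load("rouge")

def summarize(text: str, temperature: float, max_tokens: int) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    return resp.choices[0].message.content

def refine(text: str, reference: str, target_rouge_l: float = 0.4) -> str:
    temperature = 0.7
    for _ in range(3):  # generate, measure, adjust, retest
        summary = summarize(text, temperature, max_tokens=128)
        score = rouge.compute(predictions=[summary], references=[reference])["rougeL"]
        if score >= target_rouge_l:
            return summary
        temperature = max(0.0, temperature - 0.3)  # cool down for more fidelity
    return summary
```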
LLM evaluation framework for application modernization. Assesses AI-generated code fixes for Konveyor rule violations across functional correctness, quality, security, and explainability. Features Grafana-style reporting to support cost estimation and model selection during migrations.
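A hypothetical sketch of how fixes might be scored across those four dimensions and ranked for model selection; the weights, field names, and the score-per-dollar heuristic are assumptions, not the framework's actual rubric:

```python
# Hypothetical rubric aggregation; dimension names mirror the description above,
# but the weights and structures are illustrative only.
from dataclasses import dataclass

@dataclass
class FixScore:
    functional_correctness: float  # 0-1, does the fix resolve the rule violation?
    quality: float                 # 0-1, readability / maintainability
    security: float                # 0-1, introduces no new vulnerabilities
    explainability: float          # 0-1, quality of the model's rationale
    cost_usd: float                # API spend for this fix

WEIGHTS = {"functional_correctness": 0.4, "quality": 0.2,
           "security": 0.3, "explainability": 0.1}

def weighted_score(s: FixScore) -> float:
    return sum(getattr(s, dim) * w for dim, w in WEIGHTS.items())

def score_per_dollar(s: FixScore) -> float:
    """One way to rank candidate models for a migration: quality per dollar."""
    return weighted_score(s) / max(s.cost_usd, 1e-9)
```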
Cloudflare Workers app that watches Wikipedia for newly reported notable deaths, LLM-filters and de-duplicates them, then publishes concise memorial posts (Telegram + X) via a lightweight public JSON API. Automates detection, verification, and multi-platform distribution with low latency and minimal ops overhead.
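The app itself is a Cloudflare Worker (JavaScript/TypeScript); the Python sketch below only illustrates the filter-and-deduplicate step, with a stand-in `seen` set where the Worker would use durable storage:

```python
# Python sketch of the de-duplication logic only; the real app runs on
# Cloudflare Workers. Names and the storage choice are assumptions.
import hashlib

seen: set[str] = set()  # in a Worker this would be durable storage (e.g. KV)

def dedupe_key(name: str, date: str) -> str:
    """Stable key so the same death reported twice is published only once."""
    normalized = " ".join(name.lower().split())
    return hashlib.sha256(f"{normalized}|{date}".encode()).hexdigest()

def should_publish(name: str, date: str, is_notable: bool) -> bool:
    """is_notable would come from the LLM filter upstream."""
    key = dedupe_key(name, date)
    if not is_notable or key in seen:
        return False
    seen.add(key)
    return True
```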
This repo contains my coding notebook for the tutorial series I made for the beginner-level bias bounty challenge hosted by Humane Intelligence. I am an AI Ethics Fellow at Humane Intelligence.
LLMBuilder is a production-ready framework for training and fine-tuning Large Language Models (LLMs), not a model itself. Designed for developers, researchers, and AI engineers, LLMBuilder provides a full pipeline to go from raw text data to deployable, optimized LLMs, all running locally on CPUs or GPUs.
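The snippet below shows the general shape of such a raw-text-to-fine-tuned-model pipeline, using Hugging Face libraries as a stand-in; it is not LLMBuilder's actual API, and the file name and hyperparameters are placeholders:

```python
# Generic raw-text -> fine-tuned model pipeline with Hugging Face libraries as a
# stand-in; NOT LLMBuilder's API, just the shape of the pipeline it automates.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")  # runs on CPU if no GPU

# "corpus.txt" is a placeholder for your raw text data.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("out/final")  # deployable checkpoint
```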
Comparative study of 23 LLMs for Brazilian Portuguese sentiment analysis via in-context learning. Evaluates multilingual vs Portuguese-specialized models across 12 datasets. Code and data included.
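A minimal sketch of the in-context-learning setup, assuming the OpenAI Python client; the few-shot examples below are invented for illustration and are not drawn from the study's 12 datasets:

```python
# Minimal in-context-learning sketch for Brazilian Portuguese sentiment; the
# few-shot examples are invented here, not taken from the paper's datasets.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [
    ("Adorei o produto, chegou rápido!", "positivo"),
    ("Péssimo atendimento, não recomendo.", "negativo"),
]

def classify_sentiment(text: str, model: str = "gpt-4o-mini") -> str:
    shots = "\n".join(f"Texto: {t}\nSentimento: {s}" for t, s in FEW_SHOT)
    prompt = ("Classifique o sentimento como positivo, negativo ou neutro.\n\n"
              f"{shots}\nTexto: {text}\nSentimento:")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()
```

Swapping the `model` argument is how a study like this compares multilingual and Portuguese-specialized LLMs on the same prompts.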