Notebooks for evaluating LLM outputs using various metrics, covering scenarios with and without known ground truth. Includes criteria such as correctness, coherence, relevance, and more, providing a comprehensive approach to assess LLM performance accurately and efficiently.
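A minimal sketch of the two evaluation modes, assuming the OpenAI Python client; the function names and the 1-5 judging prompt are illustrative, not taken from the notebooks:

```python
# Illustrative sketch of reference-based vs. reference-free evaluation.
# Function names and the judging prompt are assumptions, not from the notebooks.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def correctness_with_ground_truth(answer: str, reference: str) -> float:
    """Reference-based check: crude token-overlap F1 against a known answer."""
    pred, ref = set(answer.lower().split()), set(reference.lower().split())
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def coherence_without_ground_truth(answer: str) -> int:
    """Reference-free check: ask a judge model to rate coherence 1-5."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rate the coherence of this text from 1 (incoherent) "
                       f"to 5 (fully coherent). Reply with one digit only.\n\n{answer}",
        }],
    )
    return int(resp.choices[0].message.content.strip())
```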
Agentic Workflow Evaluation: Text Summarization Agent. This project implements an AI-agent evaluation workflow around a text summarization model, built with the OpenAI API and the Transformers library. It follows an iterative approach: generate summaries, analyze metrics, adjust parameters, and retest to refine the agent for accuracy, readability, and performance.
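The loop below is a minimal sketch of that generate-analyze-adjust-retest cycle, assuming the OpenAI Python client and the `evaluate` library for ROUGE scoring; the temperature schedule and the 0.4 ROUGE-L target are invented for illustration:

```python
# Illustrative generate-score-adjust loop; thresholds and the parameter being
# tuned (temperature) are assumptions, not the project's actual settings.
import evaluate
from openai import OpenAI

client = OpenAI()
rouge = evaluate.load("rouge")

def summarize(text: str, temperature: float, max_tokens: int) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    return resp.choices[0].message.content

def refine(text: str, reference: str, target_rouge_l: float = 0.4) -> str:
    temperature = 0.7
    for _ in range(3):  # generate, measure, adjust, retest
        summary = summarize(text, temperature, max_tokens=128)
        score = rouge.compute(predictions=[summary], references=[reference])["rougeL"]
        if score >= target_rouge_l:
            return summary
        temperature = max(0.0, temperature - 0.3)  # cool down for more fidelity
    return summary
```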
LLM evaluation framework for application modernization. Assesses AI-generated code fixes for Konveyor rule violations across functional correctness, quality, security, and explainability. Features Grafana-style reporting to support cost estimation and model selection during migrations.
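A hypothetical sketch of how fixes might be scored across those four dimensions and ranked for model selection; the weights, field names, and the score-per-dollar heuristic are assumptions, not the framework's actual rubric:

```python
# Hypothetical rubric aggregation; dimension names mirror the description above,
# but the weights and structures are illustrative only.
from dataclasses import dataclass

@dataclass
class FixScore:
    functional_correctness: float  # 0-1, does the fix resolve the rule violation?
    quality: float                 # 0-1, readability / maintainability
    security: float                # 0-1, introduces no new vulnerabilities
    explainability: float          # 0-1, quality of the model's rationale
    cost_usd: float                # API spend for this fix

WEIGHTS = {"functional_correctness": 0.4, "quality": 0.2,
           "security": 0.3, "explainability": 0.1}

def weighted_score(s: FixScore) -> float:
    return sum(getattr(s, dim) * w for dim, w in WEIGHTS.items())

def score_per_dollar(s: FixScore) -> float:
    """One way to rank candidate models for a migration: quality per dollar."""
    return weighted_score(s) / max(s.cost_usd, 1e-9)
```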
Cloudflare Workers app that watches Wikipedia for newly reported notable deaths, LLM-filters and de-duplicates them, then publishes concise memorial posts (Telegram + X) via a lightweight public JSON API. Automates detection, verification, and multi-platform distribution with low latency and minimal ops overhead.
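The app itself is a Cloudflare Worker (JavaScript/TypeScript); the Python sketch below only illustrates the filter-and-deduplicate step, with a stand-in `seen` set where the Worker would use durable storage:

```python
# Python sketch of the de-duplication logic only; the real app runs on
# Cloudflare Workers. Names and the storage choice are assumptions.
import hashlib

seen: set[str] = set()  # in a Worker this would be durable storage (e.g. KV)

def dedupe_key(name: str, date: str) -> str:
    """Stable key so the same death reported twice is published only once."""
    normalized = " ".join(name.lower().split())
    return hashlib.sha256(f"{normalized}|{date}".encode()).hexdigest()

def should_publish(name: str, date: str, is_notable: bool) -> bool:
    """is_notable would come from the LLM filter upstream."""
    key = dedupe_key(name, date)
    if not is_notable or key in seen:
        return False
    seen.add(key)
    return True
```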
This repo contains my coding notebook for the tutorial series I made for the beginner-level bias bounty challenge hosted by Humane Intelligence. I am an AI Ethics Fellow at Humane Intelligence.
LLMBuilder is a production-ready framework for training and fine-tuning Large Language Models (LLMs), not a model itself. Designed for developers, researchers, and AI engineers, LLMBuilder provides a full pipeline to go from raw text data to deployable, optimized LLMs, all running locally on CPUs or GPUs.
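The snippet below shows the general shape of such a raw-text-to-fine-tuned-model pipeline, using Hugging Face libraries as a stand-in; it is not LLMBuilder's actual API, and the file name and hyperparameters are placeholders:

```python
# Generic raw-text -> fine-tuned model pipeline with Hugging Face libraries as a
# stand-in; NOT LLMBuilder's API, just the shape of the pipeline it automates.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")  # runs on CPU if no GPU

# "corpus.txt" is a placeholder for your raw text data.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("out/final")  # deployable checkpoint
```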
Comparative study of 23 LLMs for Brazilian Portuguese sentiment analysis via in-context learning. Evaluates multilingual vs Portuguese-specialized models across 12 datasets. Code and data included.
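A minimal sketch of the in-context-learning setup, assuming the OpenAI Python client; the few-shot examples below are invented for illustration and are not drawn from the study's 12 datasets:

```python
# Minimal in-context-learning sketch for Brazilian Portuguese sentiment; the
# few-shot examples are invented here, not taken from the paper's datasets.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [
    ("Adorei o produto, chegou rápido!", "positivo"),
    ("Péssimo atendimento, não recomendo.", "negativo"),
]

def classify_sentiment(text: str, model: str = "gpt-4o-mini") -> str:
    shots = "\n".join(f"Texto: {t}\nSentimento: {s}" for t, s in FEW_SHOT)
    prompt = ("Classifique o sentimento como positivo, negativo ou neutro.\n\n"
              f"{shots}\nTexto: {text}\nSentimento:")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()
```

Swapping the `model` argument is how a study like this compares multilingual and Portuguese-specialized LLMs on the same prompts.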