Starred repositories
Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages (ACL 2025)
Scalable data preprocessing and curation toolkit for LLMs
BirdNET analyzer for scientific audio data processing.
Identify bird sounds in real time with this Android version of BirdNET. Bird sound recognition for more than 6,000 species worldwide.
Easily use and train state-of-the-art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data…
Next-generation Punkt sentence boundary detection with zero dependencies
Visualize Different Text Splitting Methods
Sample code for deep learning & neural networks
Financial data platform for analysts, quants and AI agents.
Fast lexical search implementing BM25 in Python using NumPy, Numba and SciPy
Convert news articles, blog posts (and more) into audio podcast episodes using natural-sounding AI text-to-speech models
An extremely fast Python linter and code formatter, written in Rust.
A bridge between Lichess bots and chess engines
Curated list of datasets and tools for post-training.
Sunfish: a Python Chess Engine in 111 lines of code
A chess library for Python, with move generation and validation, PGN parsing and writing, Polyglot opening book reading, Gaviota tablebase probing, Syzygy tablebase probing, and UCI/XBoard engine c…
WHATWG-compliant and fast URL parser written in modern C++, part of Internet Archive, Node.js, Clickhouse, Redpanda, Kong, Telegram, Adguard, Datadog and Cloudflare Workers.
List of libraries, tools and APIs for web scraping and data processing.
Build a RAG dataset for your domain in just a few lines of code, using your XML sitemap
Chatmail Rust Core library, used by Android/iOS/desktop chatmail apps, bindings and bots 📧
🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.