Stars
Cell2Sentence: Teaching Large Language Models the Language of Biology
A versatile toolkit for applying Logit Lens to modern large language models (LLMs). Currently supports Llama-3.1-8B and Qwen-2.5-7B, enabling layer-wise analysis of hidden states and predictions.
YSDA course in Natural Language Processing
A framework for evaluating Machine Translation models.
The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Toolkit used to collect translations from various online providers and LLMs
🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).
A benchmark with locally sourced multilingual questions for 31 languages.
Example competitions for the CodaLab project.
Examples and guides for using the Gemini API
A curated list of research papers and resources on code-switching
Quantifying Language Confusion in LLMs.
[NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts"
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Generate text images for training deep learning OCR models
Render documents on virtual paper with folds and other types of damage using Blender geometry nodes.
A Large-scale Dataset for training and evaluating a model's ability in Dense Text Image Generation
Font rendering, atlas generation and text shaping library written in C++
Cross-platform single header text rendering library for OpenGL
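[EMNLP 2025 Demo] PDF scientific paper translation with preserved formats - AI-based full-text bilingual translation of PDF documents with the layout fully preserved; supports services such as Google/DeepL/Ollama/OpenAI and is available via CLI/GUI/MCP/Docker/Zotero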