[ACL 2025 Best Theme Paper] This is the official implementation for the paper: "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models"

Python 189 14 Updated Aug 29, 2025

openai / gpt-oss

gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI

Python 19,765 2,035 Updated Jan 13, 2026

Alibaba-NLP / DeepResearch

Tongyi Deep Research, the Leading Open-source Deep Research Agent

Python 18,229 1,403 Updated Feb 7, 2026

ryoungj / BoLT

Code for "Reasoning to Learn from Latent Thoughts"

Python 124 4 Updated Mar 28, 2025

huggingface / open-r1

Fully open reproduction of DeepSeek-R1

Python 25,878 2,410 Updated Nov 24, 2025

google-gemini / gemini-cli

An open-source AI agent that brings the power of Gemini directly into your terminal.

TypeScript 94,432 11,151 Updated Feb 13, 2026

GAIR-NLP / OctoThinker

Revisiting Mid-training in the Era of Reinforcement Learning Scaling

Jupyter Notebook 182 14 Updated Jul 23, 2025

allenai / olmes

Reproducible, flexible LLM evaluations

Python 337 72 Updated Jan 28, 2026

assafelovic / gpt-researcher

An autonomous agent that conducts deep research on any data using any LLM provider.

Python 25,301 3,357 Updated Feb 1, 2026

CodeCreator / datatools

Common tools for data processing

Python 22 3 Updated Dec 8, 2025

TIGER-AI-Lab / General-Reasoner

General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25]

Python 219 12 Updated Nov 27, 2025

CodeCreator / WebOrganizer

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

Jupyter Notebook 77 6 Updated May 2, 2025

fpezzuti / quality_crawling

"Document Quality Scoring for Web Crawling", WOWS 2025.

Python 3 Updated Jul 7, 2025

MCG-NKU / NSFC-LaTex

BibTeX Style 1,457 354 Updated Jan 22, 2026

yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader

Python 146,978 11,910 Updated Feb 12, 2026

So-Cool / myslideslive

Extract your SlidesLive presentation.

Jupyter Notebook 15 4 Updated Apr 19, 2024

sail-sg / SkyLadder

Forked from jzhang38/TinyLlama

The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Python 42 Updated Dec 29, 2025

neubig / starter-repo

An example starter repo for Python projects

Python 311 54 Updated Jun 16, 2025

cagey-squirrel / VisRAG-datasets

This repo contains data from the datasets used in the VisRAG paper (https://arxiv.org/pdf/2410.10594). The data includes the images and textual annotations used for testing in the paper.

4 Updated Mar 7, 2025