Organizations

@thunlp

Starred repositories

"Neural Prioritisation for Web Crawling", ICTIR 2025.

Jupyter Notebook 2 Updated Jul 20, 2025

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).

Python 1,972 212 Updated Dec 29, 2025

Create matplotlib plots with broken axes

Python 565 48 Updated Aug 28, 2025

Data Synthesis for Deep Research Based on Semi-Structured Data

Python 198 13 Updated Dec 18, 2025

[ACL 2025 Best Theme Paper] This is the official implementation for the paper: "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models"

Python 189 14 Updated Aug 29, 2025

gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI

Python 19,765 2,035 Updated Jan 13, 2026

Tongyi Deep Research, the Leading Open-source Deep Research Agent

Python 18,229 1,403 Updated Feb 7, 2026

Code for "Reasoning to Learn from Latent Thoughts"

Python 124 4 Updated Mar 28, 2025

Fully open reproduction of DeepSeek-R1

Python 25,878 2,410 Updated Nov 24, 2025

An open-source AI agent that brings the power of Gemini directly into your terminal.

TypeScript 94,432 11,151 Updated Feb 13, 2026

Revisiting Mid-training in the Era of Reinforcement Learning Scaling

Jupyter Notebook 182 14 Updated Jul 23, 2025

Reproducible, flexible LLM evaluations

Python 337 72 Updated Jan 28, 2026

An autonomous agent that conducts deep research on any data using any LLM providers.

Python 25,301 3,357 Updated Feb 1, 2026

Common tools for data processing

Python 22 3 Updated Dec 8, 2025

General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25]

Python 219 12 Updated Nov 27, 2025

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

Jupyter Notebook 77 6 Updated May 2, 2025

"Document Quality Scoring for Web Crawling", WOWS 2025.

Python 3 Updated Jul 7, 2025

BibTeX Style 1,457 354 Updated Jan 22, 2026

A feature-rich command-line audio/video downloader

Python 146,978 11,910 Updated Feb 12, 2026

Extract your SlidesLive presentation.

Jupyter Notebook 15 4 Updated Apr 19, 2024

The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Python 42 Updated Dec 29, 2025

An example starter repo for Python projects

Python 311 54 Updated Jun 16, 2025

This repo contains data from datasets used in the VisRAG paper (https://arxiv.org/pdf/2410.10594). Data provided contains images and textual annotations used for testing in the VisRAG paper.

4 Updated Mar 7, 2025

Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]

Python 149 10 Updated Oct 27, 2024

[ICML 2025] Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Python 62 9 Updated Mar 4, 2025

Efficient Triton Kernels for LLM Training

Python 6,141 488 Updated Feb 13, 2026

🌐 Make websites accessible for AI agents. Automate tasks online with ease.

Python 78,290 9,269 Updated Feb 12, 2026

ChatLaw: A Powerful LLM Tailored for the Chinese Legal Domain (a Chinese legal large language model)

7,425 601 Updated Jan 4, 2025

LlamaIndex is the leading framework for building LLM-powered agents over your data.

Python 46,978 6,816 Updated Feb 13, 2026