- Pittsburgh, PA
- https://yu-shi.github.io/
- https://orcid.org/0000-0001-6335-1076
Starred repositories
"Neural Prioritisation for Web Crawling", ICTIR 2025.
Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
Data Synthesis for Deep Research Based on Semi-Structured Data
[ACL 2025 Best Theme Paper] This is the official implementation for the paper: "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models"
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Fully open reproduction of DeepSeek-R1
An open-source AI agent that brings the power of Gemini directly into your terminal.
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
An autonomous agent that conducts deep research on any data using any LLM providers.
General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25]
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
"Document Quality Scoring for Web Crawling", WOWS 2025.
A feature-rich command-line audio/video downloader
Extract your SlidesLive presentation.
sail-sg / SkyLadder
Forked from jzhang38/TinyLlama
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
An example starter repo for Python projects
Data from the datasets used in the VisRAG paper (https://arxiv.org/pdf/2410.10594): the images and textual annotations used for testing.
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
[ICML 2025] Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Efficient Triton Kernels for LLM Training
🌐 Make websites accessible for AI agents. Automate tasks online with ease.
ChatLaw: A Powerful LLM Tailored for the Chinese Legal Domain. A Chinese legal large language model.
LlamaIndex is the leading framework for building LLM-powered agents over your data.