l1351868270/ld_rl

PARL (Parallel-Agent Reinforcement Learning)

Kimi K2.5: Visual Agentic Intelligence

Scaling LLM Training: How Parallel-Agent Reinforcement Learning Changes the Game

Gradient explosion

Logit Dynamics in Softmax Policy Gradient Methods

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

GradLoc: Localizing Gradient Spikes to a Single Anomalous Token

Entropy collapse

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Training-inference mismatch

Your Efficient RL Framework Secretly Brings You Off-Policy RL Training

Small Leak Can Sink a Great Ship—Boost RL Training on MoE with 𝑰𝒄𝒆𝑷𝒐𝒑!

When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

Defeating the Training-Inference Mismatch via FP16

Defeating Nondeterminism in LLM Inference

Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers

The Optimal Token Baseline

KAT-Coder-V1 Pro Major Upgrade: Revealing the Key Factors Behind Reinforcement Learning Training Stability

Analyzing the Complexity of Agentic Multi-Turn Training from the Tokenizer Perspective

Training-Inference Parity in MoE Models: Where Numerics Drift

Stochastic CHAOS: Why Deterministic Inference Kills, and Distributional Variability Is the Heartbeat of Artificial Cognition

Unbiased estimation of the KL gradient

Approximating KL Divergence

A Comedy of Estimators: On KL Regularization in RL Training of LLMs

On a few pitfalls in KL divergence gradient estimation for RL

The Policy Gradient, Bias, and Variance of OPD

Revisiting On-Policy Distillation (OPD): Three Typical Failure Modes and How to Fix Them

Optimization

Modular Manifolds

Last Iterate of SGD Converges (Even in Unbounded Domains)

FP8 training

DeepSeek-V3 Technical Report

Optimizing Large Language Model Training Using FP4 Quantization

NVFP4 Pretraining: From Theory to Implementation (Part 1)

NVFP4 Pretraining: Systems Optimizations (Part 2)

SGLang RL x slime: End-to-End QAT INT4 Implementation

GPUs Go Brrr

Getting Memory-bound Kernels to Speed-of-Light

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

I spent 31 hours on the math behind TurboQuant so you don't have to

Manifolds

SpinQuant: LLM quantization with learned rotations

mHC: Manifold-Constrained Hyper-Connections

Back to Basics: Let Denoising Generative Models Denoise

Grokking

Do Machine Learning Models Memorize or Generalize?

Policy gradient

Group Sequence Policy Optimization

Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

The 37 Implementation Details of Proximal Policy Optimization

LoRA RL

LoRA Without Regret

How We Build Trillion Parameter Reasoning RL with 10% GPUs

Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA

Operators and kernels

Learn CUTLASS the hard way!

Learn CUTLASS the hard way - part 2!

CUDA Coalesced Memory Access

Getting Memory-bound Kernels to Speed-of-Light

Local Memory and Register Spilling

CuTeDSL at Perplexity

misc

Heaps do lie: debugging a memory leak in vLLM

Distributed training

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Pipeline Parallelism in SGLang: Scaling to Million-Token Contexts and Beyond

cudagraph

Accelerating PyTorch with CUDA Graphs

Getting Started with CUDA Graphs

Effortless CUDA Graphs
