Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Distribution-Aligned Sequence Distillation

Distribution-Aligned Sequence Distillation (DASD) is an efficient reasoning distillation pipeline. It combines temperature-scheduled learning, divergence-aware sampling, and mixed-policy distillation to achieve state-of-the-art results among models of comparable size in mathematical reasoning, code generation, and scientific problem-solving, even with limited training data. Two model variants are available:

DASD-4B-Thinking: A lightweight reasoning model that excels on mathematical, scientific, and coding benchmarks. Trained with a data-efficient multi-stage recipe, it outperforms several larger models (e.g., 32B-scale) on key benchmarks, including AIME24, AIME25, LiveCodeBench, and GPQA-Diamond.
DASD-30B-A3B-Thinking-Preview: A variant using a Mixture-of-Experts (MoE) architecture that scales up model capacity while maintaining efficiency through sparse expert routing. Notably, due to time constraints, we trained this model using only the Stage-1 data. Despite this limitation, it outperforms similar models in efficiency-quality trade-offs, showcasing significant improvements across multiple benchmarks like AIME25 and LiveCodeBench.

📣 News

[2026-1-26] 🎉 Our paper, "Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation," which details the motivation and design of Divergence-aware Sampling, has been accepted to ICLR 2026!
[2026-1-20] 🔥 Our SFT Dataset Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b currently ranks #1 on Hugging Face’s datasets trending leaderboard and #5 on the overall trending leaderboard.

[2026-1-16] 🔥 Our Technical Report ranks fifth in the Hugging Face Daily Papers.
[2026-1-15] 🎉 Our Technical Report for Distribution-Aligned Sequence Distillation is now available on arXiv.
[2026-1-14] 🎉 Our Technical Report for Distribution-Aligned Sequence Distillation is now available on GitHub.
[2026-1-14] 🔥 SFT Dataset is released! We have released the stage1 and stage2 datasets on Hugging Face and ModelScope.
[2026-1-14] 🔥 DASD-30B-A3B-Thinking-Preview Model is released! We have released the model checkpoint on Hugging Face and ModelScope.
[2026-1-14] 🔥 DASD-4B-Thinking Model is released! We have released the model checkpoint on Hugging Face and ModelScope.

📥 Model Zoo

Model	Hugging Face	ModelScope
DASD-4B-Thinking
DASD-30B-A3B-Thinking-Preview
DASD-30B-A3B-Thinking	To be released	To be released

📈 Performance

4B Dense Model Performance

Model	Data	AIME24	AIME25	LCB v5	LCB v6	GPQA-D
Qwen3-4B-Thinking-2507	❌	-	81.3	-	55.2	65.8
Qwen3-14B	❌	79.3	70.4	63.5	-	64.0
Qwen3-32B	❌	81.4	72.9	65.7	-	68.4
DeepSeek-R1-0528-Qwen3-8B	❌	86.0	76.3	60.5	-	61.1
GLM-Z1-32B-0414	❌	80.8	63.6	59.1	-	66.1
GLM-Z1-9B-0414	❌	76.4	56.6	51.8	-	58.5
Mistral3-3B	❌	-	72.1	54.8	-	53.4
Mistral3-8B	❌	-	78.7	61.6	-	66.8
AM-thinking-v1	✅	85.3	74.4	70.3	-	-
POLARIS-4B-Preview	✅	81.2	79.4	-	-	-
OpenThoughts3-7B	✅	69.0	53.3	51.7	-	53.7
Pai-DistillQwen-ThoughtY-4B	✅	76.7	-	-	-	56.1
Pai-DistillQwen-ThoughtY-8B	✅	76.7	-	-	-	62.1
NVIDIA-OpenReasoning-Nemotron-7B	✅	84.7	78.2	63.9	-	61.4
NVIDIA-Nemotron-Ultra-253B	✅	80.8	72.5	68.1	-	76.0
DASD-4B-Thinking (Ours)	✅	88.5	83.3	69.3	67.5	68.4

30B MoE Model Performance

DASD-30B-A3B-Thinking-Preview, trained on Stage 1 data only, achieves the highest average score among the compared models, which indicates strong data efficiency.

Model	AIME25	LCB v6	GPQA-D	Average
gpt-oss-20b	91.7	61.0	71.5	74.7
Qwen3-30B-A3B-Thinking-2507	85.0	66.0	73.4	74.8
NVIDIA-Nemotron-3-Nano-30B-A3B	89.1	68.3	73.0	76.8
DASD-30B-A3B-Thinking-Preview (Ours)	86.7	72.8	72.3	77.3
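The Average column is consistent with an unweighted mean of the three benchmark scores, rounded to one decimal place. The short check below recomputes it from the table values; the variable and function names are ours and only serve the illustration.

```python
# Scores copied from the table above, in the order (AIME25, LCB v6, GPQA-D).
scores = {
    "gpt-oss-20b": (91.7, 61.0, 71.5),
    "Qwen3-30B-A3B-Thinking-2507": (85.0, 66.0, 73.4),
    "NVIDIA-Nemotron-3-Nano-30B-A3B": (89.1, 68.3, 73.0),
    "DASD-30B-A3B-Thinking-Preview": (86.7, 72.8, 72.3),
}


def average(values):
    """Unweighted mean, rounded to one decimal place as in the table."""
    return round(sum(values) / len(values), 1)


averages = {name: average(vals) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```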

🧩 Why DASD Matters

DASD presents a systematic reexamination of sequence-level distillation and introduces a novel distribution-aligned sequence distillation paradigm. Using only 448K training samples, this approach substantially improves the performance of a 4B-parameter dense model on mathematical, code, and scientific reasoning tasks, achieving state-of-the-art results among models of comparable size and even surpassing some larger models. It also provides an open, data-efficient resource for transferring long-form reasoning capabilities into a student model while maintaining robust performance across multiple reasoning benchmarks.

1. Open and Data-Efficient Resource

Open-Source Model: We release DASD-4B-Thinking, a compact yet capable 4B dense language model specialized in long chain-of-thought (Long-CoT) reasoning across mathematics, code generation, and scientific reasoning.
Open-Source Data: We release the Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b, allowing the community to reproduce our off-policy temperature-scheduled pipeline:
- 105K Low-Temperature responses for stability (Stage 1).
- 330K High-Temperature responses for diversity (Stage 2).
Proven Scalability: The exact same data recipe generalizes effectively to larger architectures, as demonstrated by our DASD-30B-A3B-Thinking-Preview (MoE), which achieves competitive performance without extra RL.
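To make the stability/diversity distinction between the two stages concrete, here is a minimal sketch of how sampling temperature reshapes a next-token distribution. It is an illustration only, not the data-generation code used for the dataset: the logits and the temperature values 0.5 and 1.5 are hypothetical, since the actual sampling temperatures are not stated here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits from a teacher model.
toy_logits = [2.0, 1.0, 0.5, 0.0]

low_temp = softmax_with_temperature(toy_logits, 0.5)   # assumed "low" temperature
high_temp = softmax_with_temperature(toy_logits, 1.5)  # assumed "high" temperature

print("low  T:", [round(p, 3) for p in low_temp], "entropy", round(entropy(low_temp), 3))
print("high T:", [round(p, 3) for p in high_temp], "entropy", round(entropy(high_temp), 3))
```

At the lower temperature most of the probability mass sits on the top token, so sampled responses stay close to the teacher's most likely trajectory. At the higher temperature the mass spreads out, so sampled responses cover more of the teacher's alternative reasoning paths.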

2. Distribution-Aligned Sequence Distillation

Many distillation pipelines optimize only for answer correctness or rely on heuristic filtering, which can collapse reasoning diversity and amplify sampling bias. Apsara-Reason instead treats reasoning distillation as a sequence-level distribution-matching problem and optimizes the student to better match the teacher’s behavior across the full trajectory.

The pipeline integrates three complementary mechanisms:

Temperature-scheduled Learning: combines low-temperature trajectories (stability) with higher-temperature trajectories (diversity and exploration), improving both learning stability and coverage of the teacher's modes.
Divergence-aware Sampling: selects sequences based on teacher–student distributional mismatch at the sentence level, yielding a target sequence-level distribution that better supports effective learning.
Mixed-policy Distillation: introduces a lightweight on-policy revision protocol after off-policy SFT, using student-generated prefixes with teacher rewrites to improve robustness and mitigate exposure bias in the distilled model.

Together, these components form a coherent, distribution-aware sequence distillation pipeline that improves the fidelity of reasoning transfer while remaining data-efficient, making strong long-reasoning behavior feasible in a 4B dense model.
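As a rough illustration of sentence-level mismatch scoring, the sketch below ranks candidate teacher responses by how often the teacher assigns a sentence a noticeably higher log-probability than the student does. This is not the selection rule from the technical report: the `Candidate` structure, the `margin` threshold, the scoring function, and the choice to prefer high-mismatch sequences are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A teacher response with per-sentence mean token log-probabilities (hypothetical layout)."""
    text: str
    teacher_logprobs: List[float]  # one value per sentence, scored by the teacher
    student_logprobs: List[float]  # the same sentences, scored by the student


def mismatch_score(candidate: Candidate, margin: float = 0.5) -> float:
    """Fraction of sentences whose teacher log-prob exceeds the student's by more than `margin`."""
    if len(candidate.teacher_logprobs) != len(candidate.student_logprobs):
        raise ValueError("teacher and student must score the same sentences")
    if not candidate.teacher_logprobs:
        return 0.0
    gaps = [t - s for t, s in zip(candidate.teacher_logprobs, candidate.student_logprobs)]
    return sum(gap > margin for gap in gaps) / len(gaps)


def select_top_k(candidates: List[Candidate], k: int, margin: float = 0.5) -> List[Candidate]:
    """Keep the k candidates with the highest mismatch score (assumed selection direction)."""
    ranked = sorted(candidates, key=lambda c: mismatch_score(c, margin), reverse=True)
    return ranked[:k]


# Toy log-probabilities, invented for the example.
pool = [
    Candidate("A", [-0.2, -0.3, -0.1], [-1.5, -0.4, -1.0]),
    Candidate("B", [-0.2, -0.2], [-0.3, -0.25]),
    Candidate("C", [-0.1], [-2.0]),
]

selected = select_top_k(pool, k=2)
print([c.text for c in selected])
```

In practice the per-sentence log-probabilities would come from scoring each response with both models, and the aggregation and threshold would need to follow the criterion described in the technical report.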

⚡ Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Alibaba-Apsara/DASD-4B-Thinking"

# Load the tokenizer and model; dtype and device placement are selected automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
# Keep the system prompt: it was used during all training stages.
messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Allow a large generation budget for long chain-of-thought outputs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=81920,
)

# Drop the prompt tokens so that only the newly generated text is decoded.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)

Note: We include the system prompt, as it was used during all training stages. To ensure consistent output quality, we recommend including the same system prompt during actual usage; otherwise, the model's responses may be significantly affected.

For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.5 to create an OpenAI-compatible API endpoint:

SGLang:

python -m sglang.launch_server --model-path Alibaba-Apsara/DASD-4B-Thinking --context-length 262144

vLLM:

vllm serve Alibaba-Apsara/DASD-4B-Thinking --max-model-len 262144
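Once a server is running, it can be queried through the OpenAI-compatible chat completions route. The following is a minimal sketch using only the Python standard library. The base URL is an assumption: `http://localhost:8000/v1` matches vLLM's usual default port, while SGLang commonly listens on port 30000, so adjust it to your launch settings. The helper names are ours.

```python
import json
import urllib.request

# Same system prompt as in the Quick Start example.
SYSTEM_PROMPT = "You are a helpful assistant."


def build_chat_request(prompt, model="Alibaba-Apsara/DASD-4B-Thinking", max_tokens=81920):
    """Build the JSON payload for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to a running server and return the generated text.

    base_url is an assumption; change the host and port to match your server.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("How many primes are there below 20?")
print(json.dumps(payload, indent=2))
# With a server running, uncomment the next line to get a completion:
# print(send_chat_request(payload))
```

The payload keeps the system prompt so that served responses match the conditions the model was trained under.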

Training

We use the LLaMA-Factory framework for training, which provides a convenient and efficient training pipeline.

Training Setup

Environment Setup
- Set up LLaMA-Factory following their official documentation.
- Set up Liger-Kernel following their official documentation.
- Ensure all dependencies are properly installed and configured.
Data Preparation
- Obtain the Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b dataset from 🤗 Hugging Face or 🤖 ModelScope.
- Format the dataset according to LLaMA-Factory's data preparation guidelines.
Configuration
- Use our provided configuration file.
- The config file contains all necessary hyperparameters and training settings.
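For the data preparation step, the sketch below converts records into LLaMA-Factory's alpaca-style layout and builds a matching `dataset_info.json` entry. Treat it as a starting point under stated assumptions: the input field names `question` and `answer`, the dataset name `dasd_stage1`, and the output file name are hypothetical, and the column mapping should be checked against LLaMA-Factory's data preparation guide and the dataset card before use.

```python
import json

# Same system prompt as in the Quick Start example.
SYSTEM_PROMPT = "You are a helpful assistant."


def to_alpaca(record, question_key="question", answer_key="answer"):
    """Map one raw record to an alpaca-style example (field names are assumed)."""
    return {
        "instruction": record[question_key],
        "input": "",
        "output": record[answer_key],
        "system": SYSTEM_PROMPT,
    }


def build_dataset_info(name, file_name):
    """Build a dataset_info.json entry that points LLaMA-Factory at the converted file."""
    return {
        name: {
            "file_name": file_name,
            "columns": {
                "prompt": "instruction",
                "query": "input",
                "response": "output",
                "system": "system",
            },
        }
    }


# Toy record standing in for a real dataset row.
raw_records = [{"question": "What is 2 + 3?", "answer": "2 + 3 = 5. The answer is 5."}]

converted = [to_alpaca(record) for record in raw_records]
dataset_info = build_dataset_info("dasd_stage1", "dasd_stage1.json")

print(json.dumps(converted, indent=2, ensure_ascii=False))
print(json.dumps(dataset_info, indent=2))
```

Writing `converted` to the file named in the entry and merging `dataset_info` into LLaMA-Factory's `dataset_info.json` would make the data selectable by name in a training config.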

📜 Citation

If you find our work useful in your research, please consider citing:

@article{yan2026dasd,
  title={Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning},
  author={Yan, Shaotian and Liu, Kaiyuan and Shen, Chen and Wang, Bing and Fan, Sinan and Zhang, Jun and Wu, Yue and Wang, Zheng and Ye, Jieping},
  year={2026},
  journal={arXiv preprint arXiv:2601.09088},
  url={https://arxiv.org/abs/2601.09088}
}

@article{liu2025where,
  title={Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation},
  author={Liu, Kaiyuan and Yan, Shaotian and Miao, Rui and Wang, Bing and Shen, Chen and Zhang, Jun and Ye, Jieping},
  journal={arXiv preprint arXiv:2512.20908},
  year={2025}
}
