LLM training on Huawei Ascend 910A NPU with support for Data Parallel (DDP), Pipeline Parallel (PP), and custom Tensor Parallelism (TP).
- Python 3.11
- PyTorch 2.5.1 + torch-npu 2.5.1
- transformers 4.57.x
- CANN 8.1.RC1
```bash
# Single NPU training (~30 minutes)
chmod +x run.sh && ./run.sh

# Override the model, batch size, and epoch count
./run.sh --model_name distilgpt2 --batch_size 16 --epochs 5
```
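Under the hood, single-NPU training is plain PyTorch once torch_npu is imported and the `npu` device is used. A minimal sketch of one training step, assuming distilgpt2 as in the example flags above (the actual loop lives behind run.sh):

```python
# Minimal single-NPU training step (illustrative sketch, not the run.sh internals).
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "npu:0"
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

texts = ["Ascend 910A NPUs can fine-tune small language models."] * 16   # batch_size 16
batch = tokenizer(texts, return_tensors="pt", padding=True).to(device)

loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
print(f"step loss: {loss.item():.4f}")
```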
Multi-NPU data parallel training (DDP) replicates the model across all NPUs and synchronizes gradients:

```bash
chmod +x run_ddp.sh && ./run_ddp.sh
```
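The DDP path adds a process group on top of the single-NPU setup: HCCL as the collective backend, one process per NPU. A minimal setup sketch, assuming a torchrun-style launcher provides the rank variables (not the literal contents of run_ddp.sh):

```python
# DDP setup sketch for Ascend NPU (illustrative; not the actual run_ddp.sh internals).
import os
import torch
import torch_npu  # noqa: F401
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

local_rank = int(os.environ["LOCAL_RANK"])      # set by torchrun
torch.npu.set_device(local_rank)
dist.init_process_group(backend="hccl")         # HCCL is the collective backend on Ascend

model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(f"npu:{local_rank}")
model = DDP(model, device_ids=[local_rank])     # full replica per NPU; gradients all-reduced in backward()

# ... training loop as in the single-NPU sketch, plus a DistributedSampler for the dataloader ...
dist.destroy_process_group()
```

Launched with `torchrun --nproc_per_node=<number of NPUs>`, which supplies `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` to each process.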
Pipeline parallel training (PP) splits the model's layers across NPUs (e.g., a 4-stage pipeline):

```bash
chmod +x run_pp.sh && ./run_pp.sh --pp_size 4
```
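Conceptually, the pipeline places contiguous blocks of layers on different NPUs and passes activations between stages. A deliberately naive two-stage sketch with toy layers (no micro-batching or scheduling, which the real run_pp.sh pipeline adds):

```python
# Naive two-stage layer split across NPUs (illustrative only; a real pipeline
# adds micro-batching and inter-stage communication over HCCL).
import torch
import torch_npu  # noqa: F401  (registers the "npu" device)
import torch.nn as nn

# Toy "model": eight identical blocks standing in for transformer layers.
blocks = [nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(8)]

stage0 = nn.Sequential(*blocks[:4]).to("npu:0")   # first half of the layers
stage1 = nn.Sequential(*blocks[4:]).to("npu:1")   # second half of the layers

x = torch.randn(16, 512, device="npu:0")
hidden = stage0(x)                 # forward on the first NPU
hidden = hidden.to("npu:1")        # activation transfer between stages
out = stage1(hidden)               # forward on the second NPU
print(out.shape)
```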
Tensor parallel training (TP) splits weights across NPUs using the custom npu_parallel module (Megatron-LM style):

```bash
python examples/train_tp_custom.py --model_path "/path/to/model" --tp_size 4
```

The npu_parallel module is a custom Tensor Parallelism implementation for Ascend NPU/HCCL (a simplified sketch of its layer-splitting pattern follows the component list):
- `tp_layers.py`: `ColumnParallelLinear`, `RowParallelLinear`
- `tp_attention.py`: `TPAttention`, `TPMLP` with SwiGLU
- `convert_model.py`: Auto-convert HuggingFace models to TP
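The tp_layers.py split follows the Megatron-LM pattern: a column-parallel linear shards its weight along the output dimension (each rank computes a slice of the features with no communication), while a row-parallel linear shards along the input dimension and all-reduces its partial outputs. A single-process sketch of just the math, using plain tensors rather than the repo's actual classes:

```python
# Simplified column/row-parallel math (single process, CPU tensors); the real
# ColumnParallelLinear / RowParallelLinear shard across NPUs and all-reduce via HCCL.
import torch

torch.manual_seed(0)
tp_size, d_in, d_hidden = 2, 8, 16
x = torch.randn(4, d_in)

w1 = torch.randn(d_hidden, d_in)   # up projection (column parallel: shard output features)
w2 = torch.randn(d_in, d_hidden)   # down projection (row parallel: shard input features)

# Reference: unsharded forward.
ref = (x @ w1.t()) @ w2.t()

# "Sharded" forward: each rank holds one slice of each weight.
w1_shards = w1.chunk(tp_size, dim=0)          # split along the output dimension
w2_shards = w2.chunk(tp_size, dim=1)          # split along the input dimension
partials = []
for w1_s, w2_s in zip(w1_shards, w2_shards):
    h_s = x @ w1_s.t()                        # column-parallel output slice (no comm needed)
    partials.append(h_s @ w2_s.t())           # row-parallel partial result
out = torch.stack(partials).sum(0)            # the all-reduce a real RowParallelLinear performs

print(torch.allclose(ref, out, atol=1e-5))    # True: sharded math matches the dense layers
```

With SwiGLU, the gate and up projections are both column parallel, so the elementwise activation runs locally on each shard before the row-parallel down projection and its single all-reduce.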
Supported models: Qwen/Qwen2/Qwen2.5, Llama, Mistral, Gemma, Phi, Yi, DeepSeek, Baichuan, GPT-2.
```bash
# Download models via ModelScope
python3 download_models.py
# Download with aria2c (parallel, resume-capable)
python3 download_with_aria2.py
# Fix corrupted downloads (safetensors integrity)
python3 fix_downloads.py
# Generate model manifest with PP/TP suggestions
python3 model_manifest.py
```

```bash
# Check NPU availability
python3 -c "import torch, torch_npu; print(torch.npu.is_available())"
# Run tests
python3 tests/test_tp_conversion.py
python3 tests/test_tp_mlp_only.py
python3 tests/test_hccl_ops.py
```

| Setting | Value |
|---|---|
| Model storage | /home/sd/npu_train/models |
| HTTP proxy | http://127.0.0.1:7890 |
| HF mirror | https://hf-mirror.com |
| HCCL timeout | 1200s |
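For illustration, settings like these typically end up as environment variables plus a process-group timeout; whether the repo's scripts read these exact variable names is an assumption, not something documented above:

```python
# Hypothetical wiring of the settings above (only the values come from the table;
# the variable names and helper are assumptions for illustration).
import os
from datetime import timedelta
import torch.distributed as dist
import torch_npu  # noqa: F401

MODEL_DIR = "/home/sd/npu_train/models"                  # model storage
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"      # HF mirror for hub downloads
os.environ["http_proxy"] = os.environ["https_proxy"] = "http://127.0.0.1:7890"  # HTTP proxy

def init_hccl():
    # 1200 s collective timeout, matching the "HCCL timeout" row; call this under a
    # launcher (e.g., torchrun) that supplies RANK / WORLD_SIZE / MASTER_ADDR.
    dist.init_process_group(backend="hccl", timeout=timedelta(seconds=1200))
```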