DCLM-German is a curated, high-quality German pretraining dataset built by applying a modified version of the DCLM-Baseline data curation pipeline from DataComp-LM to German data from selected Common Crawl snapshots (DCLM-Pool). Model-based quality filtering yields a corpus of approximately 150 billion tokens.
We train language models with up to 1 billion parameters on DCLM-German and evaluate them on popular language model performance benchmarks. Our results demonstrate the effectiveness of the DCLM pipeline beyond English and provide insights into the relationship between task-specific data in pretraining corpora and benchmark performance.
DCLM-German processes 279.6 TiB of Common Crawl data into a curated 150B-token German corpus using global fuzzy deduplication and model-based quality filtering. We train language models with up to 1B parameters and demonstrate competitive performance on German benchmarks, excelling in particular at commonsense reasoning (HellaSwag) and science reasoning (ARC) compared to existing German datasets.
Key Contributions: Successful adaptation of the DCLM pipeline for German, comprehensive comparison of ensemble vs. standard filtering approaches, and analysis revealing how synthetic data generation with QA-specific prompts may inflate MMLU performance.
Resources: Models • ARC-Easy-DE Dataset • Paper
Performance comparison across model scales (80M, 160M, 400M, 1B parameters) on German benchmarks. DCLM-German shows consistent improvements with scale on HellaSwag and ARC, and performs competitively with the Aleph-Alpha-GermanWeb synthetic dataset. The chart illustrates how the different datasets perform on MMLU, HellaSwag, ARC Challenge, and ARC Easy.
Our pipeline adapts the DCLM-Baseline approach for German language data curation:
- Starting Point: DCLM-Pool (279.6 TiB compressed)
- Coverage: 89 Common Crawl snapshots (2013-2023)
- Text Extraction: Using resiliparse (see the sketch below)
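The extraction step is handled by the pipeline components; as a rough, hedged illustration of what resiliparse-based extraction looks like, the sketch below reads a WARC file with fastwarc and pulls main-content text with resiliparse. The input path, encoding handling, and `main_content=True` option are illustrative assumptions, not the pipeline's exact configuration.

```python
# Minimal sketch: WARC -> plain text with fastwarc + resiliparse.
# Path and extraction options are illustrative, not the pipeline's exact settings.
from fastwarc.warc import ArchiveIterator, WarcRecordType
from resiliparse.extract.html2text import extract_plain_text
from resiliparse.parse.encoding import detect_encoding, bytes_to_str

def extract_warc(path):
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream, record_types=WarcRecordType.response):
            raw = record.reader.read()
            html = bytes_to_str(raw, detect_encoding(raw))
            # main_content=True drops navigation and other boilerplate
            text = extract_plain_text(html, main_content=True)
            if text.strip():
                yield record.headers.get("WARC-Target-URI"), text

# Hypothetical local snapshot file:
# for url, text in extract_warc("CC-MAIN-2023-50-segment.warc.gz"):
#     print(url, text[:80])
```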
We adapted RefinedWeb's heuristic filters for German by translating banned words and stop words using DeepL, while applying the same fastText language classifier and standard filters (URL, page length, repetition, word removal ratio). This yields 3.36B German documents totaling 3.9T tokens.
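As a hedged sketch of the language-identification part of these filters, the snippet below keeps documents whose top prediction from fastText's public lid.176.bin model is German; the 0.65 confidence threshold and the minimum page length are example values, not the exact RefinedWeb settings used in this pipeline.

```python
# Illustrative German language + page-length filter (fastText lid.176.bin).
# The threshold and minimum word count are example values, not the pipeline's settings.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # public language-ID model from fastText

def keep_document(text, min_words=50, lang_threshold=0.65):
    words = text.split()
    if len(words) < min_words:                     # page-length heuristic
        return False
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__de" and probs[0] >= lang_threshold

print(keep_document("Die Hauptstadt von Deutschland ist Berlin. " * 20))  # True
```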
Unlike DCLM-Baseline's local approach, we perform global fuzzy deduplication using MinHash (ngram size: 5, bands: 14, band size: 9) with a modified Rust implementation. This removes 62% of documents, resulting in 1.28B documents (1.5T tokens).
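The actual deduplication runs on a modified Rust implementation; purely to illustrate the banding scheme (14 bands of size 9, i.e. 126 permutations, over word 5-grams), here is a small Python equivalent built on the datasketch library.

```python
# Illustrative MinHash near-duplicate removal with datasketch
# (the real pipeline uses a modified Rust implementation).
from datasketch import MinHash, MinHashLSH

BANDS, BAND_SIZE = 14, 9
NUM_PERM = BANDS * BAND_SIZE  # 126 permutations

def word_ngrams(text, n=5):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def signature(text):
    m = MinHash(num_perm=NUM_PERM)
    for gram in word_ngrams(text):
        m.update(gram.encode("utf-8"))
    return m

lsh = MinHashLSH(num_perm=NUM_PERM, params=(BANDS, BAND_SIZE))

corpus = [  # hypothetical (id, text) pairs
    ("doc-1", "Berlin ist die Hauptstadt von Deutschland und zugleich ein eigenes Bundesland mit rund 3,7 Millionen Einwohnern."),
    ("doc-2", "Berlin ist die Hauptstadt von Deutschland und zugleich ein eigenes Bundesland mit rund 3,7 Millionen Einwohnern."),
    ("doc-3", "Die Donau fließt durch zehn Länder und ist damit einer der internationalsten Flüsse der Welt."),
]

kept = []
for doc_id, text in corpus:
    sig = signature(text)
    if not lsh.query(sig):   # no near-duplicate indexed yet -> keep this document
        lsh.insert(doc_id, sig)
        kept.append(doc_id)

print(kept)  # ['doc-1', 'doc-3'] – doc-2 is an exact duplicate of doc-1 and is dropped
```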
We implemented two model-based quality-filtering approaches:
Standard filtering: We train a fastText classifier on 100k OpenHermes-2.5 sequences (translated to German with GPT-4o-mini) as positive examples and 100k samples from the deduplicated corpus as negatives. Selecting the top 10% of documents by classifier score yields DCLM-German (150B tokens).
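A hedged sketch of such a classifier with the fastText supervised API is shown below; the label names, hyperparameters, example documents, and training-file layout are illustrative assumptions rather than the exact setup used to build DCLM-German.

```python
# Illustrative quality classifier: translated OpenHermes-2.5 sequences as positives,
# random deduplicated-corpus samples as negatives. Labels and hyperparameters are assumptions.
import numpy as np
import fasttext

# quality_train.txt holds one document per line, e.g.
#   __label__hq <translated OpenHermes-2.5 sequence>
#   __label__cc <random sample from the deduplicated corpus>
clf = fasttext.train_supervised("quality_train.txt", epoch=5, wordNgrams=2)

def quality_score(text):
    """Probability that a document resembles the high-quality (instruction-style) class."""
    labels, probs = clf.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

# Keep the top 10% of documents by classifier score (toy example corpus).
documents = [
    "Erkläre in zwei Sätzen, warum der Himmel blau erscheint.",
    "Jetzt kaufen!!! Klicken Sie hier für die besten Angebote!!!",
]
scores = np.array([quality_score(d) for d in documents])
cutoff = np.quantile(scores, 0.90)
kept = [d for d, s in zip(documents, scores) if s >= cutoff]
```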
Ensemble filtering: This approach ranks documents globally by the worst rank between their fastText quality score and their duplicate count. Selecting the top 10% produces DCLM-German-FTR (148B tokens).
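For illustration only, the sketch below combines the two signals with a worst-rank rule, assuming (an assumption of this sketch, not a statement of the exact convention) that a higher fastText score and a higher duplicate count both count as better; a document is then only as good as its weaker rank.

```python
# Illustrative worst-rank ensemble over two per-document signals.
# Assumption of this sketch: higher fastText score and higher duplicate count are
# both treated as better; see the paper for the exact convention used.
import numpy as np

rng = np.random.default_rng(0)
n_docs = 1_000
fasttext_score = rng.random(n_docs)               # stand-in quality scores
duplicate_count = rng.integers(1, 50, n_docs)     # stand-in duplicate counts

def rank_desc(values):
    """Rank 0 = best (largest value)."""
    order = np.argsort(-values)
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(len(values))
    return ranks

# A document is only as good as its worse (larger) rank across the two signals.
worst_rank = np.maximum(rank_desc(fasttext_score), rank_desc(duplicate_count))

# Keep the top 10% of documents under this combined ranking.
keep_idx = np.argsort(worst_rank)[: n_docs // 10]
print(len(keep_idx), "documents selected")
```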
We train models at four scales (80M, 160M, 400M, 1B parameters) using the GPT-4 tokenizer optimized for German, and evaluate on German versions of MMLU, HellaSwag, ARC Challenge/Easy, and TruthfulQA.
# Clone the repository
git clone https://github.com/faidrapts/dclm-german.git
cd dclm-german
# Install dependencies
pip install -r requirements.txt

To run the pipeline, the scripts in this repository should be used in combination with the pipeline components from the original DataComp-LM repository.
import open_lm
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model_name = "faidrap/dclm-german-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Generate text
text = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

| Benchmark | Description | German Version |
|---|---|---|
| MMLU | Massive Multitask Language Understanding across 57 subjects | MMMLU German subset |
| HellaSwag | Commonsense reasoning with adversarial multiple choice | German translation |
| ARC Challenge/Easy | Grade-school science reasoning | ARC-Easy-DE (our translation) |
| TruthfulQA | Avoiding common misconceptions | German translation |
DCLM-German consistently outperforms Aleph-Alpha-GermanWeb on HellaSwag across all model scales and shows strong performance on the ARC benchmarks, with clear gains from scale. Although it initially underperforms on MMLU, DCLM-German improves by +9.4% after task-specific finetuning, compared to only +2.1% for Aleph-Alpha Synthetic. This suggests that synthetic data generation with QA-specific prompts may inflate MMLU scores, while DCLM-German generalizes better.
dclm-german/
├── baselines/ # Filtering and quality assessment tools
├── dedup/ # Deduplication scripts and tools
├── download/ # Data download utilities
├── eval/ # Evaluation benchmarks and scripts
├── filter_refinedweb-german/ # German-specific filtering
├── training/ # Model training configurations
├── utils/ # Utility scripts
├── paper/ # Research paper and documentation
└── assets/ # Visualization and figures
If you use DCLM-German in your research, please cite our paper:
@article{patsatzi2025dclm,
  title={DCLM-German: Curating a German Pretraining Dataset for Language Models},
  author={Patsatzi, Faidra Anastasia and Heckel, Reinhard},
  journal={Technical University of Munich},
  year={2025},
  url={https://github.com/faidrapts/dclm-german}
}

DataComp-LM (Li et al., 2024) provides the original DCLM framework. FineWeb (Penedo et al., 2024) demonstrates large-scale web data curation. Aleph-Alpha-GermanWeb (Burchardt et al., 2025) provides the German pretraining datasets we compare against. OpenHermes-2.5 (Teknium, 2023) and the MMLU auxiliary training set (CAIS) serve as our instruction-tuning data sources.
This project is licensed under the MIT License. We welcome contributions including bug reports, improvements, new benchmarks, and extensions to other languages.
Acknowledgments: We thank the TU Dresden ZIH HPC, Reinhard Heckel, the DataComp-LM team, the Technical University of Munich, OpenAI, and Hugging Face for computational resources and platform support.
Contact: [email protected]
For questions, issues, or collaboration opportunities, please open an issue or contact the authors directly.