
DCLM-German: Curating a German Pretraining Dataset for Language Models

Paper · Models · Dataset

Abstract

DCLM-German is a curated high-quality German pretraining dataset constructed by applying a modified version of the DCLM-Baseline data curation pipeline from DataComp-LM to filter German data from selected Common Crawl snapshots (DCLM-Pool). Through model-based quality filtering, we derive a corpus of approximately 150 billion tokens.

We train language models with up to 1 billion parameters on DCLM-German and evaluate them on popular language model performance benchmarks. Our results demonstrate the effectiveness of the DCLM pipeline beyond English and provide insights into the relationship between task-specific data in pretraining corpora and benchmark performance.

Overview

DCLM-German processes 279.6 TiB of Common Crawl data into a curated 150B-token German corpus using global fuzzy deduplication and model-based quality filtering. We train language models with up to 1B parameters and demonstrate competitive performance on German benchmarks, excelling in particular at commonsense reasoning (HellaSwag) and science reasoning (ARC) compared to existing German datasets.

Key Contributions: Successful adaptation of the DCLM pipeline for German, comprehensive comparison of ensemble vs. standard filtering approaches, and analysis revealing how synthetic data generation with QA-specific prompts may inflate MMLU performance.

Resources: Models · ARC-Easy-DE Dataset · Paper

Performance Results

Model Performance Comparison

Performance comparison across model scales (80M, 160M, 400M, 1B parameters) on German benchmarks. DCLM-German shows consistent improvements with scale on HellaSwag and ARC benchmarks, while demonstrating competitive performance compared to Aleph-Alpha-GermanWeb synthetic dataset. The chart illustrates how different datasets perform across MMLU, HellaSwag, ARC Challenge, and ARC Easy benchmarks.

Methodology

Our pipeline adapts the DCLM-Baseline approach for German language data curation:

1. Data Source

  • Starting Point: DCLM-Pool (279.6 TiB compressed)
  • Coverage: 89 Common Crawl snapshots (2013-2023)
  • Text Extraction: Using resiliparse (see the extraction sketch after this list)
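
For orientation, here is a minimal sketch of the HTML-to-text step with resiliparse; the example HTML is illustrative, and the call shown uses default options rather than the pipeline's exact extraction configuration.

# Minimal resiliparse extraction sketch; input and options are illustrative,
# not the pipeline's exact configuration.
from resiliparse.extract.html2text import extract_plain_text

html = (
    "<html><body><h1>Hauptstädte</h1>"
    "<p>Die Hauptstadt von Deutschland ist Berlin.</p></body></html>"
)

# Extract readable plain text from the HTML record.
print(extract_plain_text(html))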

2. Heuristic Filtering

We adapted RefinedWeb's heuristic filters for German by translating banned words and stop words using DeepL, while applying the same fastText language classifier and standard filters (URL, page length, repetition, word removal ratio). This yields 3.36B German documents totaling 3.9T tokens.
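
As an illustration of the language-identification step, the sketch below uses a fastText language-ID model; the model file name and the confidence threshold are assumptions, not the pipeline's exact settings.

import fasttext

# Load a fastText language-identification model (file name is an assumption).
lid_model = fasttext.load_model("lid.176.bin")

def is_german(text: str, threshold: float = 0.65) -> bool:
    # fastText expects a single line of input text.
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__de" and probs[0] >= threshold

print(is_german("Die Katze sitzt auf der Matte."))  # expected: True
print(is_german("The cat sits on the mat."))        # expected: False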

3. Global Deduplication

Unlike DCLM-Baseline's local approach, we perform global fuzzy deduplication using MinHash (ngram size: 5, bands: 14, band size: 9) with a modified Rust implementation. This removes 62% of documents, resulting in 1.28B documents (1.5T tokens).
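
The snippet below illustrates the MinHash scheme with the stated parameters (5-gram shingles, 14 bands of 9 rows, i.e. 126 permutations) using the datasketch library; the actual pipeline uses a modified Rust implementation, so treat this as a conceptual sketch only.

from datasketch import MinHash, MinHashLSH

NUM_PERM = 14 * 9  # 14 bands x band size 9 = 126 permutations

def doc_minhash(text: str, n: int = 5) -> MinHash:
    # Build a MinHash signature over word 5-grams of the document.
    m = MinHash(num_perm=NUM_PERM)
    words = text.split()
    for i in range(max(len(words) - n + 1, 1)):
        m.update(" ".join(words[i:i + n]).encode("utf-8"))
    return m

# LSH index with explicit banding parameters (bands, rows per band).
lsh = MinHashLSH(num_perm=NUM_PERM, params=(14, 9))
docs = {
    "a": "der schnelle braune fuchs springt über den faulen hund im park",
    "b": "der schnelle braune fuchs springt über den faulen hund im park heute",
    "c": "völlig anderer inhalt ohne überlappung mit den übrigen dokumenten hier",
}
for key, text in docs.items():
    lsh.insert(key, doc_minhash(text))

print(lsh.query(doc_minhash(docs["a"])))  # near-duplicate candidates of "a"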

4. Quality Filtering

We implemented two approaches:

Standard Model-Based Filtering

We train a fastText classifier on 100k translated OpenHermes-2.5 sequences as positive examples and 100k deduplicated corpus samples as negatives, using GPT-4o-mini for translation. Selecting the top 10% documents yields DCLM-German (150B tokens).
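
A minimal sketch of this step with the fastText Python API is shown below; the training-file name, label names, and hyperparameters are placeholders, not the values used for DCLM-German.

import fasttext

# train.txt: one document per line, prefixed with a label, e.g.
#   __label__hq <translated OpenHermes-2.5 sequence>
#   __label__lq <random sample from the deduplicated corpus>
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=5, wordNgrams=2)

def quality_score(text: str) -> float:
    # Probability that the classifier assigns the "high quality" label.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

# Scoring the whole corpus and keeping everything above the 90th percentile
# of quality_score corresponds to the top-10% selection described above.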

Ensemble Filtering

This approach globally ranks documents using the worst-rank between fastText quality scores and duplicate counts. Selecting the top 10% produces DCLM-German-FTR (148B tokens).
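
The worst-rank combination can be sketched as follows; the toy inputs, the assumption that higher duplicate counts rank better, and the tie-breaking are illustrative only.

import numpy as np

# Toy inputs: per-document fastText quality scores and pre-deduplication
# duplicate counts (both assumed here to rank "higher is better").
quality_scores = np.array([0.91, 0.40, 0.75, 0.88, 0.10])
duplicate_counts = np.array([12, 3, 40, 1, 2])

def ranks_desc(values: np.ndarray) -> np.ndarray:
    # rank 0 = best (largest value)
    order = np.argsort(-values)
    r = np.empty_like(order)
    r[order] = np.arange(len(values))
    return r

# Each document's combined rank is the worse (larger) of its two ranks.
worst_rank = np.maximum(ranks_desc(quality_scores), ranks_desc(duplicate_counts))

# Keep the best 10% by combined rank (at least one document in this toy case).
k = max(1, int(0.1 * len(worst_rank)))
print(np.argsort(worst_rank)[:k])  # indices of selected documents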

5. Model Training & Evaluation

We train models at four scales (80M, 160M, 400M, 1B parameters) using the GPT-4 tokenizer optimized for German, and evaluate on German versions of MMLU, HellaSwag, ARC Challenge/Easy, and TruthfulQA.
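
For reference, the snippet below inspects how German text tokenizes under the GPT-4 vocabulary via tiktoken, assuming the standard cl100k_base encoding; the training configurations in this repository are authoritative for the exact tokenizer used.

import tiktoken

# GPT-4's cl100k_base encoding (assumed here; see training/ for the actual config).
enc = tiktoken.get_encoding("cl100k_base")

text = "Die Hauptstadt von Deutschland ist Berlin."
token_ids = enc.encode(text)

print(len(token_ids), token_ids)
print([enc.decode([t]) for t in token_ids])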

Installation & Usage

Prerequisites

# Clone the repository
git clone https://github.com/faidrapts/dclm-german.git
cd dclm-german

# Install dependencies
pip install -r requirements.txt

Running the pipeline

To run the pipeline, use the scripts in this repository together with the pipeline components from the original DataComp-LM repository.

Using Pre-trained Models

import open_lm
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "faidrap/dclm-german-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
text = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluation

Benchmarks Used

Benchmark          | Description                                                  | German Version
MMLU               | Massive Multitask Language Understanding across 57 subjects | MMMLU German subset
HellaSwag          | Commonsense reasoning with adversarial multiple choice      | German translation
ARC Challenge/Easy | Grade-school science reasoning                               | ARC-Easy-DE (our translation)
TruthfulQA         | Avoiding common misconceptions                               | German translation

Key Findings

DCLM-German consistently outperforms Aleph-Alpha-GermanWeb on HellaSwag across all model scales and shows strong performance on the ARC benchmarks, with clear benefits from scaling. While it initially underperforms on MMLU, DCLM-German improves by +9.4% after task-specific finetuning, compared to only +2.1% for Aleph-Alpha Synthetic. This indicates that synthetic data generation with QA-specific prompts may inflate MMLU scores, while DCLM-German retains better generalization capacity.

Repository Structure

dclm-german/
├── baselines/          # Filtering and quality assessment tools
├── dedup/              # Deduplication scripts and tools
├── download/           # Data download utilities
├── eval/               # Evaluation benchmarks and scripts
├── filter_refinedweb-german/  # German-specific filtering
├── training/           # Model training configurations
├── utils/              # Utility scripts
├── paper/              # Research paper and documentation
└── assets/             # Visualization and figures

Citation

If you use DCLM-German in your research, please cite our paper:

@article{patsatzi2025dclm,
  title={DCLM-German: Curating a German Pretraining Dataset for Language Models},
  author={Patsatzi, Faidra Anastasia and Heckel, Reinhard},
  journal={Technical University of Munich},
  year={2025},
  url={https://github.com/faidrapts/dclm-german}
}

References & Related Work

DataComp-LM (Li et al., 2024) provides the original DCLM framework. FineWeb (Penedo et al., 2024) demonstrates large-scale web data curation. Aleph-Alpha-GermanWeb (Burchardt et al., 2025) offers German pretraining datasets for comparison. OpenHermes-2.5 (Teknium, 2023) and the MMLU auxiliary training set (CAIS) serve as sources for our instruction-tuning data.

Contributing & License

This project is licensed under the MIT License. We welcome contributions including bug reports, improvements, new benchmarks, and extensions to other languages.

Acknowledgments: TU Dresden ZIH HPC, Reinhard Heckel, the DataComp-LM team, the Technical University of Munich, OpenAI, and HuggingFace for computational resources and platform support.


Contact: [email protected]

For questions, issues, or collaboration opportunities, please open an issue or contact the authors directly.
