
DCLM-German: Curating a German Pretraining Dataset for Language Models

Paper · Models · Dataset

Abstract

DCLM-German is a curated high-quality German pretraining dataset constructed by applying a modified version of the DCLM-Baseline data curation pipeline from DataComp-LM to filter German data from selected Common Crawl snapshots (DCLM-Pool). Through model-based quality filtering, we derive a corpus of approximately 150 billion tokens.

We train language models with up to 1 billion parameters on DCLM-German and evaluate them on popular language model performance benchmarks. Our results demonstrate the effectiveness of the DCLM pipeline beyond English and provide insights into the relationship between task-specific data in pretraining corpora and benchmark performance.

Overview

DCLM-German processes 279.6 TiB of Common Crawl data into a curated 150B-token German corpus using global fuzzy deduplication and model-based quality filtering. We train language models with up to 1B parameters and demonstrate competitive performance on German benchmarks, excelling in particular at commonsense reasoning (HellaSwag) and science reasoning (ARC) compared to existing German datasets.

Key Contributions: Successful adaptation of the DCLM pipeline for German, comprehensive comparison of ensemble vs. standard filtering approaches, and analysis revealing how synthetic data generation with QA-specific prompts may inflate MMLU performance.

Resources: Models · ARC-Easy-DE Dataset · Paper

Performance Results

Model Performance Comparison

Performance comparison across model scales (80M, 160M, 400M, 1B parameters) on German benchmarks. DCLM-German shows consistent improvements with scale on HellaSwag and ARC benchmarks, while demonstrating competitive performance compared to Aleph-Alpha-GermanWeb synthetic dataset. The chart illustrates how different datasets perform across MMLU, HellaSwag, ARC Challenge, and ARC Easy benchmarks.

Methodology

Our pipeline adapts the DCLM-Baseline approach for German language data curation:

1. Data Source

  • Starting Point: DCLM-Pool (279.6 TiB compressed)
  • Coverage: 89 Common Crawl snapshots (2013-2023)
  • Text Extraction: Using resiliparse (see the extraction sketch after this list)
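
For orientation, here is a minimal sketch of the HTML-to-text step with resiliparse; the example HTML is illustrative, and the call shown uses default options rather than the pipeline's exact extraction configuration.

# Minimal resiliparse extraction sketch; input and options are illustrative,
# not the pipeline's exact configuration.
from resiliparse.extract.html2text import extract_plain_text

html = (
    "<html><body><h1>Hauptstädte</h1>"
    "<p>Die Hauptstadt von Deutschland ist Berlin.</p></body></html>"
)

# Extract readable plain text from the HTML record.
print(extract_plain_text(html))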

2. Heuristic Filtering

We adapted RefinedWeb's heuristic filters for German by translating banned words and stop words using DeepL, while applying the same fastText language classifier and standard filters (URL, page length, repetition, word removal ratio). This yields 3.36B German documents totaling 3.9T tokens.
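
As an illustration of the language-identification step, the sketch below uses a fastText language-ID model; the model file name and the confidence threshold are assumptions, not the pipeline's exact settings.

import fasttext

# Load a fastText language-identification model (file name is an assumption).
lid_model = fasttext.load_model("lid.176.bin")

def is_german(text: str, threshold: float = 0.65) -> bool:
    # fastText expects a single line of input text.
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__de" and probs[0] >= threshold

print(is_german("Die Katze sitzt auf der Matte."))  # expected: True
print(is_german("The cat sits on the mat."))        # expected: False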

3. Global Deduplication

Unlike DCLM-Baseline's local approach, we perform global fuzzy deduplication using MinHash (ngram size: 5, bands: 14, band size: 9) with a modified Rust implementation. This removes 62% of documents, resulting in 1.28B documents (1.5T tokens).
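
The snippet below illustrates the MinHash scheme with the stated parameters (5-gram shingles, 14 bands of 9 rows, i.e. 126 permutations) using the datasketch library; the actual pipeline uses a modified Rust implementation, so treat this as a conceptual sketch only.

from datasketch import MinHash, MinHashLSH

NUM_PERM = 14 * 9  # 14 bands x band size 9 = 126 permutations

def doc_minhash(text: str, n: int = 5) -> MinHash:
    # Build a MinHash signature over word 5-grams of the document.
    m = MinHash(num_perm=NUM_PERM)
    words = text.split()
    for i in range(max(len(words) - n + 1, 1)):
        m.update(" ".join(words[i:i + n]).encode("utf-8"))
    return m

# LSH index with explicit banding parameters (bands, rows per band).
lsh = MinHashLSH(num_perm=NUM_PERM, params=(14, 9))
docs = {
    "a": "der schnelle braune fuchs springt über den faulen hund im park",
    "b": "der schnelle braune fuchs springt über den faulen hund im park heute",
    "c": "völlig anderer inhalt ohne überlappung mit den übrigen dokumenten hier",
}
for key, text in docs.items():
    lsh.insert(key, doc_minhash(text))

print(lsh.query(doc_minhash(docs["a"])))  # near-duplicate candidates of "a"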

4. Quality Filtering

We implemented two approaches:

Standard Model-Based Filtering

We train a fastText classifier on 100k translated OpenHermes-2.5 sequences as positive examples and 100k deduplicated corpus samples as negatives, using GPT-4o-mini for translation. Selecting the top 10% documents yields DCLM-German (150B tokens).
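
A minimal sketch of this step with the fastText Python API is shown below; the training-file name, label names, and hyperparameters are placeholders, not the values used for DCLM-German.

import fasttext

# train.txt: one document per line, prefixed with a label, e.g.
#   __label__hq <translated OpenHermes-2.5 sequence>
#   __label__lq <random sample from the deduplicated corpus>
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=5, wordNgrams=2)

def quality_score(text: str) -> float:
    # Probability that the classifier assigns the "high quality" label.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

# Scoring the whole corpus and keeping everything above the 90th percentile
# of quality_score corresponds to the top-10% selection described above.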

Ensemble Filtering

This approach globally ranks documents using the worst-rank between fastText quality scores and duplicate counts. Selecting the top 10% produces DCLM-German-FTR (148B tokens).
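
The worst-rank combination can be sketched as follows; the toy inputs, the assumption that higher duplicate counts rank better, and the tie-breaking are illustrative only.

import numpy as np

# Toy inputs: per-document fastText quality scores and pre-deduplication
# duplicate counts (both assumed here to rank "higher is better").
quality_scores = np.array([0.91, 0.40, 0.75, 0.88, 0.10])
duplicate_counts = np.array([12, 3, 40, 1, 2])

def ranks_desc(values: np.ndarray) -> np.ndarray:
    # rank 0 = best (largest value)
    order = np.argsort(-values)
    r = np.empty_like(order)
    r[order] = np.arange(len(values))
    return r

# Each document's combined rank is the worse (larger) of its two ranks.
worst_rank = np.maximum(ranks_desc(quality_scores), ranks_desc(duplicate_counts))

# Keep the best 10% by combined rank (at least one document in this toy case).
k = max(1, int(0.1 * len(worst_rank)))
print(np.argsort(worst_rank)[:k])  # indices of selected documents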

5. Model Training & Evaluation

We train models at four scales (80M, 160M, 400M, 1B parameters) using the GPT-4 tokenizer optimized for German, and evaluate on German versions of MMLU, HellaSwag, ARC Challenge/Easy, and TruthfulQA.
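
For reference, the snippet below inspects how German text tokenizes under the GPT-4 vocabulary via tiktoken, assuming the standard cl100k_base encoding; the training configurations in this repository are authoritative for the exact tokenizer used.

import tiktoken

# GPT-4's cl100k_base encoding (assumed here; see training/ for the actual config).
enc = tiktoken.get_encoding("cl100k_base")

text = "Die Hauptstadt von Deutschland ist Berlin."
token_ids = enc.encode(text)

print(len(token_ids), token_ids)
print([enc.decode([t]) for t in token_ids])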

Installation & Usage

Prerequisites

# Clone the repository
git clone https://github.com/faidrapts/dclm-german.git
cd dclm-german

# Install dependencies
pip install -r requirements.txt

Running the pipeline

To run the pipeline, use the scripts in this repository together with the pipeline components from the original DataComp-LM repository.

Using Pre-trained Models

import open_lm
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "faidrap/dclm-german-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
text = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluation

Benchmarks Used

Benchmark          | Description                                                  | German Version
MMLU               | Massive Multitask Language Understanding across 57 subjects | MMMLU German subset
HellaSwag          | Commonsense reasoning with adversarial multiple choice      | German translation
ARC Challenge/Easy | Grade-school science reasoning                               | ARC-Easy-DE (our translation)
TruthfulQA         | Avoiding common misconceptions                               | German translation

Key Findings

DCLM-German consistently outperforms Aleph-Alpha-GermanWeb on HellaSwag across all model scales and shows strong performance on the ARC benchmarks, with clear benefits from scaling. While it initially underperforms on MMLU, DCLM-German improves by +9.4% after task-specific finetuning, compared to only +2.1% for Aleph-Alpha Synthetic. This indicates that synthetic data generation with QA-specific prompts may inflate MMLU scores, while DCLM-German retains better generalization capacity.

Repository Structure

dclm-german/
├── baselines/          # Filtering and quality assessment tools
├── dedup/              # Deduplication scripts and tools
├── download/           # Data download utilities
├── eval/               # Evaluation benchmarks and scripts
├── filter_refinedweb-german/  # German-specific filtering
├── training/           # Model training configurations
├── utils/              # Utility scripts
├── paper/              # Research paper and documentation
└── assets/             # Visualization and figures

Citation

If you use DCLM-German in your research, please cite our paper:

@article{patsatzi2025dclm,
  title={DCLM-German: Curating a German Pretraining Dataset for Language Models},
  author={Patsatzi, Faidra Anastasia and Heckel, Reinhard},
  journal={Technical University of Munich},
  year={2025},
  url={https://github.com/faidrapts/dclm-german}
}

References & Related Work

DataComp-LM (Li et al., 2024) provides the original DCLM framework. FineWeb (Penedo et al., 2024) demonstrates large-scale web data curation. Aleph-Alpha-GermanWeb (Burchardt et al., 2025) offers German pretraining datasets for comparison. OpenHermes-2.5 (Teknium, 2023) and the MMLU auxiliary training set (CAIS) serve as sources for our instruction-tuning data.

Contributing & License

This project is licensed under the MIT License. We welcome contributions including bug reports, improvements, new benchmarks, and extensions to other languages.

Acknowledgments: TU Dresden ZIH HPC, Reinhard Heckel, the DataComp-LM team, the Technical University of Munich, OpenAI, and HuggingFace for computational resources and platform support.


Contact: [email protected]

For questions, issues, or collaboration opportunities, please open an issue or contact the authors directly.
