WP2TXT

A command-line toolkit to extract text content and category data from Wikipedia dump files

Quick Start

# Install
gem install wp2txt

# Extract text from English Wikipedia (auto-download)
wp2txt --lang=en -o ./output

# Extract specific articles
wp2txt --lang=en --articles="Tokyo,Kyoto" -o ./articles

# Extract articles from a category
wp2txt --lang=en --from-category="Cities in Japan" -o ./cities

About

WP2TXT extracts plain text and category information from Wikipedia dump files. It processes XML dumps (compressed with bzip2), removes MediaWiki markup, and outputs clean text suitable for corpus linguistics, text mining, and other research purposes.

Key Features

Auto-download - Downloads dumps automatically by language code
Article extraction by title - Extracts specific articles without downloading full dumps
Category-based extraction - Extracts all articles from a specific Wikipedia category
Category metadata extraction - Preserves article category information in the output
Template expansion - Expands common templates (dates, units, coordinates) to readable text
Multilingual support - Detects categories and redirects for 350+ Wikipedia languages
Streaming processing - Processes large dumps without intermediate files
JSON output - Writes machine-readable JSONL for data pipelines

Use Cases

wp2txt is particularly suited for:

Building domain-specific corpora using category information
Comparative linguistic research across topic areas
Extracting Wikipedia text with metadata for NLP tasks
Cross-linguistic studies using parallel category structures

Data Access

wp2txt uses official Wikipedia dump files, the recommended method for bulk data access. This approach respects Wikimedia's infrastructure guidelines.

Installation

Install wp2txt

$ gem install wp2txt

System Requirements

WP2TXT requires one of the following commands to decompress bz2 files:

lbzip2 (recommended - uses multiple CPU cores)
pbzip2
bzip2 (pre-installed on most systems)

On macOS with Homebrew:

$ brew install lbzip2

On Windows: Install Bzip2 for Windows and add it to your PATH.
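
To check which of these tools is available before starting a long job, a small script can probe PATH in the preference order listed above. This is a minimal sketch: the function name find_decompressor is hypothetical, and wp2txt's own detection logic may differ.

```python
import shutil

# Preference order taken from the list above: parallel tools first.
CANDIDATES = ["lbzip2", "pbzip2", "bzip2"]

def find_decompressor(candidates=CANDIDATES):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None

tool = find_decompressor()
print(tool or "no bzip2-compatible decompressor found on PATH")
```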

Docker (Alternative)

docker run -it -v /path/to/localdata:/data yohasebe/wp2txt

The wp2txt command is available inside the container. Use /data for input/output files.

Basic Usage

Auto-download and process (Recommended)

$ wp2txt --lang=en -o ./text

This automatically downloads the English Wikipedia dump and extracts plain text. Downloads are cached in ~/.wp2txt/cache/.

Extract specific articles by title

$ wp2txt --lang=en --articles="Cognitive linguistics,Generative grammar" -o ./articles

Only the index file and the necessary data streams are downloaded, which makes this mode much faster than processing the full dump.
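
The idea behind index-based lookup can be sketched as follows. This assumes the standard Wikimedia multistream index layout, where each line reads offset:page_id:title and the offset points at the bz2 stream containing the page. Whether wp2txt parses the index exactly this way is an assumption, and find_offsets and the sample lines are made up for illustration.

```python
def find_offsets(index_lines, wanted_titles):
    """Map each wanted title to the byte offset of the bz2 stream that holds it."""
    wanted = set(wanted_titles)
    offsets = {}
    for line in index_lines:
        # Titles may contain colons, so split at most twice.
        offset, _page_id, title = line.rstrip("\n").split(":", 2)
        if title in wanted:
            offsets[title] = int(offset)
    return offsets

# Made-up sample lines in the offset:page_id:title layout.
sample_index = [
    "600:10:Cognitive linguistics",
    "600:11:Star Trek: The Next Generation",
    "48000:25:Generative grammar",
]
offsets = find_offsets(sample_index, ["Cognitive linguistics", "Generative grammar"])
print(offsets)
```

Only the streams at the returned offsets would then need to be fetched and decompressed.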

Extract articles from a category

$ wp2txt --lang=en --from-category="Cities in Japan" -o ./cities

Include subcategories with --depth:

$ wp2txt --lang=en --from-category="Cities in Japan" --depth=2 -o ./cities

Preview without downloading (shows article counts):

$ wp2txt --lang=en --from-category="Cities in Japan" --dry-run
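
What a depth limit means for category traversal can be illustrated with a breadth-first walk over a toy category graph. This is only a sketch: it assumes depth 0 covers the named category alone and each extra level adds one layer of subcategories. The graph, the article names, and collect_articles are hypothetical, not wp2txt internals.

```python
from collections import deque

def collect_articles(category, subcats, articles, depth=0):
    """Breadth-first walk over subcategories, at most `depth` levels below the start."""
    seen = {category}
    queue = deque([(category, 0)])
    found = set()
    while queue:
        current, level = queue.popleft()
        found.update(articles.get(current, []))
        if level < depth:
            for child in subcats.get(current, []):
                if child not in seen:  # guard against category cycles
                    seen.add(child)
                    queue.append((child, level + 1))
    return found

# Hypothetical category graph, not real Wikipedia data.
subcats = {"Cities in Japan": ["Cities in Hokkaido"], "Cities in Hokkaido": ["Sapporo"]}
articles = {
    "Cities in Japan": ["Tokyo", "Kyoto"],
    "Cities in Hokkaido": ["Hakodate"],
    "Sapporo": ["Sapporo Station"],
}
print(sorted(collect_articles("Cities in Japan", subcats, articles, depth=1)))
```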

Process local dump file

$ wp2txt -i ./enwiki-20220801-pages-articles.xml.bz2 -o ./text

Other extraction modes

# Category info only (title + categories)
$ wp2txt -g --lang=en -o ./category

# Summary only (title + categories + opening paragraphs)
$ wp2txt -s --lang=en -o ./summary

# Metadata only (title + section headings + categories)
$ wp2txt -M --lang=en --format json -o ./metadata

# Extract specific sections from particular articles (fast)
# Section names are case-insensitive; alias matching is enabled by default
$ wp2txt --lang=en --articles="Tokyo" --sections="summary,history,geography" --format json -o ./sections

# Extract specific sections from a category (moderate)
$ wp2txt --lang=en --from-category="Cities in Japan" --sections="summary,history" --format json -o ./sections

# Extract specific sections from full dump (slow - processes all articles)
$ wp2txt --lang=en --sections="summary,plot,reception" --format json -o ./sections

# Section heading statistics (useful for discovering section names before extraction)
$ wp2txt --lang=en --section-stats -o ./stats

# JSON/JSONL output
$ wp2txt --format json --lang=en -o ./json
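
Case-insensitive section matching with aliases, as described in the comments above, might work along these lines. The alias table here is invented for illustration (wp2txt's built-in aliases are not listed in this README), and match_section is a hypothetical helper.

```python
# Hypothetical alias table; the real alias definitions are not shown in this README.
ALIASES = {
    "history": {"history", "background"},
    "plot": {"plot", "synopsis"},
}

def match_section(heading, requested, use_aliases=True):
    """Return the requested section name that `heading` satisfies, or None."""
    normalized = heading.strip().casefold()
    for name in requested:
        key = name.casefold()
        accepted = ALIASES.get(key, {key}) if use_aliases else {key}
        if normalized in accepted:
            return name
    return None

print(match_section("Synopsis", ["summary", "plot"]))
```

Passing use_aliases=False mimics the exact-match behavior that --no-section-aliases describes.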

Sample Output

Text Output

[[Article Title]]

Article content goes here with sections and paragraphs...

CATEGORIES: Category1, Category2, Category3
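
For downstream processing, text output in the layout shown above can be split back into records. This is a minimal sketch that assumes every article starts with a [[Title]] line and ends with an optional CATEGORIES: line, exactly as in the sample. Category names that themselves contain commas would be split incorrectly, and parse_text_output is a hypothetical helper.

```python
import re

TITLE_RE = re.compile(r"^\[\[(.+)\]\]$")
CATEGORY_PREFIX = "CATEGORIES: "

def parse_text_output(text):
    """Split wp2txt-style text output into dicts with title, body, and categories."""
    records = []
    current = None
    for line in text.splitlines():
        title_match = TITLE_RE.match(line)
        if title_match:
            current = {"title": title_match.group(1), "body": [], "categories": []}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first title line
        elif line.startswith(CATEGORY_PREFIX):
            names = line[len(CATEGORY_PREFIX):].split(",")
            current["categories"] = [name.strip() for name in names]
        else:
            current["body"].append(line)
    for record in records:
        record["body"] = "\n".join(record["body"]).strip()
    return records

sample_text = """[[Article Title]]

Article content goes here with sections and paragraphs...

CATEGORIES: Category1, Category2, Category3
"""
parsed = parse_text_output(sample_text)
print(parsed[0]["title"], parsed[0]["categories"])
```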

JSON/JSONL Output

Each line contains one JSON object:

{"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}

For redirect articles:

{"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}
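
Reading this format back is straightforward with a JSON parser. The sketch below uses only the fields shown in the two samples above; iter_articles is a hypothetical helper, and the in-memory sample stands in for a real output file.

```python
import io
import json

def iter_articles(lines, skip_redirects=True):
    """Yield one dict per JSONL line, optionally dropping redirect entries."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if skip_redirects and record.get("redirect"):
            continue
        yield record

# The two sample records from above, as they would appear in an output file.
sample = io.StringIO(
    '{"title": "Article Title", "categories": ["Cat1", "Cat2"], "text": "...", "redirect": null}\n'
    '{"title": "NYC", "categories": [], "text": "", "redirect": "New York City"}\n'
)
titles = [article["title"] for article in iter_articles(sample)]
print(titles)
```

With a real file, pass the open file handle instead of the StringIO object.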

Cache Management

$ wp2txt --cache-status           # Show cache status
$ wp2txt --cache-clear            # Clear all cache
$ wp2txt --cache-clear --lang=en  # Clear cache for English only
$ wp2txt --update-cache           # Force fresh download

When cached data is older than the expiry period (default: 30 days), wp2txt displays a warning but still allows you to use the cached data.
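
The expiry rule amounts to a simple age comparison. The following sketch shows the idea with fixed dates; is_stale is a hypothetical name, and how wp2txt treats the exact boundary day is an assumption.

```python
from datetime import datetime, timedelta

def is_stale(downloaded_at, now, dump_expiry_days=30):
    """True when the cached dump is older than the expiry period."""
    return now - downloaded_at > timedelta(days=dump_expiry_days)

downloaded = datetime(2026, 1, 1)
for check_date in (datetime(2026, 1, 20), datetime(2026, 2, 15)):
    status = "stale, warn but continue" if is_stale(downloaded, check_date) else "fresh"
    print(check_date.date(), status)
```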

Advanced Options

Content Type Markers

Special content is replaced with marker placeholders by default:

Inline markers (appear within sentences):

Marker	Content Type
`[MATH]`	Mathematical formulas
`[CODE]`	Inline code
`[CHEM]`	Chemical formulas
`[IPA]`	IPA phonetic notation

Block markers (standalone content):

Marker	Content Type
`[CODEBLOCK]`	Source code blocks
`[TABLE]`	Wiki tables
`[INFOBOX]`	Information boxes
`[NAVBOX]`	Navigation boxes
`[GALLERY]`	Image galleries
`[REFERENCES]`	Reference lists
`[SCORE]`	Musical scores
`[TIMELINE]`	Timeline graphics
`[GRAPH]`	Graphs/charts
`[SIDEBAR]`	Sidebar templates
`[MAPFRAME]`	Interactive maps
`[IMAGEMAP]`	Clickable image maps

Configure with --markers:

$ wp2txt --lang=en --markers=all -o ./text        # All markers (default)
$ wp2txt --lang=en --markers=math,code -o ./text  # Only MATH and CODE

Note: --markers=none is deprecated, because removing special content can leave the surrounding text nonsensical.
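
When analyzing extracted text, it can help to know how often each placeholder occurs, for example to filter out marker-heavy passages. This sketch counts the markers listed in the two tables above; count_markers and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Marker names copied from the inline and block marker tables above.
MARKERS = [
    "MATH", "CODE", "CHEM", "IPA", "CODEBLOCK", "TABLE", "INFOBOX", "NAVBOX",
    "GALLERY", "REFERENCES", "SCORE", "TIMELINE", "GRAPH", "SIDEBAR",
    "MAPFRAME", "IMAGEMAP",
]
MARKER_RE = re.compile(r"\[(" + "|".join(MARKERS) + r")\]")

def count_markers(text):
    """Count how many times each content marker appears in the text."""
    return Counter(MARKER_RE.findall(text))

sample = "The formula [MATH] is computed by [CODE], then [MATH] again.\n[CODEBLOCK]\n[TABLE]"
counts = count_markers(sample)
print(counts.most_common())
```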

Template Expansion

Common MediaWiki templates are automatically expanded (enabled by default):

Template	Output
`{{birth date|1990|5|15}}`	May 15, 1990
`{{convert|100|km|mi}}`	100 km (62 mi)
`{{coord|35|41|N|139|41|E}}`	35°41′N 139°41′E
`{{lang|ja|日本語}}`	日本語
`{{nihongo|Tokyo|東京|Tōkyō}}`	Tokyo (東京, Tōkyō)
`{{frac|1|2}}`	1/2
`{{circa|1900}}`	c. 1900

Supported: date/age templates, unit conversion, coordinates, language tags, quotes, fractions, and more. Parser functions ({{#if:}}, {{#switch:}}) and magic words ({{PAGENAME}}, {{CURRENTYEAR}}) are also supported.

Disable with --no-expand-templates.
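
To give a feel for what template expansion involves, here is a toy expander covering three of the simplest rows in the table above (frac, circa, lang). It is not wp2txt's implementation: the regex handles only non-nested templates, and expand_templates is a hypothetical name.

```python
import re

# Matches a template with no nested braces, e.g. {{frac|1|2}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")

def _expand(match):
    name, *args = [part.strip() for part in match.group(1).split("|")]
    name = name.lower()
    if name == "frac" and len(args) == 2:
        return f"{args[0]}/{args[1]}"
    if name == "circa" and len(args) == 1:
        return f"c. {args[0]}"
    if name == "lang" and len(args) == 2:
        return args[1]  # drop the language code, keep the text
    return match.group(0)  # leave unknown templates untouched

def expand_templates(text):
    """Expand the few supported templates and pass everything else through."""
    return TEMPLATE_RE.sub(_expand, text)

print(expand_templates("Built {{circa|1900}}, about {{frac|1|2}} km from {{lang|ja|日本語}}."))
```

Templates that need real computation or data, such as convert or coord, require considerably more logic than this.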

Citation Extraction

By default, citation templates are removed. Use --extract-citations to extract formatted citations:

$ wp2txt --lang=en --extract-citations -o ./text

Supported: {{cite book}}, {{cite web}}, {{cite news}}, {{cite journal}}, {{Citation}}, etc.

Command Line Options

Usage: wp2txt [options]

Input source (one of --input or --lang required):
  -i, --input=<s>                  Path to compressed file (bz2) or XML file
  -L, --lang=<s>                   Wikipedia language code (e.g., ja, en, de)
  -A, --articles=<s>               Specific article titles (comma-separated)
  -G, --from-category=<s>          Extract articles from Wikipedia category
  -D, --depth=<i>                  Subcategory recursion depth (default: 0)
  -y, --yes                        Skip confirmation prompt
  --dry-run                        Preview category extraction
  -U, --update-cache               Force refresh of cached files

Output options:
  -o, --output-dir=<s>             Output directory (default: current)
  -j, --format=<s>                 Output format: text or json (default: text)
  -f, --file-size=<i>              Output file size in MB (default: 10, 0=single)

Cache management:
  --cache-dir=<s>                  Cache directory (default: ~/.wp2txt/cache)
  --cache-status                   Show cache status and exit
  --cache-clear                    Clear cache and exit

Configuration:
  --config-init                    Create default config (~/.wp2txt/config.yml)
  --config-path=<s>                Path to configuration file

Extraction modes (mutually exclusive):
  -g, --category-only              Extract only title and categories
  -s, --summary-only               Extract title, categories, and summary
  -M, --metadata-only              Extract only title, headings, and categories

Section extraction:
  -S, --sections=<s>               Extract specific sections (comma-separated, case-insensitive)
  --section-output=<s>             Output mode: structured or combined (default: structured)
  --min-section-length=<i>         Minimum section length in characters (default: 0)
  --skip-empty                     Skip articles with no matching sections
  --alias-file=<s>                 Custom section alias definitions file (YAML)
  --no-section-aliases             Disable section alias matching (exact match only)
  --section-stats                  Collect and output section heading statistics (JSON)
  --show-matched-sections          Include matched_sections field in JSON output

Content filtering:
  -a, --category, --no-category    Show category info (default: true)
  -t, --title, --no-title          Keep page titles (default: true)
  -d, --heading, --no-heading      Keep section titles (default: true)
  -l, --list                       Keep list items (default: false)
  --table                          Keep wiki table content (default: false)
  -p, --pre                        Keep preformatted text blocks (default: false)
  -r, --ref                        Keep references as [ref]...[/ref] (default: false)
  --multiline                      Keep multi-line templates (default: false)
  -e, --redirect                   Show redirect destination (default: false)
  -m, --marker, --no-marker        Show list markers (default: true)
  -k, --markers=<s>                Content markers (default: all)
  -C, --extract-citations          Extract formatted citations
  -E, --expand-templates           Expand templates (default: true)
      --no-expand-templates        Disable template expansion

Performance:
  -n, --num-procs=<i>              Parallel processes (default: auto)
  --no-turbo                       Disable turbo mode (saves disk space, slower)
  -R, --ractor                     Use Ractor parallelism (Ruby 4.0+, streaming only)
  -b, --bz2-gem                    Use bzip2-ruby gem instead of system command

Output control:
  -q, --quiet                      Suppress progress output (errors only)
  --no-color                       Disable colored output

Info:
  -v, --version                    Print version
  -h, --help                       Show help

Configuration File

Create persistent settings with:

$ wp2txt --config-init

This creates ~/.wp2txt/config.yml:

cache:
  dump_expiry_days: 30      # Days before dumps are stale (1-365)
  category_expiry_days: 7   # Category cache expiry (1-90)
  directory: ~/.wp2txt/cache

defaults:
  format: text              # Default output format
  depth: 0                  # Default subcategory depth

Command-line options override configuration file settings.
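
The precedence rule can be pictured as a layered merge in which later sources win. This is a sketch of the general pattern, not wp2txt's code. The option names come from the defaults section above, while resolve_options and the use of None for "not given on the command line" are assumptions.

```python
# Built-in defaults, mirroring the values shown in the config file above.
BUILTIN_DEFAULTS = {"format": "text", "depth": 0}

def resolve_options(config_defaults, cli_args):
    """Later sources win: built-in defaults < config file < command line."""
    merged = dict(BUILTIN_DEFAULTS)
    merged.update(config_defaults)
    # Only options actually given on the command line (not None) override.
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged

config_defaults = {"format": "json", "depth": 1}  # as if read from config.yml
cli_args = {"format": None, "depth": 2}           # as if only --depth=2 was passed
resolved = resolve_options(config_defaults, cli_args)
print(resolved)
```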

Performance

Benchmark results on a MacBook Air M4 (7 parallel processes, turbo mode, excluding download time):

Wikipedia	Dump Size	Articles	Processing Time	Output
Japanese	4.37 GB	1,485,937	~27 min	463 files (4.5 GB)
English	24.2 GB	~6.8M	~2 hours	2,000 files (20 GB)

Turbo mode (default) first splits the bz2 dump into XML chunks, then processes the chunks in parallel. Use --no-turbo to save disk space at the cost of slower processing.
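
The benchmark table can be turned into rough throughput figures for estimating other runs. The numbers below are copied from the table (the English article count and both times are approximate), and 1 GB is taken as 1000 MB.

```python
# Figures copied from the benchmark table; the "~" values are approximate.
benchmarks = {
    "Japanese": {"articles": 1_485_937, "minutes": 27, "output_gb": 4.5, "files": 463},
    "English": {"articles": 6_800_000, "minutes": 120, "output_gb": 20, "files": 2000},
}

summary = {}
for name, run in benchmarks.items():
    articles_per_second = run["articles"] / (run["minutes"] * 60)
    mb_per_file = run["output_gb"] * 1000 / run["files"]
    summary[name] = (articles_per_second, mb_per_file)
    print(f"{name}: ~{articles_per_second:.0f} articles/s, ~{mb_per_file:.1f} MB per output file")
```

Both runs come out at roughly 900 to 950 articles per second and about 10 MB per output file, which is consistent with the default --file-size of 10.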

Caveats

Special content (math, code, etc.) is marked with placeholders by default.
Some text may not be extracted correctly due to markup variations or language-specific formatting.

Changelog

See CHANGELOG.md for detailed release notes.

v2.1.0 (February 2026): SQLite caching, Ractor parallelism (Ruby 4.0+), template expansion, content markers, Docker image update.

v2.0.0 (January 2026): Auto-download mode, category-based extraction, article extraction by title, JSON output, streaming processing, Ruby 4.0 support.

Useful Links

Wikipedia Database backup dumps

Author

Yoichiro Hasebe ([email protected])

References

The author would appreciate it if you cite one of the following in your research.

Yoichiro HASEBE. 2006. Method for using Wikipedia as Japanese corpus. Doshisha Studies in Language and Culture 9(2), 373-403.
長谷部陽一郎. 2006. Wikipedia日本語版をコーパスとして用いた言語研究の手法. 『言語文化』9(2), 373-403.

BibTeX:

@misc{wp2txt_2026,
  author = {Yoichiro Hasebe},
  title = {WP2TXT: A command-line toolkit to extract text content and category data from Wikipedia dump files},
  url = {https://github.com/yohasebe/wp2txt},
  year = {2026}
}

License

This software is distributed under the MIT License. Please see the LICENSE file.
