wicket

Wikipedia corpus knowledge extractor.

Features

Streaming XML parsing that handles multi-gigabyte dumps without loading them into memory
Parallel text extraction using multiple CPU cores via rayon
Automatic bzip2 decompression for .xml.bz2 dump files
Doc format and JSON format output
File splitting with configurable maximum size per file
Namespace filtering to extract only specific page types
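To illustrate the file-splitting feature, here is a minimal sketch of size-based chunking using only the standard library. It is not wicket's actual implementation: the function name `split_into_chunks` is hypothetical, and the policy for documents larger than the limit (each gets its own chunk) is an assumption.

```rust
/// Group documents into chunks whose combined size stays at or below
/// `max_bytes` where possible. A single document larger than the limit
/// is placed in a chunk of its own (assumed policy, for illustration).
fn split_into_chunks<'a>(docs: &[&'a str], max_bytes: usize) -> Vec<Vec<&'a str>> {
    let mut chunks: Vec<Vec<&'a str>> = Vec::new();
    let mut current: Vec<&'a str> = Vec::new();
    let mut current_size = 0usize;

    for &doc in docs {
        // Start a new chunk when adding this document would exceed the limit.
        if !current.is_empty() && current_size + doc.len() > max_bytes {
            chunks.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += doc.len();
        current.push(doc);
    }
    // Flush the final, partially filled chunk.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let docs = ["aaaa", "bbbb", "cc", "dddddddddd"];
    for (i, chunk) in split_into_chunks(&docs, 8).iter().enumerate() {
        println!("chunk {i}: {chunk:?}");
    }
}
```

In a real extractor each chunk would be written to its own output file rather than collected in memory.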

Installation

From crates.io

cargo install wicket-cli

From source

Requires Rust 1.85 or later.

git clone https://github.com/mosuka/wicket.git
cd wicket
cargo build --release

Quick Start

# Download a Wikipedia dump
wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

# Extract plain text
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/

# JSON output
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/ --json

# Write to stdout
wicket simplewiki-latest-pages-articles.xml.bz2 -o - -q | head -50

CLI Options

| Option | Description | Default |
| --- | --- | --- |
| `<INPUT>` | Input Wikipedia XML dump file (`.xml` or `.xml.bz2`) | (required) |
| `-o, --output` | Output directory, or `-` for stdout | `text` |
| `-b, --bytes` | Maximum bytes per output file (e.g., `1M`, `500K`, `1G`) | `1M` |
| `-c, --compress` | Compress output files using bzip2 | `false` |
| `--json` | Write output in JSON format | `false` |
| `--processes` | Number of parallel workers | CPU count |
| `-q, --quiet` | Suppress progress output on stderr | `false` |
| `--namespaces` | Comma-separated namespace IDs to extract | `0` |
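Size values such as `1M`, `500K`, and `1G` have to be converted to a byte count. The following is a rough sketch of how such a value might be parsed. The function `parse_size` is hypothetical, and the use of binary multiples (`K` = 1024 bytes) is an assumption; wicket's own parser may interpret the suffixes differently.

```rust
/// Parse a human-readable size such as "1M", "500K", or "1G" into bytes.
/// Assumes binary multiples (K = 1024); a bare number is taken as bytes.
/// Returns `None` for empty, malformed, or overflowing input.
fn parse_size(s: &str) -> Option<u64> {
    let s = s.trim();
    // Split off a one-letter ASCII suffix, if present.
    let (digits, multiplier) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1u64 << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1u64 << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s, 1),
    };
    digits.parse::<u64>().ok()?.checked_mul(multiplier)
}

fn main() {
    for arg in ["1M", "500K", "1G", "4096"] {
        println!("{arg} -> {:?}", parse_size(arg));
    }
}
```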

Output Formats

Doc Format (default)

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON Format

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}
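Downstream tools often need to read the doc format back in. The sketch below parses it with the standard library only. The `Doc` struct and the `attr` and `parse_docs` helpers are hypothetical and not part of wicket; the code assumes that each `<doc ...>` header and each `</doc>` sits on its own line and that attribute values contain no unescaped double quotes.

```rust
/// One record recovered from doc-format output (hypothetical type).
#[derive(Debug, PartialEq)]
struct Doc {
    id: String,
    url: String,
    title: String,
    text: String,
}

/// Extract the value of `name="..."` from a `<doc ...>` header line.
fn attr<'a>(header: &'a str, name: &str) -> Option<&'a str> {
    let key = format!(" {name}=\"");
    let start = header.find(&key)? + key.len();
    let end = start + header[start..].find('"')?;
    Some(&header[start..end])
}

/// Parse doc-format text into records, joining body lines with newlines.
fn parse_docs(input: &str) -> Vec<Doc> {
    let mut docs = Vec::new();
    let mut current: Option<Doc> = None;
    for line in input.lines() {
        if line.starts_with("<doc ") {
            // Header line: start a new record; missing attributes become "".
            current = Some(Doc {
                id: attr(line, "id").unwrap_or_default().to_string(),
                url: attr(line, "url").unwrap_or_default().to_string(),
                title: attr(line, "title").unwrap_or_default().to_string(),
                text: String::new(),
            });
        } else if line == "</doc>" {
            // Closing line: finish the current record, if any.
            if let Some(doc) = current.take() {
                docs.push(doc);
            }
        } else if let Some(doc) = current.as_mut() {
            if !doc.text.is_empty() {
                doc.text.push('\n');
            }
            doc.text.push_str(line);
        }
    }
    docs
}

fn main() {
    let input = "<doc id=\"1\" url=\"https://en.wikipedia.org/wiki/April\" title=\"April\">\n\
                 April is the fourth month of the year...\n\
                 </doc>\n";
    for doc in parse_docs(input) {
        println!("{} ({}): {} bytes of text", doc.title, doc.id, doc.text.len());
    }
}
```

The JSON format carries one object per line, so it is usually the easier choice when a JSON parser is available.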

Library Usage

Add to your Cargo.toml:

[dependencies]
wicket = "0.1.0"

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader.take(5) {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        println!("{}", output);
    }

    Ok(())
}

Documentation

License

MIT OR Apache-2.0
