5 releases
Uses new Rust 2024
| Version | Released |
|---|---|
| 0.6.5 | Apr 1, 2026 |
| 0.6.4 | Apr 1, 2026 |
| 0.6.3 | Apr 1, 2026 |
| 0.6.2 | Apr 1, 2026 |
| 0.6.1 | Apr 1, 2026 |
A polite downloader for Common Crawl data, written in Rust.
Install
cargo install ccdown
Other methods
From source
git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .
Pre-built binaries
Grab the latest release for your platform from the releases page.
Usage
1. Download the path manifest for a crawl
ccdown download-paths CC-MAIN-2025-08 warc ./paths
Supported subsets: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table
Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
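As a rough illustration of the two crawl ID formats above, the sketch below validates an ID before it is passed to the tool. The function name `parse_crawl_id` and the range checks (weeks 01 to 53, months 01 to 12) are assumptions made for this example, not ccdown's own validation logic.

```python
import re

# Pattern inferred from the two documented formats; not taken from ccdown's source.
_CRAWL_ID = re.compile(r"CC-(MAIN|NEWS)-(\d{4})-(\d{2})")


def parse_crawl_id(crawl_id):
    """Split a crawl ID into (kind, year, number) or raise ValueError."""
    match = _CRAWL_ID.fullmatch(crawl_id)
    if match is None:
        raise ValueError(f"not a CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM id: {crawl_id!r}")
    kind, year, number = match.group(1), int(match.group(2)), int(match.group(3))
    # Assumed ranges: MAIN ids carry a week number, NEWS ids carry a month.
    upper = 53 if kind == "MAIN" else 12
    if not 1 <= number <= upper:
        raise ValueError(f"{number:02d} is out of range for a CC-{kind} id")
    return kind, year, number
```

Checking the ID up front avoids a wasted request for a manifest that cannot exist.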
2. Download the actual data
ccdown download ./paths/warc.paths.gz ./data
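To see what the second step consumes, here is a minimal sketch that reads a path manifest and turns each entry into a download URL. It assumes the manifest is a gzip-compressed text file with one relative path per line and that files are served from `https://data.commoncrawl.org/`; the helper name `manifest_urls` is hypothetical, and ccdown itself may resolve URLs differently.

```python
import gzip

# Public Common Crawl HTTPS endpoint; assumed here, ccdown may use another base.
BASE_URL = "https://data.commoncrawl.org/"


def manifest_urls(manifest_path, base_url=BASE_URL):
    """Yield one download URL per non-empty line of a gzipped path manifest."""
    with gzip.open(manifest_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            path = line.strip()
            if path:  # skip blank lines
                yield base_url + path
```

Counting the yielded URLs is a quick way to estimate how many files a crawl subset contains before starting a download.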
Options
| Flag | Description | Default |
|---|---|---|
| -t | Number of concurrent downloads | 10 |
| -r | Max retries per file | 1000 |
| -p | Show progress bars | off |
| -f | Flat file output (no directory structure) | off |
| -n | Numbered output (for Ungoliant Pipeline) | off |
| -s | Abort on unrecoverable errors (401, 403, 404) | off |
Example
ccdown download -p -t 5 ./paths/warc.paths.gz ./data
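For scripted use, the flags in the options table can be assembled programmatically. The sketch below builds the argument list for `ccdown download`; the helper name `build_download_command` is hypothetical, and it assumes that `-r` takes its value as the next argument in the same way the example shows for `-t`.

```python
def build_download_command(paths_file, dest, threads=10, retries=1000,
                           progress=False, flat=False, numbered=False,
                           strict=False):
    """Return the argv list for a `ccdown download` invocation."""
    cmd = ["ccdown", "download"]
    # Boolean switches from the options table, emitted only when enabled.
    for enabled, flag in ((progress, "-p"), (flat, "-f"),
                          (numbered, "-n"), (strict, "-s")):
        if enabled:
            cmd.append(flag)
    # Assumption: -r takes a separate value, like -t in the documented example.
    cmd += ["-t", str(threads), "-r", str(retries)]
    cmd += [paths_file, dest]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string sidesteps quoting problems with unusual paths.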
Note: Keep -t at 10 or below. Too many concurrent requests will make the server
respond with 403, and those errors are unrecoverable.
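The distinction between retryable and unrecoverable errors can be sketched as follows. This is an illustration only: the capped exponential backoff, the `fetch` callable returning a `(status, body)` pair, and the names `fetch_with_retries` and `UnrecoverableError` are assumptions for this example, since the README does not describe ccdown's actual retry schedule.

```python
import time

# Status codes the options table lists as unrecoverable.
UNRECOVERABLE = frozenset({401, 403, 404})


class UnrecoverableError(Exception):
    """Raised when retrying cannot help (401, 403, 404)."""


def fetch_with_retries(fetch, max_retries=1000, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it returns status 200 or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status in UNRECOVERABLE:
            # Retrying would only add load, so fail immediately.
            raise UnrecoverableError(f"HTTP {status}")
        if attempt < max_retries:
            # Capped exponential backoff (assumed policy, not ccdown's).
            sleep(min(max_delay, base_delay * 2 ** min(attempt, 10)))
    raise RuntimeError(f"gave up after {max_retries} retries")
```

A caller in strict mode would let `UnrecoverableError` stop the whole run, while a lenient caller could catch it and move on to the next file.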
Python bindings
Install
pip install ccdown
Usage
from ccdown import Client
client = Client(threads=10, retries=1000, progress=True)
# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")
# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")
# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")
# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")
API
Client(threads=10, retries=1000, progress=False): create a client with shared config.
client.paths(snapshot, data_type): returns a builder. Call .to(dst) to download the path manifest.
client.download(path_file): returns a builder with chainable options:
- .files_only(): flatten the directory structure
- .numbered(): enumerate output files (for Ungoliant)
- .strict(): abort on unrecoverable HTTP errors
- .to(dst): execute the download
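The two-step flow can be wrapped in a small helper. The sketch below takes the client as a parameter, so it relies only on the `paths(...)`, `download(...)` and `.to(...)` calls documented above. The function name `fetch_crawl` is hypothetical, and the manifest filename `<data_type>.paths.gz` is an assumption generalized from the `warc.paths.gz` example; other subsets may be named differently.

```python
import os


def fetch_crawl(client, snapshot, data_type, paths_dir, data_dir):
    """Download the path manifest, then the data it lists; return the manifest path."""
    client.paths(snapshot, data_type).to(paths_dir)
    # Assumed naming, generalized from the documented warc.paths.gz example.
    manifest = os.path.join(paths_dir, f"{data_type}.paths.gz")
    client.download(manifest).to(data_dir)
    return manifest
```

With the bindings installed, this would be called as `fetch_crawl(Client(threads=10), "CC-MAIN-2025-08", "warc", "./paths", "./data")`.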
License
MIT OR Apache-2.0