Programmer: Hasan Alqahtani
CC Codex Crawler provides a small Python utility for extracting files from the
Common Crawl dataset. Earlier versions only operated on local WARC files.
The project now includes a lightweight fetcher inspired by
commoncrawl-fetcher-lite, which downloads records based on the public indices.
Install the required dependencies with pip:

```bash
pip install -r requirements.txt
```

The `fetcher.py` script reads a JSON configuration describing which records to
extract from the Common Crawl indices. A default example is provided as
`config.json`. A minimal configuration looks like:
```json
{
  "dryRun": true,
  "indices": {"paths": ["crawl-data/CC-MAIN-2023-06/cc-index.paths.gz"]},
  "recordSelector": {
    "must": {"status": [{"match": "200"}]},
    "must_not": {"mime": [{"match": "video/avi"}]},
    "should": {"mime-detected": [{"match": "video/mp4"}]}
  }
}
```

Run the fetcher with:

```bash
python fetcher.py config.json
```
If `dryRun` is set to `false`, the matching files are downloaded and stored in
the directory specified by `outputDir`.
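The clause names follow the familiar boolean-query pattern: a record must
satisfy every `must` clause, must fail every `must_not` clause, and `should`
clauses mark optional preferences. The sketch below shows how such a selector
could be evaluated against a parsed index record; the `matches` and `select`
helpers are illustrative and not `fetcher.py`'s actual internals:

```python
# Illustrative sketch of recordSelector evaluation; not the actual
# fetcher.py implementation.
def matches(clauses, record):
    """Return True if the record satisfies every {field: [{"match": ...}]} clause."""
    for field, conditions in clauses.items():
        value = record.get(field, "")
        if not any(cond["match"] == value for cond in conditions):
            return False
    return True

def select(selector, record):
    if not matches(selector.get("must", {}), record):
        return False
    # must_not: reject the record if any clause matches.
    for field, conditions in selector.get("must_not", {}).items():
        value = record.get(field, "")
        if any(cond["match"] == value for cond in conditions):
            return False
    return True  # "should" clauses could rank results rather than filter them.

record = {"status": "200", "mime": "video/mp4", "mime-detected": "video/mp4"}
selector = {
    "must": {"status": [{"match": "200"}]},
    "must_not": {"mime": [{"match": "video/avi"}]},
}
print(select(selector, record))  # True
```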
`streaming_processor.py` asynchronously streams gzipped index files. Create a
YAML configuration similar to `config_template.yaml` and run:

```bash
python streaming_processor.py config_template.yaml
```

The processor logs progress and retries with exponential backoff on HTTP 503 responses.
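The retry behaviour can be pictured with a short asyncio sketch. This is a
simplified illustration using `aiohttp`; the function name, retry count, and
delays are assumptions, not the script's actual parameters:

```python
# Simplified retry sketch, assuming aiohttp; not the exact logic in
# streaming_processor.py.
import asyncio
import aiohttp

async def fetch_with_backoff(session, url, max_retries=5):
    for attempt in range(max_retries):
        async with session.get(url) as resp:
            if resp.status == 503:
                delay = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"503 from server, retrying in {delay}s")
                await asyncio.sleep(delay)
                continue
            resp.raise_for_status()
            return await resp.read()
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")

async def main():
    async with aiohttp.ClientSession() as session:
        data = await fetch_with_backoff(
            session,
            "https://data.commoncrawl.org/cc-index/collections/"
            "CC-MAIN-2023-06/indexes/cdx-00001.gz",
        )
        print(f"downloaded {len(data)} bytes")

asyncio.run(main())
```

Doubling the delay after each 503 keeps retries polite toward Common Crawl's
public servers, which commonly return 503 under heavy load.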
Set `record_limit` to the number of results you want and use
`extension_filter` to restrict matches by file extension:
```yaml
urls:
  - "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00001.gz"
record_limit: 100
extension_filter: ".mp3"
```

Run the processor with your configuration:

```bash
python streaming_processor.py config.yaml
```

Each matching record's URL and timestamp are printed to the console.
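Each line in a Common Crawl CDX index file consists of a URL key, a timestamp,
and a JSON payload, so a filter like the one configured above can be pictured
as follows (a sketch with assumed field handling, not the processor's actual
code):

```python
# Sketch of applying record_limit and extension_filter to CDX index
# lines; the real streaming_processor.py may parse differently.
import gzip
import json

def filter_index(path, extension_filter=".mp3", record_limit=100):
    matched = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            # CDX format: "<urlkey> <timestamp> <json>"
            _, timestamp, blob = line.split(" ", 2)
            record = json.loads(blob)
            if record.get("url", "").endswith(extension_filter):
                print(record["url"], timestamp)
                matched += 1
                if matched >= record_limit:
                    break

filter_index("cdx-00001.gz")
```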
The previous local crawler is still available as `crawler.py`. It scans local
WARC files and saves matching records based on file extension. See
docs/USAGE.md for details.
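For a picture of that local workflow, a WARC scan of this kind can be written
with the `warcio` library; the file name and extension below are placeholders,
and `crawler.py`'s real implementation may differ:

```python
# Sketch of scanning a local WARC file for responses whose target URL
# ends with a given extension; not crawler.py's exact logic.
from warcio.archiveiterator import ArchiveIterator

def scan_warc(path, extension=".mp3"):
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if url.endswith(extension):
                payload = record.content_stream().read()
                print(url, len(payload), "bytes")

scan_warc("example.warc.gz")
```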
This README covers installation and quick-start examples. See docs/USAGE.md
for extended usage instructions and additional examples.
Contributions are welcome! See CONTRIBUTING.md for instructions. The
repository uses pre-commit hooks and runs pytest on all pull requests.