CC Codex Crawler

Programmer: Hasan Alqahtani

CC Codex Crawler provides a small Python utility for extracting files from the Common Crawl dataset. Earlier versions only operated on local WARC files; the project now also includes a lightweight fetcher, inspired by commoncrawl-fetcher-lite, that downloads records based on the public Common Crawl indices.

Installation

Install the required dependencies with pip:

pip install -r requirements.txt

Running the fetcher

The fetcher.py script reads a JSON configuration describing which records to extract from the Common Crawl indices. A default example is provided as config.json. A minimal configuration looks like:

{
  "dryRun": true,
  "indices": {"paths": ["crawl-data/CC-MAIN-2023-06/cc-index.paths.gz"]},
  "recordSelector": {
    "must": {"status": [{"match": "200"}]},
    "must_not": {"mime": [{"match": "video/avi"}]},
    "should": {"mime-detected": [{"match": "video/mp4"}]}
  }
}
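
The must / must_not / should clauses read like the usual boolean filter semantics (all must clauses have to match, no must_not clause may match, and should clauses express a preference); the exact behaviour is defined by fetcher.py. The sketch below only illustrates that interpretation against a single parsed index entry and is not the script's actual implementation; the field names come from the example above.

import re

def matches(clause, value):
    # A clause like {"match": "200"} is treated here as a simple
    # regex/equality check; fetcher.py may support richer operators.
    return value is not None and re.fullmatch(clause["match"], value) is not None

def select(record, selector):
    # record is one parsed index entry, e.g. {"status": "200", "mime": "video/mp4", ...}
    for field, clauses in selector.get("must", {}).items():
        if not any(matches(c, record.get(field)) for c in clauses):
            return False
    for field, clauses in selector.get("must_not", {}).items():
        if any(matches(c, record.get(field)) for c in clauses):
            return False
    return True  # "should" clauses are omitted here for brevity

example = {"status": "200", "mime": "video/mp4", "mime-detected": "video/mp4"}
selector = {
    "must": {"status": [{"match": "200"}]},
    "must_not": {"mime": [{"match": "video/avi"}]},
}
print(select(example, selector))  # True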

Run the fetcher with:

python fetcher.py config.json

If dryRun is set to false, the matching files are downloaded and stored in the directory specified by outputDir.
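
With dryRun set to false, each matching index entry identifies a byte range (filename, offset, length) inside a Common Crawl WARC archive, and such a record can conceptually be retrieved with an HTTP Range request against data.commoncrawl.org. The snippet below is a standard-library sketch of that pattern, not the fetcher's own code.

import gzip
import urllib.request

def fetch_record(filename, offset, length):
    # filename/offset/length come from a CDX index entry.
    url = "https://data.commoncrawl.org/" + filename
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        compressed = resp.read()
    # Each record is an independently gzipped WARC member.
    return gzip.decompress(compressed)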

Streaming Processor

streaming_processor.py asynchronously streams gzipped index files. Create a YAML configuration similar to config_template.yaml and run:

python streaming_processor.py config_template.yaml

The processor logs progress and retries with exponential backoff on HTTP 503 responses.
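
As a rough illustration of that retry behaviour, assuming an aiohttp-style asynchronous client (the actual client, delays, and retry limits in streaming_processor.py may differ):

import asyncio
import aiohttp

async def get_with_backoff(session, url, retries=5, delay=1.0):
    # Retry on HTTP 503, doubling the wait time after each attempt.
    for attempt in range(retries):
        async with session.get(url) as resp:
            if resp.status == 503:
                await asyncio.sleep(delay)
                delay *= 2
                continue
            resp.raise_for_status()
            return await resp.read()
    raise RuntimeError("giving up on %s after %d retries" % (url, retries))

async def main():
    async with aiohttp.ClientSession() as session:
        data = await get_with_backoff(
            session,
            "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00001.gz",
        )

asyncio.run(main())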

Sampling MP3 URLs

In the streaming processor's YAML configuration, set record_limit to the number of results you want and use extension_filter to restrict matches by file extension:

urls:
  - "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00001.gz"
record_limit: 100
extension_filter: ".mp3"

Run the processor with your configuration:

python streaming_processor.py config.yaml

Each matching record's URL and timestamp will be printed to the console.
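
For reference, each line of such a gzipped index file carries a SURT key, a capture timestamp, and a JSON payload containing the original URL. A simplified, synchronous sketch of the extension filter (the processor itself streams these files asynchronously and adds retries) could look like:

import gzip
import json

def sample_mp3_urls(path, record_limit=100, extension_filter=".mp3"):
    # path points at a locally downloaded cdx-*.gz index file.
    found = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # Line format: "<surt key> <timestamp> <json payload>"
            try:
                _, timestamp, payload = line.split(" ", 2)
                record = json.loads(payload)
            except ValueError:
                continue
            if record.get("url", "").lower().endswith(extension_filter):
                print(record["url"], timestamp)
                found += 1
                if found >= record_limit:
                    break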

Local crawler

The previous local crawler is still available as crawler.py. It scans local WARC files and saves matching records based on file extension. See docs/USAGE.md for details.
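
For illustration only (crawler.py may be structured differently and may not use this library), scanning a local WARC file with warcio and keeping response records whose URL ends in a given extension might look like:

from warcio.archiveiterator import ArchiveIterator

def extract_by_extension(warc_path, extension=".pdf"):
    # Iterate over records in a local (optionally gzipped) WARC file.
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI") or ""
            if uri.lower().endswith(extension):
                yield uri, record.content_stream().read()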

Documentation

This README covers installation and quick-start examples. See docs/USAGE.md for extended usage instructions and additional examples.

Contributing

Contributions are welcome! See CONTRIBUTING.md for instructions. The repository uses pre-commit hooks, and all pull requests are tested with pytest.
