A Python utility for downloading, chunking, and analyzing domain data from the OpenINTEL project.
Note: This project is currently under development. Features may change and the API is not yet stable.
This tool downloads domain information from OpenINTEL datasets, processes large parquet files into smaller chunks, and provides options for converting to CSV format. It's designed for efficient handling of large datasets on systems with limited resources.
- Download zone files from OpenINTEL
- Stream large parquet files directly into manageable chunks
- Process existing parquet files into chunks
- Convert parquet files to CSV format
- Filter downloads by date, source, and more
- Memory-efficient processing through chunking
- Graceful shutdown with Ctrl+C
beautifulsoup4
python-whois
tldextract
dnspython
openpyxl
ipwhois
pyarrow
pandas
requests
- Clone this repository:
  git clone https://github.com/yourusername/todayDomains.git
  cd todayDomains
- Install the required packages:
  pip install -r requirements.txt
python main.py --accept-terms --date 2025-03-29
- --accept-terms: Required. Accept the OpenINTEL terms of service
- --date: Optional. Specific date to fetch (YYYY-MM-DD)
- --sources: Optional. Specific sources to process (e.g., 'source=com')
- --max-days: Optional. Maximum days per month to process
- --csv: Optional. Convert .parquet files to .csv
- --no-chunking: Optional. Disable chunking for large files
- --chunk-size: Optional. Specify chunk size (default: 8192 rows)
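The options above map naturally onto Python's argparse module. The following is a minimal sketch of how such a parser could be declared, not the actual contents of main.py: the names build_parser and iso_date are hypothetical, and accepting several space-separated values for --sources is an assumption.

```python
import argparse
from datetime import date, datetime


def iso_date(value: str) -> date:
    """Parse a YYYY-MM-DD string, raising a clean argparse error otherwise."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags; main.py may differ.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--accept-terms", action="store_true", required=True,
                        help="Accept the OpenINTEL terms of service")
    parser.add_argument("--date", type=iso_date,
                        help="Specific date to fetch (YYYY-MM-DD)")
    parser.add_argument("--sources", nargs="+",
                        help="Specific sources to process, e.g. source=com")
    parser.add_argument("--max-days", type=int,
                        help="Maximum days per month to process")
    parser.add_argument("--csv", action="store_true",
                        help="Convert .parquet files to .csv")
    parser.add_argument("--no-chunking", action="store_true",
                        help="Disable chunking for large files")
    parser.add_argument("--chunk-size", type=int, default=8192,
                        help="Rows per chunk (default: 8192)")
    return parser
```

With this layout, omitting --accept-terms makes argparse exit with a usage error before any download starts, which matches the flag being required.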
Download data for a specific date:
python main.py --accept-terms --date 2025-03-29
Download data and convert to CSV:
python main.py --accept-terms --date 2025-03-29 --csv
Process only specific TLDs:
python main.py --accept-terms --date 2025-03-29 --sources "source=com" "source=net"
Specify a larger chunk size:
python main.py --accept-terms --date 2025-03-29 --chunk-size 16384
The tool automatically streams downloaded files into chunks to optimize memory usage:
- Files are downloaded to a temporary location
- The data is processed into chunks of the specified size
- Each chunk is stored as a separate parquet file in a folder named after the original file
- A manifest.json file is created with metadata about the chunks
- The temporary file is removed
todayDomains/
├── main.py # Main program file
├── analyze.py # Analysis utilities
├── requirements.txt # Package requirements
├── config/ # Configuration files
├── zoneFile/ # Downloaded zone files and chunks
└── worker.log # Program log file
The program can be safely interrupted with Ctrl+C. It will:
- Stop requesting new downloads
- Finish processing the current file when possible
- Clean up any temporary files
- Shut down gracefully
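One common way to get this behavior in Python is a SIGINT handler that only sets a flag, which the main loop checks between files. The sketch below assumes that pattern. The names stop_requested, request_stop, install_handler, and process_all are hypothetical and may not match what main.py does.

```python
import signal
import threading

# Set when Ctrl+C is pressed; checked between files rather than mid-file.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: record the request instead of raising KeyboardInterrupt."""
    stop_requested.set()


def install_handler():
    # Must be called from the main thread, as required by signal.signal.
    signal.signal(signal.SIGINT, request_stop)


def process_all(files, handle_file, cleanup=lambda: None):
    """Process files in order, stopping before the next one once a stop is requested."""
    done = []
    try:
        for name in files:
            if stop_requested.is_set():
                break  # stop taking on new work
            handle_file(name)  # the current file is allowed to finish
            done.append(name)
    finally:
        cleanup()  # remove temporary files whether or not we were interrupted
    return done
```

Because the handler only sets a flag, a file that is already being processed runs to completion, and the cleanup callback runs on every exit path.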
CC BY-NC 4.0 License
Data provided by the OpenINTEL project.