OpenINTEL Domain Fetcher

A Python utility for downloading, chunking, and analyzing domain data from the OpenINTEL project.

Note: This project is currently under development. Features may change and the API is not yet stable.

Overview

This tool downloads domain information from OpenINTEL datasets, processes large parquet files into smaller chunks, and provides options for converting to CSV format. It's designed for efficient handling of large datasets on systems with limited resources.

Features

Download zone files from OpenINTEL
Stream large parquet files directly to manageable chunks
Process existing parquet files into chunks
Convert parquet files to CSV format
Filter downloads by date, source, and more
Memory-efficient processing through chunking
Graceful shutdown with Ctrl+C

Requirements

beautifulsoup4
python-whois
tldextract
dnspython
openpyxl
ipwhois
pyarrow
pandas
requests

Installation

Clone this repository:

git clone https://github.com/yourusername/todayDomains.git
cd todayDomains

Install the required packages:
```
pip install -r requirements.txt
```

Usage

Basic Usage

python main.py --accept-terms --date 2025-03-29

Command Line Arguments

--accept-terms: Required. Accept the OpenINTEL terms of service
--date: Optional. Specific date to fetch (YYYY-MM-DD)
--sources: Optional. Specific sources to process (e.g., 'source=com')
--max-days: Optional. Maximum days per month to process
--csv: Optional. Convert .parquet files to .csv
--no-chunking: Optional. Disable chunking for large files
--chunk-size: Optional. Specify chunk size (default: 8192 rows)

Examples

Download data for a specific date:

python main.py --accept-terms --date 2025-03-29

Download data and convert to CSV:

python main.py --accept-terms --date 2025-03-29 --csv

Process only specific TLDs:

python main.py --accept-terms --date 2025-03-29 --sources "source=com" "source=net"

Specify a larger chunk size:

python main.py --accept-terms --date 2025-03-29 --chunk-size 16384

Chunking Behavior

The tool automatically streams downloaded files into chunks to optimize memory usage:

Files are downloaded to a temporary location
The data is processed into chunks of the specified size
Each chunk is stored as a separate parquet file in a folder named after the original file
A manifest.json file is created with metadata about the chunks
The temporary file is removed

File Structure

todayDomains/
├── main.py             # Main program file
├── analyze.py          # Analysis utilities
├── requirements.txt    # Package requirements
├── config/             # Configuration files
├── zoneFile/           # Downloaded zone files and chunks
└── worker.log          # Program log file

Interrupting the Program

The program can be safely interrupted with Ctrl+C. It will:

Stop requesting new downloads
Finish processing the current file when possible
Clean up any temporary files
Shut down gracefully

License

CC BY-NC 4.0 License

Acknowledgments

Data provided by the OpenINTEL project.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
__pycache__		__pycache__
README.md		README.md
analyze.py		analyze.py
main.py		main.py
openIntelCrawler.py		openIntelCrawler.py
requirements.txt		requirements.txt
url_structure.json		url_structure.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

OpenINTEL Domain Fetcher

Overview

Features

Requirements

Installation

Usage

Basic Usage

Command Line Arguments

Examples

Chunking Behavior

File Structure

Interrupting the Program

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Languages

R0drigo-0/domainsAnalyzer

Folders and files

Latest commit

History

Repository files navigation

OpenINTEL Domain Fetcher

Overview

Features

Requirements

Installation

Usage

Basic Usage

Command Line Arguments

Examples

Chunking Behavior

File Structure

Interrupting the Program

License

Acknowledgments

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages