Discover Hidden Domains & Subdomains from PDFs Effortlessly!
SubPDF is a sleek, powerful, and user-friendly command-line tool designed to extract domains and subdomains from PDF files with lightning speed. Whether you're a cybersecurity analyst, data scientist, or simply managing extensive document repositories, SubPDF streamlines the process of uncovering valuable link information embedded within PDFs.
- Features
- Installation
- Usage
- Output Formats
- Examples
- Dependencies
- Contributing
- License
- Contact
- Acknowledgements
- Multi-threaded Fast Processing: Utilize multi-threading to handle numerous PDFs concurrently, drastically reducing processing time.
- Comprehensive Link Extraction: Extracts both domains and subdomains from clickable annotations and text within PDFs.
- Flexible Input Sources: Accepts single URLs, direct PDF links, input lists (text files), and JSON files containing URLs.
- Customizable Output: Choose from various output formats including JSON, simple lists, detailed reports, and more.
- Ephemeral & Permanent Storage: Option to store downloaded PDFs temporarily (auto-deleted after processing) or permanently.
- Custom HTTP Headers: Add custom headers to HTTP requests for enhanced compatibility and access.
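As a rough illustration of the text-based part of link extraction, the sketch below pulls hostnames out of a string of already-extracted PDF text. The pattern and the `extract_hostnames` helper are assumptions made for this example, not SubPDF's actual implementation, and annotation-based links are not covered here.

```python
import re
from typing import Set

# Matches http(s) URLs and bare hostnames such as "api.example.com".
# Illustrative simplification only; not taken from SubPDF's source.
HOST_PATTERN = re.compile(
    r"(?:https?://)?((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63})",
    re.IGNORECASE,
)


def extract_hostnames(text: str) -> Set[str]:
    """Return the lowercase hostnames found in a block of text."""
    return {match.group(1).lower() for match in HOST_PATTERN.finditer(text)}


if __name__ == "__main__":
    sample = "See https://docs.example.com/guide and mail.example.org for details."
    print(sorted(extract_hostnames(sample)))
```

A pattern this loose also matches strings such as file names ending in `.pdf`, so a real tool would likely validate candidates against a public suffix list before reporting them.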
git clone https://github.com/securezeron/SubPDF.git
cd SubPDF
python -m venv venv
- Activate the Virtual Environment:
- Windows:
venv\Scripts\activate
- Unix/Linux/MacOS:
source venv/bin/activate
pip install -r requirements.txt
Note: Ensure you have Python 3.7 or higher installed.
SubPDF is a command-line tool. Below are instructions to help you get started seamlessly.
python main.py [OPTIONS]
- Input Sources:
  - -u, --url
    Description: Webpage URL to crawl for PDF links.
    Usage: python main.py -u https://www.example.com
  - -pu, --pdf-url
    Description: Direct PDF link to parse.
    Usage: python main.py -pu https://www.example.com/sample.pdf
  - -il, --input-list
    Description: A text file containing one URL per line.
    Usage: python main.py -il urls.txt
  - -ij, --input-json
    Description: A JSON file containing a list of URLs or a 'urls' array.
    Usage: python main.py -ij urls.json
- Storage Options:
  - -f, --download-folder
    Description: Directory to store downloaded PDFs permanently. If omitted, PDFs are stored temporarily and auto-deleted after processing.
    Usage: python main.py -u https://www.example.com -f /path/to/downloads
- Custom HTTP Headers:
  - -H, --header
    Description: Add custom HTTP headers in the format 'HeaderName: Value'. Can be used multiple times.
    Usage: python main.py -u https://www.example.com -H "Authorization: Bearer YOUR_TOKEN" -H "Custom-Header: Value"
- Output Configuration:
  - --format
    Description: Choose the output format.
    Options: default, simple, json, list, domains
    Default: default
    Usage: python main.py -u https://www.example.com --format json
  - --output-file
    Description: Write the final output to a specified file. If omitted, output is printed to stdout.
    Usage: python main.py -u https://www.example.com --output-file results.json
- Performance:
  - -t, --threads
    Description: Number of threads to use for PDF processing.
    Default: 100
    Usage: python main.py -u https://www.example.com -t 50
- Debugging:
  - --debug
    Description: Enable debug logs for detailed information.
    Usage: python main.py -u https://www.example.com --debug
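To make the two file-based input sources concrete, here is a minimal sketch of how a text list and a JSON file in either accepted shape (a top-level list or an object with a 'urls' array) might be read. The helper names `load_url_list` and `load_url_json` are hypothetical and are not taken from SubPDF's code.

```python
import json
from pathlib import Path
from typing import List


def load_url_list(path: str) -> List[str]:
    """Read one URL per line, skipping blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def load_url_json(path: str) -> List[str]:
    """Accept a top-level JSON list or an object with a 'urls' array."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        data = data.get("urls", [])
    if not isinstance(data, list):
        raise ValueError("expected a JSON list or an object with a 'urls' array")
    return [str(item) for item in data]
```

Under these assumptions, a `urls.json` file containing `["https://www.example.com"]` and one containing `{"urls": ["https://www.example.com"]}` would yield the same list.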
For a comprehensive list of options and their descriptions, use:
python main.py --help
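For readers who want to see how the documented flags fit together, the following is a hypothetical `argparse` reconstruction based only on the option descriptions in this README. SubPDF's real parser may differ in structure, help text, and validation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Assumed reconstruction of the CLI from the documented options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-u", "--url", help="Webpage URL to crawl for PDF links.")
    parser.add_argument("-pu", "--pdf-url", help="Direct PDF link to parse.")
    parser.add_argument("-il", "--input-list", help="Text file with one URL per line.")
    parser.add_argument("-ij", "--input-json", help="JSON file with a list of URLs or a 'urls' array.")
    parser.add_argument("-f", "--download-folder", help="Directory to keep downloaded PDFs.")
    # action="append" lets -H be repeated, collecting each value into a list.
    parser.add_argument("-H", "--header", action="append", default=[],
                        help="Custom header in the form 'HeaderName: Value'.")
    parser.add_argument("--format", default="default",
                        choices=["default", "simple", "json", "list", "domains"])
    parser.add_argument("--output-file", help="Write output here instead of stdout.")
    parser.add_argument("-t", "--threads", type=int, default=100)
    parser.add_argument("--debug", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-pu", "https://www.example.com/sample.pdf", "--format", "json", "-t", "50"]
    )
    print(args.pdf_url, args.format, args.threads)
```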
SubPDF offers multiple output formats to cater to different user needs:
- Default (default):
  - Description: Detailed report listing each PDF and its associated domains and subdomains.
  - Usage: python main.py -u https://www.example.com --format default
- Simple (simple):
  - Description: Line-by-line list of all extracted domains and subdomains without additional labeling.
  - Usage: python main.py -u https://www.example.com --format simple
- JSON (json):
  - Description: Structured JSON output mapping each PDF to its domains and subdomains.
  - Usage: python main.py -u https://www.example.com --format json --output-file results.json
- List (list):
  - Description: Bulleted list format for easy readability.
  - Usage: python main.py -u https://www.example.com --format list
- Domains Only (domains):
  - Description: Extracted registered domains using tldextract, omitting subdomains for a cleaner overview.
  - Usage: python main.py -u https://www.example.com --format domains
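The difference between the formats is easiest to see on a small data set. The sketch below renders a hypothetical result mapping (PDF URL to hostnames) in three of the formats. The exact layout SubPDF prints is not documented here, so the output shapes are assumptions; the default and domains formats are left out because their layout and the tldextract step are not specified.

```python
import json
from typing import Dict, List


def render(results: Dict[str, List[str]], fmt: str) -> str:
    """Render results in an assumed layout for 'simple', 'list', or 'json'."""
    hosts = sorted({host for found in results.values() for host in found})
    if fmt == "simple":
        # One hostname per line, no labels.
        return "\n".join(hosts)
    if fmt == "list":
        # Bulleted list of hostnames.
        return "\n".join(f"- {host}" for host in hosts)
    if fmt == "json":
        # Mapping of each PDF to its hostnames.
        return json.dumps({pdf: sorted(found) for pdf, found in results.items()}, indent=2)
    raise ValueError(f"unsupported format in this sketch: {fmt}")


if __name__ == "__main__":
    sample = {"https://www.example.com/a.pdf": ["docs.example.com", "example.com"]}
    for fmt in ("simple", "list", "json"):
        print(render(sample, fmt))
```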
python main.py -u https://www.example.com
- Description: Crawls the specified webpage for PDF links, extracts domains and subdomains from each PDF, and displays the results in the default format.
python main.py -il urls.txt -H "Authorization: Bearer YOUR_TOKEN" -H "Custom-Header: Value" --output-file domains.json
- Description: Reads URLs from urls.txt, uses custom HTTP headers for requests, extracts domains and subdomains, and saves the results in domains.json.
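Repeated -H values like the ones in this example need to be turned into a header mapping before a request is made. A minimal sketch, assuming each value follows the documented 'HeaderName: Value' form; `parse_headers` is a hypothetical helper, not a function from SubPDF.

```python
from typing import Dict, Iterable


def parse_headers(raw_headers: Iterable[str]) -> Dict[str, str]:
    """Split each 'HeaderName: Value' string on its first colon."""
    headers: Dict[str, str] = {}
    for raw in raw_headers:
        name, sep, value = raw.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'HeaderName: Value', got {raw!r}")
        headers[name.strip()] = value.strip()
    return headers


if __name__ == "__main__":
    print(parse_headers(["Authorization: Bearer YOUR_TOKEN", "Custom-Header: Value"]))
```

Splitting on the first colon only keeps values that themselves contain colons, such as URLs, intact. The resulting dictionary can be passed to `requests.get(url, headers=headers)`.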
python main.py -pu https://www.example.com/sample.pdf --format json --debug
- Description: Processes a direct PDF link, outputs the results in JSON format, and provides detailed debug logs for troubleshooting.
python main.py -u https://www.example.com -t 50 -f ./downloads
- Description: Uses up to 50 threads for processing and stores downloaded PDFs in the ./downloads directory permanently.
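The interaction between the thread count and the storage choice can be sketched as follows. This is an assumed design using the standard library's `ThreadPoolExecutor` and `TemporaryDirectory`; the `storage_dir` and `process_all` names, and the `worker` callback that stands in for download-and-parse, are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List, Optional


@contextmanager
def storage_dir(download_folder: Optional[str]) -> Iterator[Path]:
    """Yield a permanent folder if one is given, else a temp dir removed on exit."""
    if download_folder:
        folder = Path(download_folder)
        folder.mkdir(parents=True, exist_ok=True)
        yield folder
    else:
        with tempfile.TemporaryDirectory() as tmp:
            yield Path(tmp)


def process_all(
    pdf_urls: Iterable[str],
    worker: Callable[[str, Path], List[str]],
    threads: int = 100,
    download_folder: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Run worker(url, folder) for every URL on a pool of up to `threads` threads."""
    urls = list(pdf_urls)
    with storage_dir(download_folder) as folder:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # pool.map preserves input order, so results line up with urls.
            found = pool.map(lambda url: worker(url, folder), urls)
            return dict(zip(urls, found))
```

With `download_folder=None` the temporary directory is deleted as soon as all workers finish, which mirrors the documented behaviour when -f is omitted.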
SubPDF requires Python 3.7+ and relies on the following Python libraries:
- Requests
- BeautifulSoup4
- PyPDF2
- tldextract
- tqdm
Installation via pip:
pip install -r requirements.txt
Contents of requirements.txt:
requests
beautifulsoup4
PyPDF2
tldextract
tqdm
Contributions are the heart of SubPDF! Whether it's reporting a bug, suggesting a feature, or submitting a pull request, your input helps make SubPDF better for everyone.
Click the Fork button at the top right of the repository page.
git checkout -b feature/YourFeatureName
Implement your feature or fix the bug.
git commit -m "Add a descriptive message about your changes"
git push origin feature/YourFeatureName
Navigate to the original repository and click on New Pull Request.
Distributed under the MIT License.
Note: Ensure that you include a LICENSE file in your repository with the appropriate license text.
Suman Chakraborty
Email: [email protected]
LinkedIn: linkedin.com/in/suman-chakraborty-b857901b1
GitHub: github.com/suman-zeron
For any inquiries, issues, or feedback, feel free to reach out!
Made with ❤️ by Suman@ZERON