🚀 SubPDF 🚀

🌐 Discover Hidden Domains & Subdomains from PDFs Effortlessly!

SubPDF is a command-line tool that extracts domains and subdomains from PDF files. Whether you're a cybersecurity analyst, a data scientist, or someone managing large document repositories, SubPDF streamlines the process of uncovering link information embedded within PDFs.

🌟 Features

⚡ Multi-threaded Fast Processing: Processes many PDFs concurrently on a pool of worker threads (100 by default, configurable with -t).
🔍 Comprehensive Link Extraction: Extracts both domains and subdomains from clickable annotations and text within PDFs.
📂 Flexible Input Sources: Accepts a webpage URL to crawl, a direct PDF link, a text file with one URL per line, or a JSON file containing URLs.
🖥️ Customizable Output: Choose from five output formats: default (detailed report), simple, json, list, and domains.
🗄️ Ephemeral & Permanent Storage: Option to store downloaded PDFs temporarily (auto-deleted after processing) or permanently.
🔐 Custom HTTP Headers: Add custom headers to HTTP requests for enhanced compatibility and access.
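
As a rough illustration of the text side of link extraction, the sketch below pulls hostnames out of a string with a regular expression and `urllib.parse`. It is not SubPDF's actual implementation: the function name `extract_hosts` and the URL pattern are assumptions made for this example, and reading text or annotations out of a PDF is left out entirely.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: matches http(s) URLs up to whitespace or common delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_hosts(text):
    """Return the sorted, de-duplicated hostnames found in `text`."""
    hosts = set()
    for match in URL_PATTERN.findall(text):
        # Drop trailing sentence punctuation before parsing the URL.
        host = urlsplit(match.rstrip(".,;")).hostname
        if host:
            hosts.add(host)  # hostname is already lower-cased, without port
    return sorted(hosts)
```

Using `urlsplit(...).hostname` rather than slicing the string by hand means ports and letter case are normalized consistently.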

💻 Installation

1. Clone the Repository

git clone https://github.com/securezeron/SubPDF.git
cd SubPDF

2. Create a Virtual Environment (Recommended)

python -m venv venv

Activate the Virtual Environment:
- Windows:
```
venv\Scripts\activate
```
- Unix/Linux/MacOS:
```
source venv/bin/activate
```

3. Install Dependencies

pip install -r requirements.txt

Note: Ensure you have Python 3.7 or higher installed.

🛠️ Usage

SubPDF is run from the command line. The sections below cover the basic invocation, the available options, and the output formats.

📚 Basic Commands

python main.py [OPTIONS]

🔧 Advanced Options

📥 Input Sources:
- -u, --url
  Description: Webpage URL to crawl for PDF links.
  Usage:
```
python main.py -u https://www.example.com
```
- -pu, --pdf-url
  Description: Direct PDF link to parse.
  Usage:
```
python main.py -pu https://www.example.com/sample.pdf
```
- -il, --input-list
  Description: A text file containing one URL per line.
  Usage:
```
python main.py -il urls.txt
```
- -ij, --input-json
  Description: A JSON file containing a list of URLs or a 'urls' array.
  Usage:
```
python main.py -ij urls.json
```
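
The two file-based inputs could be read roughly as follows. This is a minimal sketch under stated assumptions, not SubPDF's own code: the function names `load_urls_from_list` and `load_urls_from_json` are hypothetical, and the only behavior taken from the description above is that the text file holds one URL per line and the JSON file holds either a bare list or an object with a 'urls' array.

```python
import json
from pathlib import Path


def load_urls_from_list(path):
    """Read one URL per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def load_urls_from_json(path):
    """Accept either a bare JSON list of URLs or an object with a 'urls' array."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        data = data.get("urls", [])
    if not isinstance(data, list):
        raise ValueError("expected a JSON list or an object with a 'urls' array")
    return [str(url) for url in data]
```
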
🗄️ Storage Options:
- -f, --download-folder
  Description: Directory to store downloaded PDFs permanently. If omitted, PDFs are stored temporarily and auto-deleted after processing.
  Usage:
```
python main.py -u https://www.example.com -f /path/to/downloads
```
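
The difference between temporary and permanent storage can be sketched with a small context manager. The name `pdf_storage` and the `subpdf_` directory prefix are assumptions made for illustration; how SubPDF manages its download directory internally may differ.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def pdf_storage(download_folder=None):
    """Yield a directory for downloaded PDFs.

    With a folder given, files are kept after the block exits.
    Without one, a temporary directory is created and deleted on exit.
    """
    if download_folder is not None:
        folder = Path(download_folder)
        folder.mkdir(parents=True, exist_ok=True)
        yield folder
    else:
        with tempfile.TemporaryDirectory(prefix="subpdf_") as tmp:
            yield Path(tmp)
```

Calling `with pdf_storage() as folder:` gives a throwaway directory, while `with pdf_storage("./downloads") as folder:` keeps whatever is written there.
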
🔐 Custom HTTP Headers:
- -H, --header
  Description: Add custom HTTP headers in the format 'HeaderName: Value'. Can be used multiple times.
  Usage:
```
python main.py -u https://www.example.com -H "Authorization: Bearer YOUR_TOKEN" -H "Custom-Header: Value"
```
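
One way a repeatable `-H` option could be parsed is with argparse's `append` action, shown in the sketch below. The helper names `parse_header` and `build_parser` are hypothetical, and this is not necessarily how SubPDF defines the option.

```python
import argparse


def parse_header(raw):
    """Split 'HeaderName: Value' into a (name, value) pair."""
    name, sep, value = raw.partition(":")  # split on the first colon only
    if not sep or not name.strip():
        raise argparse.ArgumentTypeError(
            f"invalid header {raw!r}, expected 'HeaderName: Value'"
        )
    return name.strip(), value.strip()


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    # action="append" collects every -H occurrence into one list.
    parser.add_argument("-H", "--header", action="append",
                        type=parse_header, default=[])
    return parser


args = build_parser().parse_args(
    ["-H", "Authorization: Bearer YOUR_TOKEN", "-H", "Custom-Header: Value"]
)
headers = dict(args.header)
```

A dictionary built this way can be passed as the `headers` argument of `requests.get`.
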
📤 Output Configuration:
- --format
  Description: Choose the output format.
  Options: default, simple, json, list, domains
  Default: default
  Usage:
```
python main.py -u https://www.example.com --format json
```
- --output-file
  Description: Write the final output to a specified file. If omitted, output is printed to stdout.
  Usage:
```
python main.py -u https://www.example.com --output-file results.json
```
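
The file-or-stdout behavior described for --output-file amounts to a small branch, sketched here. The function name `write_output` is an assumption for the example, not a SubPDF function.

```python
import sys


def write_output(text, output_file=None):
    """Write `text` to `output_file`, or to stdout when no file is given."""
    if output_file is None:
        sys.stdout.write(text + "\n")
        return
    with open(output_file, "w", encoding="utf-8") as handle:
        handle.write(text + "\n")
```
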
⚙️ Performance:
- -t, --threads
  Description: Number of threads to use for PDF processing.
  Default: 100
  Usage:
```
python main.py -u https://www.example.com -t 50
```
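
Concurrent PDF processing of this kind is commonly built on `concurrent.futures.ThreadPoolExecutor`. The sketch below assumes a hypothetical `worker` callable that parses one PDF and returns its hosts; it illustrates the pattern only and is not taken from SubPDF's source.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(pdf_paths, worker, threads=100):
    """Run `worker(path)` for every path on a thread pool and collect results.

    `worker` is a placeholder for whatever parses a single PDF.
    Failures are recorded per path instead of aborting the whole run.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(worker, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one PDF is malformed
                errors[path] = exc
    return results, errors
```

Threads suit this workload because most of the time is spent waiting on downloads and file I/O. A lower `-t` value may be preferable when the target server rate-limits requests.
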
🐞 Debugging:
- --debug
  Description: Enable debug logs for detailed information.
  Usage:
```
python main.py -u https://www.example.com --debug
```

🔍 Help Command

For a comprehensive list of options and their descriptions, use:

python main.py --help

📈 Output Formats

SubPDF offers multiple output formats to cater to different user needs:

Default (default):
- Description: Detailed report listing each PDF and its associated domains and subdomains.
- Usage:
```
python main.py -u https://www.example.com --format default
```
Simple (simple):
- Description: Line-by-line list of all extracted domains and subdomains without additional labeling.
- Usage:
```
python main.py -u https://www.example.com --format simple
```
JSON (json):
- Description: Structured JSON output mapping each PDF to its domains and subdomains.
- Usage:
```
python main.py -u https://www.example.com --format json --output-file results.json
```
List (list):
- Description: Bulleted list format for easy readability.
- Usage:
```
python main.py -u https://www.example.com --format list
```
Domains Only (domains):
- Description: Extracted registered domains using tldextract, omitting subdomains for a cleaner overview.
- Usage:
```
python main.py -u https://www.example.com --format domains
```
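
To make the differences between the formats concrete, here is a sketch that renders the same results four ways. The function `render` and the exact text layouts are assumptions, so SubPDF's real output may look different. The domains format is omitted because it depends on tldextract to determine registered domains.

```python
import json


def render(results, fmt="default"):
    """Render {pdf_name: [hosts]} as text in one of four formats."""
    if fmt == "json":
        return json.dumps(results, indent=2, sort_keys=True)
    all_hosts = sorted({host for hosts in results.values() for host in hosts})
    if fmt == "simple":
        return "\n".join(all_hosts)
    if fmt == "list":
        return "\n".join(f"- {host}" for host in all_hosts)
    if fmt == "default":
        lines = []
        for pdf, hosts in sorted(results.items()):
            lines.append(f"{pdf}:")
            lines.extend(f"  {host}" for host in sorted(hosts))
        return "\n".join(lines)
    raise ValueError(f"unknown format: {fmt}")
```
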

🔥 Examples

1. Extract Domains from a Single Webpage

python main.py -u https://www.example.com

Description: Crawls the specified webpage for PDF links, extracts domains and subdomains from each PDF, and displays the results in the default format.
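
The crawling step, finding PDF links on a page, can be sketched as follows. SubPDF lists requests and beautifulsoup4 among its dependencies; to stay self-contained, this example swaps in the standard library's `html.parser` and works on an HTML string that has already been fetched. The names `PdfLinkParser` and `find_pdf_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> targets whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        is_pdf = urlsplit(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_links:
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return the PDF links found in `html`, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

Checking the URL path rather than the whole URL keeps links such as `report.pdf?v=2` from being missed.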

2. Process Multiple PDFs from a List with Custom Headers

python main.py -il urls.txt -H "Authorization: Bearer YOUR_TOKEN" -H "Custom-Header: Value" --output-file domains.json

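Description: Reads URLs from urls.txt, sends the custom HTTP headers with each request, extracts domains and subdomains, and saves the results to domains.json. Because --format is not given, the file is written in the default format; add --format json for JSON output.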

3. Extract Domains in JSON Format with Debug Logs

python main.py -pu https://www.example.com/sample.pdf --format json --debug

Description: Processes a direct PDF link, outputs the results in JSON format, and provides detailed debug logs for troubleshooting.

4. Limit Processing to 50 Threads and Store PDFs Permanently

python main.py -u https://www.example.com -t 50 -f ./downloads

Description: Uses up to 50 threads for processing and stores downloaded PDFs permanently in the ./downloads directory.

📦 Dependencies

SubPDF relies on the Python libraries listed in requirements.txt.

Installation via pip:

pip install -r requirements.txt

Contents of requirements.txt:

requests
beautifulsoup4
PyPDF2
tldextract
tqdm
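
A quick way to confirm that the packages are importable after installation is sketched below. The helper `missing_packages` is hypothetical and not part of SubPDF. The mapping records each package's import name, which differs from the PyPI name for beautifulsoup4 (imported as `bs4`).

```python
from importlib.util import find_spec

# Package name on PyPI -> module name used in `import` statements.
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "PyPDF2": "PyPDF2",
    "tldextract": "tldextract",
    "tqdm": "tqdm",
}


def missing_packages(required=REQUIRED):
    """Return the PyPI names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]
```

An empty list from `missing_packages()` means every dependency can be imported in the active environment.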

🤝 Contributing

Contributions are the heart of SubPDF! Whether it's reporting a bug, suggesting a feature, or submitting a pull request, your input helps make SubPDF better for everyone.

1. Fork the Repository

Click the Fork button at the top right of the repository page.

2. Create a New Branch

git checkout -b feature/YourFeatureName

3. Make Your Changes

Implement your feature or fix the bug.

4. Commit Your Changes

git commit -m "Add a descriptive message about your changes"

5. Push to Your Fork

git push origin feature/YourFeatureName

6. Create a Pull Request

Navigate to the original repository and click on New Pull Request.

📝 License

Distributed under the MIT License.

Note: Ensure that you include a LICENSE file in your repository with the appropriate license text.

📞 Contact

Suman Chakraborty
Email: [email protected]
LinkedIn: linkedin.com/in/suman-chakraborty-b857901b1
GitHub: github.com/suman-zeron

For any inquiries, issues, or feedback, feel free to reach out!

🙏 Acknowledgements

Made with ❤️ by Suman@ZERON
