Universal Scraper

This script scrapes data from a webpage and stores it in a MongoDB database. The extracted data follows the JSON structures specified in the OpenAI prompt, which gives a standardized way to collect detailed product information, metadata, tags, and more. The script can process a single URL or a batch of URLs from a CSV file, and it uses both the Jina AI and OpenAI APIs for data extraction.

Setup

Environment Variables:
Make sure you have a .env file with the following variables:
- JINA_API_KEY: Your Jina AI API key.
- OPENAI_API_KEY: Your OpenAI API key.
- MONGODB_URI: URI for MongoDB connection.
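
A quick pre-flight check can catch a missing key before any request is made. The sketch below is not taken from the script; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and it only assumes that the three variables listed above end up in the process environment.

```python
import os

# Variable names come from the list above; the helper itself is illustrative.
REQUIRED_VARS = ("JINA_API_KEY", "OPENAI_API_KEY", "MONGODB_URI")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# With python-dotenv installed, calling dotenv.load_dotenv() beforehand
# copies the values from .env into os.environ.
missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```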

Dependencies:
Install the required libraries with:

```
pip install python-dotenv requests openai pymongo
```

Usage

JSON Prompt Customization

The script uses JSON structures within the OpenAI prompt to extract specific data fields. Modify these JSON objects in the extract_structured_data method to suit your desired data fields and structure. This customization allows you to capture precisely the data you need from each webpage.
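
As a rough illustration of this kind of customization, the sketch below embeds a JSON template in a prompt string. The field names in `PRODUCT_TEMPLATE` and the `build_prompt` helper are placeholders invented for this example; the actual prompt inside extract_structured_data may be worded and structured differently.

```python
import json

# Hypothetical template: these field names are placeholders, not the script's own.
PRODUCT_TEMPLATE = {
    "title": "",
    "price": "",
    "tags": [],
    "metadata": {"brand": "", "sku": ""},
}


def build_prompt(page_text, template=PRODUCT_TEMPLATE):
    """Embed the JSON template in an instruction asking for matching output."""
    schema = json.dumps(template, indent=2)
    return (
        "Extract the following fields from the page content and reply with "
        "JSON only, using exactly this structure:\n"
        f"{schema}\n\n"
        f"Page content:\n{page_text}"
    )
```

Adding or removing keys in the template changes which fields the model is asked to return, so the template is the only part that needs editing for a new kind of page.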

Running the Script

Single URL Processing
To scrape data from a single URL:

```
scraper.process_single_url("https://example.com/product", "output.json")
```

Multiple URLs from a CSV File
To process multiple URLs stored in a CSV file (one URL per line):
```
scraper.process_urls('urls.csv', 'scraped_data.json')
```
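
For reference, reading a one-URL-per-row CSV can be sketched with the standard library as follows. `read_urls` is a hypothetical helper, not the script's own loader, and skipping blank rows is an assumption about the desired behavior.

```python
import csv
import io


def read_urls(csv_file):
    """Return the first cell of every non-empty row, stripped of whitespace."""
    urls = []
    for row in csv.reader(csv_file):
        if row and row[0].strip():
            urls.append(row[0].strip())
    return urls


# In-memory stand-in for urls.csv; blank rows are skipped.
sample = io.StringIO("https://example.com/a\n\nhttps://example.com/b\n")
print(read_urls(sample))
```

With a real file, the same function could be called as `read_urls(open("urls.csv", newline=""))`.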

Output

- MongoDB Storage: Scraped data is stored in MongoDB, with each entry associated with a unique document ID based on its URL.
- JSON Output File: If an output file is specified, the results are also saved as a JSON file.
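
One way to get a stable, URL-based document ID is to hash the URL. The sketch below assumes a SHA-256 hex digest used as `_id`; the script's actual ID scheme is not documented here, and `document_id`, `upsert_args`, and `save_json` are hypothetical helpers.

```python
import hashlib
import json


def document_id(url):
    """Derive a stable ID from the URL (assumed scheme: SHA-256 hex digest)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert_args(url, data):
    """Build the (filter, update) pair for an upsert keyed on the URL-derived ID."""
    return {"_id": document_id(url)}, {"$set": {"url": url, **data}}


def save_json(records, path):
    """Write all scraped records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
```

With pymongo, the pair would be passed as `collection.update_one(flt, update, upsert=True)`, so scraping the same URL again overwrites the existing document instead of creating a duplicate.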

Additional Notes

- CSV Input: The script reads URLs from a CSV file where each row contains a single URL.
- Error Handling: The script includes basic error handling for both network and database operations.
- Debugging: The script prints debug information for requests, headers, and MongoDB operations to help troubleshoot issues.
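
The general pattern behind the error handling and debug output can be sketched as below. This is an illustration only: it uses the standard `logging` module rather than print statements, and `process_all` and its `handler` argument are hypothetical names, not functions from the script.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("scraper")


def process_all(urls, handler):
    """Run handler on each URL; log failures and return (results, failed_urls)."""
    results, failed = [], []
    for url in urls:
        log.debug("Processing %s", url)
        try:
            results.append(handler(url))
        except Exception as exc:  # broad on purpose: one bad URL should not stop the batch
            log.error("Failed on %s: %s", url, exc)
            failed.append(url)
    return results, failed
```

Collecting the failed URLs separately makes it easy to retry them later without reprocessing the whole batch.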

IncomeStreamSurfer/universal_scraping_script
