A scalable web crawler built with Scrapy to extract job listings from multiple platforms. The project includes:
- Modular spiders for targeted websites (e.g., LinkedinSpider, IndeedSpider); a minimal spider sketch follows this list.
- Data pipelines for cleaning and storing results (CSV, JSON, or databases); a pipeline sketch also appears below.
- Configurable settings (user-agent, delay, proxies) to avoid blocking.
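The spiders follow standard Scrapy conventions. The sketch below shows roughly what a modular spider in this project might look like; it is a minimal illustration that assumes the `demo_spider` name and the `-a domain` / `-a url` arguments used in the run command later in this README, and the CSS selectors are placeholders, not the project's actual code.

```python
# Minimal sketch of a modular job spider (illustrative only; selectors and
# item fields are assumptions, not the project's actual implementation).
import scrapy


class DemoJobsSpider(scrapy.Spider):
    name = "demo_spider"

    def __init__(self, domain=None, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Accept the same -a domain=... and -a url=... arguments used in
        # the run command shown later in this README.
        self.allowed_domains = [domain] if domain else []
        self.start_urls = [url] if url else []

    def parse(self, response):
        # Placeholder selectors: adjust them to the target site's markup.
        for job in response.css("div.job-listing"):
            yield {
                "title": job.css("h2::text").get(),
                "company": job.css(".company::text").get(),
                "link": response.urljoin(job.css("a::attr(href)").get()),
            }
        # Follow pagination, if the site exposes a "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```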
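The cleaning and storage pipelines are ordinary Scrapy item pipelines. Below is a hedged sketch of a cleaning step (whitespace stripping plus dropping incomplete items); the field names and rules are assumptions. Writing CSV or JSON output can be left to Scrapy's built-in feed exports via the `-o` flag shown in the run command below.

```python
# Hypothetical cleaning pipeline; field names are assumptions. Enable it in
# settings.py through the ITEM_PIPELINES setting.
from scrapy.exceptions import DropItem


class CleanJobPipeline:
    def process_item(self, item, spider):
        # Strip surrounding whitespace from all string fields.
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        # Drop listings without a title; adjust this rule as needed.
        if not item.get("title"):
            raise DropItem(f"Missing title in {item!r}")
        return item
```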
To use the jobs crawler, you need to meet the following requirements:
- Python 3.x
- Scrapy library
- Additional dependencies listed in requirements.txt (you can verify your environment as shown below)
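If you want to confirm that a suitable Python 3 interpreter is available before installing anything, you can check its version first:
python --version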
Once you have the required environment set up, you can follow these steps to install and run the tool:
- Clone the project repository
- Navigate to the project directory:
cd jobs-crawler
- Install the required dependencies:
pip install -r requirements.txt
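After the dependencies are installed, you can confirm that Scrapy is available and can see the project's spiders by running the following from the project directory:
scrapy version
scrapy list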
From the project directory, run the spider with the following command:
scrapy crawl demo_spider -a domain=example.com -a url=http://www.example.com -o links.csv
This command will start the spider and save the results in a CSV file named links.csv.
Remember to replace 'example.com' with the actual domain you want to crawl, and adjust the starting URL if needed.
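Scrapy's standard command-line flags also let you change the output format or slow a single run down; for example, the same command can write JSON and apply a two-second download delay:
scrapy crawl demo_spider -a domain=example.com -a url=http://www.example.com -o links.json -s DOWNLOAD_DELAY=2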
Please be aware that this tool sends requests to the target website and places some load on it; use it responsibly and keep request rates within reasonable limits. This tool is provided as-is, and we are not responsible for any issues caused by its use.
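One way to keep the load on target sites low is through the project's settings.py. The snippet below is a generic sketch built from standard Scrapy settings; the specific values and the user-agent string are assumptions rather than this project's defaults.

```python
# settings.py (sketch): conservative values to reduce load and blocking risk.
ROBOTSTXT_OBEY = True                  # respect robots.txt
DOWNLOAD_DELAY = 2                     # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True            # adapt the delay to server response times
USER_AGENT = "jobs-crawler (+https://example.com/contact)"  # assumed identifier

# Proxies can be supplied per request via request.meta["proxy"], which
# Scrapy's built-in HttpProxyMiddleware picks up.
```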
If you discover any issues or opportunities for improvement, feel free to raise an Issue or submit a Pull Request. Your contributions are greatly appreciated!
This project is open-source under the Apache License 2.0.