Jobs Crawler

A scalable web crawler built with Scrapy to extract job listings from multiple platforms. The project includes:

  • Modular spiders for targeted websites (e.g., LinkedinSpider, IndeedSpider); see the sketch after this list.
  • Data pipelines for cleaning and storing results (CSV, JSON, or databases).
  • Configurable settings (user-agent, delay, proxies) to avoid blocking.
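
As a minimal sketch, here is what one of the modular spiders might look like. The LinkedinSpider name and the domain/url arguments come from this README; the selectors and item fields are illustrative assumptions, not the project's actual code.

    import scrapy

    class LinkedinSpider(scrapy.Spider):
        # Name used with `scrapy crawl`; accepts -a domain=... -a url=...
        # as shown in the Usage Instructions below.
        name = "linkedin_spider"

        def __init__(self, domain=None, url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.allowed_domains = [domain] if domain else []
            self.start_urls = [url] if url else []

        def parse(self, response):
            # Hypothetical selectors: adjust to the target site's markup.
            for job in response.css("div.job-card"):
                yield {
                    "title": job.css("h3::text").get(),
                    "company": job.css(".company::text").get(),
                    "link": response.urljoin(job.css("a::attr(href)").get()),
                }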

Installation

To run the jobs crawler, you will need:

  • Python 3.x
  • The Scrapy library
  • The dependencies listed in requirements.txt

Once you have the required environment set up, you can follow these steps to install and run the tool:

  1. Clone the project repository: git clone https://github.com/wangeguo/jobs-crawler.git
  2. Navigate to the project directory: cd jobs-crawler
  3. Install the required dependencies: pip install -r requirements.txt
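
If the installation succeeded, Scrapy's command-line tool should be available on your path. You can verify this with:

    scrapy version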

Usage Instructions

Navigate to your project directory and run the spider using the following command:

scrapy crawl demo_spider -a domain=example.com -a url=http://www.example.com -o links.csv

This command will start the spider and save the results in a CSV file named links.csv.

Replace example.com with the actual domain you want to crawl, and adjust the starting URL as needed.
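
Scrapy infers the feed format from the output file's extension, so the same command can write JSON instead of CSV:

    scrapy crawl demo_spider -a domain=example.com -a url=http://www.example.com -o links.json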

Notes

Be aware that crawling places load on the target website; use this tool responsibly and keep request rates within reasonable limits. This tool is provided as-is, and the authors are not responsible for any issues caused by its use.
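
One way to keep the crawler polite is through Scrapy's standard throttling settings. The values below are an illustrative sketch, not the project's actual defaults:

    # settings.py -- illustrative values, not the project's defaults
    ROBOTSTXT_OBEY = True          # respect the target site's robots.txt
    DOWNLOAD_DELAY = 2.0           # seconds between requests to the same site
    AUTOTHROTTLE_ENABLED = True    # adapt the delay to server response times
    USER_AGENT = "jobs-crawler (+https://github.com/wangeguo/jobs-crawler)"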

Contributions

If you discover any issues or opportunities for improvement, feel free to raise an Issue or submit a Pull Request. Your contributions are greatly appreciated!

License

This project is open source under the Apache License 2.0.
