An image scraping application that downloads images from Bing based on keywords provided in a CSV file. The application is built on Scrapy and downloads images concurrently using multiprocessing and threading. It is built for high-volume scraping, on the order of millions of images per day.
- Keyword-Based Image Downloading: Provide a list of keywords, and the application will download images related to those keywords.
- Concurrent Processing: Uses multiprocessing and threading to efficiently scrape images in parallel.
- Customizable Output: Specify the output folder where images will be saved.
- Error Handling: Robust error handling to ensure the application continues running even if some tasks fail.
- Python 3.6+
- Scrapy
- Pandas
- Clone the Repository

      git clone https://github.com/birdhouses/image_scraper.git
      cd image_scraper

- Install Dependencies

  Install the required Python packages:

      pip install scrapy pandas

  Ensure that Scrapy, pandas, and all their dependencies are installed.
├── modules
│ └── scraper.py # Contains the ImageScraper and BingImageSpider classes
├── app.py # Main script to run the image scraper
├── keywords.csv # CSV file containing keywords (you need to create this)
├── downloaded_images # Folder where images will be saved (created automatically)
Create a CSV file named keywords.csv in the root directory. This file should contain a column named keyword with the list of keywords you want to search for.
Example keywords.csv:
keyword
Cat
Dog
Sunset
In app.py, you can hardcode the CSV file path and the output folder name.
    # app.py
    from modules.scraper import ImageScraper

    if __name__ == '__main__':
        # Hardcode the CSV file path and output folder name here
        csv_file_path = 'keywords.csv'  # Replace with your CSV file path
        folder_name = 'downloaded_images'  # Replace with your desired output folder
        # Create an instance of the ImageScraper
        scraper = ImageScraper()
        # Call the process_image_search method with the hardcoded parameters
        result = scraper.process_image_search(csv_file_path=csv_file_path, workers=5, folder_name=folder_name)
        print(result)

Ensure that csv_file_path points to your CSV file and that folder_name is the directory where you want the images to be saved.
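Before running the scraper, it can help to confirm that the CSV file exists and has the expected keyword column. The following is a minimal sketch of such a check using only the standard-library csv module; the helper name load_keywords is hypothetical and is not part of modules/scraper.py, which may read the file differently (for example with pandas).

```python
import csv
from pathlib import Path


def load_keywords(csv_file_path):
    """Return the non-empty values of the 'keyword' column, or raise a clear error.

    Hypothetical helper for illustration; not part of the project's code.
    """
    path = Path(csv_file_path)
    if not path.is_file():
        raise FileNotFoundError(f"CSV file not found: {path}")

    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or "keyword" not in reader.fieldnames:
            raise ValueError(f"{path} must contain a column named 'keyword'")
        # Skip rows whose keyword is missing or only whitespace.
        return [
            row["keyword"].strip()
            for row in reader
            if (row["keyword"] or "").strip()
        ]
```

Calling load_keywords('keywords.csv') from the project root returns the keywords as a list of strings, or fails early with a message that names the problem.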
Execute the app.py script:
    python app.py

The script will read the keywords from the CSV file and download images for each keyword into the specified folder.
After the script finishes running, you can find the downloaded images in the downloaded_images folder (or the folder name you specified).
You can adjust the workers parameter in app.py to change the number of concurrent threads used for downloading images.
    result = scraper.process_image_search(csv_file_path=csv_file_path, workers=5, folder_name=folder_name)

Increasing the number of workers may speed up downloading, but it can also get you blocked by the target website if too many requests are made simultaneously.
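To illustrate how a worker count bounds concurrency, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor. It is not the project's implementation: fake_download and download_all are hypothetical names, and the download is simulated so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url):
    """Stand-in for a real image download; returns a label instead of fetching."""
    return f"downloaded {url}"


def download_all(urls, workers=5):
    """Run fake_download over urls with at most `workers` threads at once."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every URL up front; the pool never runs more than
        # `workers` of them at the same time.
        futures = [pool.submit(fake_download, url) for url in urls]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Results arrive in completion order, not submission order, so sort or key them by URL if the order matters.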
The application currently uses Bing for image searches. If you want to use a different search engine, you would need to modify the BingImageSpider class in modules/scraper.py.
Ensure you have the following Python packages installed:
- Scrapy
- Pandas
- Twisted (installed automatically with Scrapy)
- lxml (used by Scrapy for parsing)
Install them using:
    pip install scrapy pandas

- Respect Website Policies: Make sure you comply with Bing's terms of service and robots.txt policies when scraping images.
- Copyright: Be aware of copyright laws when downloading images. Use the images responsibly and ensure you have the rights or permissions to use them.
- Usage of Images: This application is intended for educational purposes. The developers are not responsible for any misuse of the downloaded images.
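As a sketch of what a robots.txt check can look like, the standard library's urllib.robotparser can evaluate a URL against a set of rules. The rules and URLs below are invented for illustration and are not Bing's actual robots.txt; the helper name is_allowed is also hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Bing's actual robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For a live site, RobotFileParser also offers set_url() and read() to fetch the real file before calling can_fetch().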
- CSV File Not Found: If you receive an error stating that the CSV file is not found, ensure that the csv_file_path in app.py points to the correct file and that the file exists.
- No Images Downloaded: If no images are downloaded, check that your keywords are valid and that the search engine returns results for them.
- Timeout Errors: If you encounter timeout errors, you can increase the timeout value in modules/scraper.py:

      future.result(timeout=60)  # Adjust timeout as needed
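The timeout argument of future.result() limits how long the caller waits for a task. The sketch below shows that behaviour in isolation, assuming a concurrent.futures thread pool; slow_task and run_with_timeout are hypothetical names, and the surrounding code in modules/scraper.py may be structured differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_task(seconds):
    """Simulate a download that takes `seconds` to finish."""
    time.sleep(seconds)
    return "done"


def run_with_timeout(seconds, timeout):
    """Wait at most `timeout` seconds for slow_task and report the outcome."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_task, seconds)
        try:
            return future.result(timeout=timeout)
        except FutureTimeout:
            # The wait is abandoned, but the worker thread is not cancelled;
            # leaving the `with` block still waits for it to finish.
            return "timed out"
```

A larger timeout gives slow downloads more time to complete, at the cost of waiting longer on requests that will never finish.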
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
This project is licensed under the MIT License.
- Scrapy - An open-source and collaborative framework for extracting the data you need from websites.
- Pandas - An open-source data analysis and manipulation tool.
- Bing - For providing image search services.
Note: This application is intended for educational purposes. Please ensure you comply with all relevant laws and website terms of service when scraping data.