A Python-based data collection and processing pipeline for food products from Auchan Romania. The project scrapes product information, processes images for barcodes, and stores the data in a structured format in Supabase.
food_facts/
├── processors/
│ ├── barcodes/ # Barcode processing modules
│ ├── helpers/ # Helper functions and utilities
│ ├── scraper/ # Web scraping modules
│ └── supabase/ # Supabase integration
├── auchan/ # Scraped data storage
│ └── [category]/
│ ├── images/ # Product images
│ ├── product_links.csv
│ └── [category]_processed.csv
├── process_category.py # Main processing script
└── requirements.txt # Project dependencies
- Automated product data collection from Auchan Romania
- Image downloading and barcode extraction
- Specification and nutritional information mapping
- Data cleaning and validation
- Supabase integration for data storage
- Duplicate product detection
- Case-insensitive product name matching
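
The last two features can be illustrated with a minimal sketch. It is not the project's actual implementation: the `name` key and both function names are assumptions made for this example, and the real pipeline may compare other fields such as barcodes.

```python
from typing import Dict, Iterable, List, Tuple


def normalize_name(name: str) -> str:
    """Collapse runs of whitespace and fold case for comparison."""
    return " ".join(name.split()).casefold()


def split_duplicates(
    products: Iterable[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (unique, duplicates), keeping the first product seen per name.

    Assumes every product dict has a "name" key (hypothetical schema).
    """
    seen = set()
    unique: List[Dict[str, str]] = []
    duplicates: List[Dict[str, str]] = []
    for product in products:
        key = normalize_name(product["name"])
        if key in seen:
            duplicates.append(product)
        else:
            seen.add(key)
            unique.append(product)
    return unique, duplicates
```

Using `casefold()` rather than `lower()` also handles non-ASCII letters, which matters for Romanian product names.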
- Python 3.8+
- OpenCV
- pyzbar
- Supabase account and credentials
- Clone the repository:
    git clone [repository-url]
    cd food_facts
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables: create a .env file in the project root with:
    SUPABASE_URL=your_supabase_url
    SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
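
Before running the pipeline, it can help to confirm that both variables are actually set. The following is a rough, standard-library-only sketch: `parse_env` and `missing_keys` are hypothetical helpers written for this example, and the project itself may load the file differently (for instance with a dotenv package).

```python
from typing import Dict, List, Mapping

# The two variables named in the setup step above.
REQUIRED_KEYS = ("SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(settings: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]
```

A typical check would read the file with `pathlib.Path(".env").read_text()`, pass the text to `parse_env`, and stop with a clear message if `missing_keys` returns anything.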
To find a category ID:
- Open the desired category page on Auchan's website (e.g., https://www.auchan.ro/lactate-carne-mezeluri---peste/lactate/iaurt/c)
- Open your browser's Developer Tools (F12 or right-click -> Inspect)
- Go to the "Network" tab
- In the search/filter box, type: route
- Look for a request that contains: "domain":"store","id":"store.search#subcategory","params":{"id":
- The number after "id": is your category ID
Example:
    {
      "route": {
        "domain": "store",
        "id": "store.search#subcategory",
        "params": {
          "id": "2030300" // This is your category ID
        }
      }
    }

To process a category, use the following command:

    python process_category.py "CATEGORY_URL" --category-id "CATEGORY_ID"

Example:

    python process_category.py "https://www.auchan.ro/lactate-carne-mezeluri---peste/lactate/iaurt/c" --category-id "2030300"

The script performs the following steps:
- Collects product links from the specified category
- Scrapes detailed product information
- Saves data to CSV files
- Processes barcodes
- Maps specifications and nutritional information
- Saves the processed data
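
For orientation, the command-line interface shown above could be parsed roughly as follows. This is a sketch based only on the usage line in this README, not a copy of `process_category.py`. Treating `--category-id` as required is an assumption, and so are the function names.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(
        prog="process_category.py",
        description="Collect and process products from one Auchan category.",
    )
    # Positional argument: the full category page URL.
    parser.add_argument("category_url", help="Full URL of the category page")
    # Assumed to be mandatory, since every documented call passes it.
    parser.add_argument(
        "--category-id",
        required=True,
        help="Numeric category ID, e.g. 2030300",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:] when None)."""
    return build_parser().parse_args(argv)
```

`argparse` exposes the `--category-id` flag as the attribute `category_id` on the returned namespace.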
The script creates the following files in the auchan/{category}/{subcategory}/ directory:
- product_links.csv: Contains all product URLs
- {subcategory}.csv: Raw scraped data
- {subcategory}_processed.csv: Processed data with mapped specifications
- unmapped_columns.json: List of columns that couldn't be automatically mapped
- images/: Directory containing product images
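
After a run, one quick sanity check is to confirm that every output listed above exists. The helper below is a hypothetical sketch using only the file names from this README. It does not inspect file contents.

```python
from pathlib import Path
from typing import List


def expected_outputs(subcategory: str) -> List[str]:
    """Names the README says should appear in the output directory."""
    return [
        "product_links.csv",
        f"{subcategory}.csv",
        f"{subcategory}_processed.csv",
        "unmapped_columns.json",
        "images",
    ]


def missing_outputs(output_dir: str, subcategory: str) -> List[str]:
    """Return the expected files or directories not found in output_dir."""
    base = Path(output_dir)
    return [
        name
        for name in expected_outputs(subcategory)
        if not (base / name).exists()
    ]
```

An empty list from `missing_outputs` means all five outputs are present. Any returned names indicate which step likely did not finish.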
Here are some common category IDs for reference:
- Dairy Products (Lactate): 2030300
- Chocolate Bars (Batoane ciocolata): 3060100
- Still Water (Apa plata): 2010100
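
For categories not listed here, the ID can also be pulled out of a response body copied from the Network tab, instead of being read by eye. The sketch below assumes the response contains the `store.search#subcategory` fragment described earlier. The function name is hypothetical, and the pattern may need adjusting if Auchan changes its page data.

```python
import re
from typing import Optional

# Matches: "id":"store.search#subcategory","params":{"id":"2030300"
# Whitespace between tokens is optional, and the ID may be quoted or bare.
_CATEGORY_ID_RE = re.compile(
    r'"id"\s*:\s*"store\.search#subcategory"\s*,\s*'
    r'"params"\s*:\s*\{\s*"id"\s*:\s*"?(\d+)"?'
)


def extract_category_id(response_text: str) -> Optional[str]:
    """Return the category ID as a string, or None if the fragment is absent."""
    match = _CATEGORY_ID_RE.search(response_text)
    return match.group(1) if match else None
```

A regular expression is used instead of `json.loads` because the fragment is often embedded in a larger page payload that is not a standalone JSON document.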
If you encounter any issues:
- Make sure you have the correct category ID
- Check that the category URL is valid and accessible
- Ensure you have all required dependencies installed
- Check the network connection and Auchan's website availability
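
The first two checks can be partly automated before starting a long scrape. This is a minimal sketch under stated assumptions: the function name is hypothetical, the host check assumes category pages live on auchan.ro, and it validates only the shape of the inputs, not whether the page or ID actually exists.

```python
from typing import List
from urllib.parse import urlparse


def preflight_problems(category_url: str, category_id: str) -> List[str]:
    """Return human-readable problems with the inputs (empty list if none)."""
    problems: List[str] = []
    parsed = urlparse(category_url)
    host = (parsed.hostname or "").lower()
    if parsed.scheme not in ("http", "https") or not host:
        problems.append("category URL is not an absolute http(s) URL")
    elif host != "auchan.ro" and not host.endswith(".auchan.ro"):
        problems.append("category URL does not point to auchan.ro")
    # IDs in this README are plain digit strings such as 2030300.
    if not (category_id.isascii() and category_id.isdigit()):
        problems.append("category ID should contain digits only")
    return problems
```

Running this first gives a quick failure with a readable message, rather than an error partway through link collection.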