Food Facts Data Collection

A Python pipeline that collects and processes food product data from Auchan Romania. It scrapes product information, extracts barcodes from product images, and stores the structured results in Supabase.

Project Structure

food_facts/
├── processors/
│   ├── barcodes/           # Barcode processing modules
│   ├── helpers/            # Helper functions and utilities
│   ├── scraper/            # Web scraping modules
│   └── supabase/           # Supabase integration
├── auchan/                 # Scraped data storage
│   └── [category]/
│       ├── images/         # Product images
│       ├── product_links.csv
│       └── [category]_processed.csv
├── process_category.py     # Main processing script
└── requirements.txt        # Project dependencies

Features

Automated product data collection from Auchan Romania
Image downloading and barcode extraction
Specification and nutritional information mapping
Data cleaning and validation
Supabase integration for data storage
Duplicate product detection
Case-insensitive product name matching
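
As an illustration of the last two features, here is a minimal sketch of duplicate detection built on case-insensitive name matching. The function names `normalize_name` and `find_duplicates` are hypothetical and are not taken from the project's code; the actual matching rules may differ.

```python
from typing import Dict, Iterable, List


def normalize_name(name: str) -> str:
    """Collapse runs of whitespace and case-fold the name for comparison."""
    return " ".join(name.split()).casefold()


def find_duplicates(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group product names that are identical after normalization.

    Only groups with more than one member are returned.
    """
    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(normalize_name(name), []).append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

`casefold()` is used instead of `lower()` because it handles more non-ASCII casing rules, which can matter for Romanian product names.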

Prerequisites

Python 3.8+
OpenCV
pyzbar
Supabase account and credentials

Installation

Clone the repository:

git clone [repository-url]
cd food_facts

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Set up environment variables: Create a .env file in the project root with:

SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
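
The following sketch shows one way to parse the two settings above and fail early when either is missing. It uses only the standard library; `parse_env_text` and `require_settings` are hypothetical helpers, and the project itself may load the file differently (for example through a dotenv package).

```python
from typing import Dict, Iterable

REQUIRED_SETTINGS = ("SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_settings(
    values: Dict[str, str], names: Iterable[str] = REQUIRED_SETTINGS
) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    names = tuple(names)
    missing = [name for name in names if not values.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: values[name] for name in names}
```

The file contents could be supplied with `pathlib.Path(".env").read_text(encoding="utf-8")`.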

Usage

Finding Category ID

To find a category ID:

Open the desired category page on Auchan's website (e.g., https://www.auchan.ro/lactate-carne-mezeluri---peste/lactate/iaurt/c)
Open your browser's Developer Tools (F12 or right-click -> Inspect)
Go to the "Network" tab
In the search/filter box, type: route
Look for a request that contains: "domain":"store","id":"store.search#subcategory","params":{"id":
The number after "id": is your category ID

Example:

{
  "route": {
    "domain": "store",
    "id": "store.search#subcategory",
    "params": {
      "id": "2030300"  // This is your category ID
    }
  }
}
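
If you copy the response body out of the Network tab, the ID can also be pulled out programmatically. This is a minimal sketch that assumes the response contains the fragment shown above; `extract_category_id` is a hypothetical helper, not part of the project.

```python
import re
from typing import Optional

# Matches "id":"store.search#subcategory","params":{"id":"<digits>"}
# and tolerates whitespace and an unquoted numeric ID.
CATEGORY_ID_PATTERN = re.compile(
    r'"id"\s*:\s*"store\.search#subcategory"\s*,\s*'
    r'"params"\s*:\s*\{\s*"id"\s*:\s*"?(\d+)"?'
)


def extract_category_id(response_text: str) -> Optional[str]:
    """Return the category ID as a string, or None if the fragment is absent."""
    match = CATEGORY_ID_PATTERN.search(response_text)
    return match.group(1) if match else None
```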

Running the Script

To process a category, use the following command:

python process_category.py "CATEGORY_URL" --category-id "CATEGORY_ID"

Example:

python process_category.py "https://www.auchan.ro/lactate-carne-mezeluri---peste/lactate/iaurt/c" --category-id "2030300"
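
For reference, a command line of this shape can be parsed with `argparse` as sketched below. This is an assumption about how such an interface could be declared, not a copy of `process_category.py`; in particular, treating `--category-id` as required is a guess.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare one positional URL and a --category-id option."""
    parser = argparse.ArgumentParser(description="Process one Auchan category.")
    parser.add_argument("category_url", help="URL of the category page")
    parser.add_argument(
        "--category-id",
        required=True,  # assumption: the script cannot run without it
        help="Numeric category ID found in the browser's Network tab",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:] when None)."""
    return build_parser().parse_args(argv)
```

Note that `argparse` exposes `--category-id` as the attribute `category_id`.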

What the Script Does

The script performs the following steps:

Collects product links from the specified category
Scrapes detailed product information
Saves data to CSV files
Processes barcodes
Maps specifications and nutritional information
Saves the processed data
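
One validation that fits naturally into the barcode step is a check-digit test. The sketch below assumes the decoded barcodes are EAN-13, which this README does not state; `ean13_check_digit` and `is_valid_ean13` are hypothetical helpers.

```python
def ean13_check_digit(first_twelve: str) -> int:
    """Compute the EAN-13 check digit from the first 12 digits.

    Digits are weighted 1, 3, 1, 3, ... from the left; the check digit
    brings the weighted sum up to a multiple of 10.
    """
    if len(first_twelve) != 12 or not (first_twelve.isascii() and first_twelve.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    total = sum(
        int(ch) * (3 if index % 2 else 1)
        for index, ch in enumerate(first_twelve)
    )
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """Return True if code is 13 digits long and its check digit matches."""
    code = code.strip()
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])
```

A failed check usually means the barcode was misread from the image, so such rows could be flagged for another pass instead of being stored.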

Output

The script creates the following files in the auchan/{category}/{subcategory}/ directory:

product_links.csv: Contains all product URLs
{subcategory}.csv: Raw scraped data
{subcategory}_processed.csv: Processed data with mapped specifications
unmapped_columns.json: List of columns that couldn't be automatically mapped
images/: Directory containing product images
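
The layout above can be expressed with `pathlib`, which is convenient when checking that a run produced every expected file. `output_paths` is a hypothetical helper, and the `auchan` root is taken from the directory named in this section.

```python
from pathlib import Path
from typing import Dict


def output_paths(category: str, subcategory: str, root: str = "auchan") -> Dict[str, Path]:
    """Build the expected output locations for one subcategory."""
    base = Path(root) / category / subcategory
    return {
        "product_links": base / "product_links.csv",
        "raw": base / f"{subcategory}.csv",
        "processed": base / f"{subcategory}_processed.csv",
        "unmapped_columns": base / "unmapped_columns.json",
        "images": base / "images",
    }
```

For example, `[p for p in output_paths("lactate", "iaurt").values() if not p.exists()]` lists whatever is still missing after a run.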

Common Category IDs

Here are some common category IDs for reference:

Dairy Products (Lactate): 2030300
Chocolate Bars (Batoane ciocolata): 3060100
Still Water (Apa plata): 2010100

Troubleshooting

If you encounter any issues:

Make sure you have the correct category ID
Check that the category URL is valid and accessible
Ensure you have all required dependencies installed
Check the network connection and Auchan's website availability
