Amazon Product Scraper with Elasticsearch

A hackathon showcase demonstrating AI-powered web scraping with ScrapeGraphAI and advanced data analytics using Elasticsearch.

This project scrapes Amazon product data (PC components: CPUs, GPUs, RAM, etc.) and demonstrates powerful Elasticsearch features including full-text search, aggregations, filtering, and real-time analytics.

🎯 What This Demo Shows

ScrapeGraphAI SDK Features

  • AI-Powered Scraping: Extract structured product data from Amazon using cloud-based AI
  • Async/Parallel Execution: Scrape multiple pages concurrently for high performance (see the sketch after this list)
  • Robust Data Extraction: Automatically extract product names, prices, ratings, reviews, and availability
  • No Manual Parsing: Define what you want in plain English - AI handles the extraction
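
Below is a minimal sketch of the async/parallel pattern. It assumes scrapegraph-py exposes an AsyncClient with a smartscraper coroutine; treat the exact names, signatures, and the example URLs as assumptions and check the SDK docs.

import asyncio
from scrapegraph_py import AsyncClient  # assumed async variant of the SDK client

async def scrape_pages(urls, prompt):
    client = AsyncClient(api_key="your-api-key-here")
    # One smartscraper request per page, all awaited concurrently
    tasks = [client.smartscraper(website_url=url, user_prompt=prompt) for url in urls]
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(scrape_pages(
    [f"https://www.amazon.it/s?k=cpu&page={p}" for p in range(1, 4)],  # hypothetical search URLs
    "Extract name, price, rating, review count and Prime availability for every product.",
))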

Elasticsearch Capabilities

  • Full-Text Search: Find products using natural language queries (example after this list)
  • Advanced Filtering: Filter by price range, category, rating, Prime availability
  • Aggregations: Compute statistics (avg price, top brands, price distributions)
  • Real-Time Indexing: Store and query scraped data instantly
  • Complex Queries: Combine multiple filters and sorting criteria
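
As a taste of how these capabilities combine, here is a minimal sketch using the official elasticsearch Python client; the index and field names follow the schema described later in this README.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text search on the product name, filtered to a price ceiling
resp = es.search(
    index="marketplace_products",
    query={
        "bool": {
            "must": [{"match": {"name": "ryzen 9"}}],
            "filter": [{"range": {"price": {"lte": 500}}}],
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_source"]["price"])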

⚡ Quick Start

Prerequisites

  • Python 3.8+
  • Docker & Docker Compose (for Elasticsearch)
  • ScrapeGraphAI API Key - Already included in the script! (or set your own via SGAI_API_KEY environment variable)

1. Install Python Dependencies

pip install -r requirements.txt

Required packages:

  • scrapegraph-py - ScrapeGraphAI SDK for AI-powered scraping
  • elasticsearch - Elasticsearch Python client
  • pydantic - Data validation and modeling
  • python-dotenv - Environment configuration

2. Start Elasticsearch

# Start Elasticsearch and Kibana containers
docker compose up -d
# Or if using standalone docker-compose: docker-compose up -d

# Wait 30-60 seconds for startup, then verify:
curl http://localhost:9200/_cluster/health

Services:

  • Elasticsearch: http://localhost:9200
  • Kibana: http://localhost:5601

3. Run the Scraper

python amazon_keyboard_scraper.py

That's it! The script will:

  1. Scrape 80 pages of Amazon PC component data (8 categories × 10 pages)
  2. Extract product details using AI
  3. Store everything in Elasticsearch
  4. Run 7 analytical queries demonstrating Elasticsearch capabilities

📊 What Gets Scraped

The script scrapes 8 PC component categories from Amazon Italy:

  • 💻 CPUs (Processors)
  • 🎮 GPUs (Graphics Cards)
  • 🧠 RAM (Memory)
  • 🔌 Motherboards
  • 💾 SSDs (Storage)
  • ⚡ Power Supplies
  • 📦 PC Cases
  • ❄️ CPU Coolers

10 pages per category = 80 pages total (~800-1000 products)

Extracted Data Fields

For each product, the AI extracts the following fields (a data-model sketch follows the list):

  • Name: Product title
  • Price: Price in EUR
  • Rating: Star rating (0-5)
  • Review Count: Number of customer reviews
  • Prime: Amazon Prime availability
  • URL: Product link
  • ASIN: Amazon product ID
  • Category: Component type
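
These fields map naturally onto a Pydantic model. Here is a hypothetical sketch; the real Product model lives in src/scrapegraph_demo/models.py and may differ in detail.

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class Product(BaseModel):
    # Hypothetical shape mirroring the field list above
    name: str
    price: Optional[float] = None          # EUR
    rating: Optional[float] = None         # 0-5 stars
    review_count: Optional[int] = None
    prime: bool = False                    # Amazon Prime availability
    url: Optional[str] = None
    asin: Optional[str] = None             # Amazon product ID
    category: Optional[str] = None         # component type, e.g. "CPU"
    scraped_at: Optional[datetime] = None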

🔍 Elasticsearch Queries Demonstrated

After scraping, the script automatically runs 7 analytical queries to showcase Elasticsearch capabilities:

1. Top-Rated Products

Find highest-rated products using sorting and filtering:

{
  "query": {"range": {"rating": {"gte": 4.0}}},
  "sort": [{"rating": "desc"}, {"review_count": "desc"}]
}

2. Most-Reviewed Products

Identify popular products by review volume:

{
  "sort": [{"review_count": {"order": "desc"}}]
}

3. Price Distribution

Use aggregations to analyze price ranges:

{
  "aggs": {
    "price_stats": {"stats": {"field": "price"}},
    "price_histogram": {"histogram": {"field": "price", "interval": 25}}
  }
}

4. Prime vs Non-Prime Comparison

Compare product segments with term filters and aggregations:

{
  "query": {"term": {"availability": "Prime"}},
  "aggs": {
    "avg_price": {"avg": {"field": "price"}},
    "avg_rating": {"avg": {"field": "rating"}}
  }
}
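
With the elasticsearch Python client, the aggregation values can be read straight off the response. A sketch, using the index and field names from the schema below:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="marketplace_products",
    query={"term": {"availability": "Prime"}},
    aggs={
        "avg_price": {"avg": {"field": "price"}},
        "avg_rating": {"avg": {"field": "rating"}},
    },
    size=0,  # only the aggregations are needed, not the hits
)
print("Prime avg price:", resp["aggregations"]["avg_price"]["value"])
print("Prime avg rating:", resp["aggregations"]["avg_rating"]["value"])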

5. Products by Price Range

Categorize products using range queries:

{
  "query": {"range": {"price": {"gte": 30, "lt": 60}}}
}

6. Top Brands

Use terms aggregation with sub-aggregations:

{
  "aggs": {
    "brands": {
      "terms": {"field": "brand.keyword"},
      "aggs": {
        "avg_rating": {"avg": {"field": "rating"}},
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}

7. Best Value Products

Combine multiple range filters with bool queries:

{
  "query": {
    "bool": {
      "must": [
        {"range": {"rating": {"gte": 4.5}}},
        {"range": {"price": {"lt": 100}}}
      ]
    }
  }
}
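
Executed through the Python client, the matching hits can be iterated directly. A sketch:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="marketplace_products",
    query={
        "bool": {
            "must": [
                {"range": {"rating": {"gte": 4.5}}},
                {"range": {"price": {"lt": 100}}},
            ]
        }
    },
    sort=[{"rating": "desc"}],
)
for hit in resp["hits"]["hits"]:
    product = hit["_source"]
    print(product["name"], product["price"], product["rating"])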

📁 Project Structure

scrapegraph-elasticsearch-demo/
├── amazon_keyboard_scraper.py     # Main scraper script (async with parallel execution)
├── src/scrapegraph_demo/
│   ├── config.py                  # Configuration management
│   ├── models.py                  # Pydantic data models (Product)
│   ├── elasticsearch_client.py    # Elasticsearch operations
│   └── __init__.py                # Package exports
├── docker-compose.yml             # Elasticsearch + Kibana setup
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment configuration template
└── README.md                      # This file

Key Files

amazon_keyboard_scraper.py - Main script featuring:

  • Async/parallel page scraping for performance
  • ScrapeGraphAI API integration via scrapegraph-py SDK
  • Elasticsearch bulk indexing
  • 7 demonstration queries
  • Progress tracking and error handling

src/scrapegraph_demo/elasticsearch_client.py - Elasticsearch wrapper:

  • Index creation with optimized mapping
  • Bulk indexing operations (sketched below)
  • Search methods with filters
  • Aggregation queries
  • Statistics calculation
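
A sketch of the bulk-indexing step, using elasticsearch.helpers.bulk and the hypothetical Product model sketched earlier; the real wrapper may be structured differently.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_products(products):
    # One bulk action per product; using the ASIN as _id makes re-runs idempotent
    actions = (
        {
            "_index": "marketplace_products",
            "_id": product.asin,
            "_source": product.model_dump(mode="json"),  # pydantic v2 serialization
        }
        for product in products
    )
    success, errors = helpers.bulk(es, actions, raise_on_error=False)
    return success, errors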

src/scrapegraph_demo/models.py - Data models:

  • Product - Pydantic model for type-safe product data
  • Validation and serialization
  • Elasticsearch document conversion

🎨 Visualizing Data with Kibana

After running the scraper, explore your data visually:

  1. Open Kibana: http://localhost:5601
  2. Create Index Pattern:
    • Go to Management → Stack Management → Index Patterns
    • Create pattern: marketplace_products
    • Select timestamp field: scraped_at
  3. Explore in Discover: Browse all scraped products
  4. Create Visualizations:
    • Pie Chart: Product distribution by brand
    • Histogram: Price distribution
    • Metric Cards: Average rating, total products
    • Data Table: Top products by reviews
    • Bar Chart: Products per category

Example Kibana Queries

Find all GPUs under €500:

category: "Gpu" AND price: [* TO 500]

Prime products with 4+ star ratings:

availability: "Prime" AND rating: [4 TO *]

⚙️ Configuration

API Key

The script includes a default API key for convenience. To use your own:

export SGAI_API_KEY=your-api-key-here
# Then run: python amazon_keyboard_scraper.py

Get your API key at scrapegraphai.com

Environment Variables (Optional)

Create a .env file for custom configuration:

# ScrapeGraphAI
SGAI_API_KEY=your-api-key-here

# Elasticsearch (defaults shown)
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
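
A sketch of how src/scrapegraph_demo/config.py might consume these variables with python-dotenv (assumed structure; the real module may differ):

import os

from dotenv import load_dotenv

load_dotenv()  # pick up a .env file from the project root, if present

SGAI_API_KEY = os.getenv("SGAI_API_KEY", "your-api-key-here")
ELASTICSEARCH_HOST = os.getenv("ELASTICSEARCH_HOST", "localhost")
ELASTICSEARCH_PORT = int(os.getenv("ELASTICSEARCH_PORT", "9200"))
ELASTICSEARCH_SCHEME = os.getenv("ELASTICSEARCH_SCHEME", "http")
ELASTICSEARCH_URL = f"{ELASTICSEARCH_SCHEME}://{ELASTICSEARCH_HOST}:{ELASTICSEARCH_PORT}"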

📊 Elasticsearch Index Schema

Index: marketplace_products

| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| product_id | keyword | Unique identifier (ASIN) | B08N5WRWNW |
| name | text + keyword | Product name (searchable) | AMD Ryzen 9 5900X |
| price | float | Price for range queries | 449.99 |
| currency | keyword | Price currency | EUR |
| marketplace | keyword | Source marketplace | Amazon IT |
| category | keyword | Component type | CPU |
| brand | text + keyword | Brand (searchable + aggregatable) | AMD |
| rating | float | Star rating | 4.8 |
| review_count | integer | Number of reviews | 3521 |
| availability | keyword | Prime or Standard | Prime |
| url | keyword | Product URL | https://amazon.it/... |
| specifications | object | Additional metadata | {prime_eligible: true} |
| scraped_at | date | Timestamp | 2024-01-15T10:30:00Z |

Optimized for:

  • Full-text search on name and brand
  • Exact matching on category, marketplace, availability
  • Range queries on price, rating
  • Aggregations on brand.keyword, category
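
A mapping along these lines can be created with the Python client. This is a sketch derived from the table above; the actual mapping in elasticsearch_client.py may differ.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Text fields get a .keyword sub-field so they are both searchable and aggregatable
es.indices.create(
    index="marketplace_products",
    mappings={
        "properties": {
            "product_id": {"type": "keyword"},
            "name": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "price": {"type": "float"},
            "currency": {"type": "keyword"},
            "marketplace": {"type": "keyword"},
            "category": {"type": "keyword"},
            "brand": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "rating": {"type": "float"},
            "review_count": {"type": "integer"},
            "availability": {"type": "keyword"},
            "url": {"type": "keyword"},
            "specifications": {"type": "object"},
            "scraped_at": {"type": "date"},
        }
    },
)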

⚡ Performance

Async/Parallel Execution:

  • Pages within each component scrape in parallel
  • Components process sequentially
  • ~80 pages typically complete in 2-3 minutes

Expected Results:

  • 800-1000 total products
  • ~100-125 products per component category
  • Some pages may fail (network issues, rate limiting) - this is normal

💡 Hackathon Ideas & Extensions

This project demonstrates the foundation for many applications:

🛒 E-Commerce Applications

  • Price Monitoring: Track price changes over time
  • Stock Alerts: Notify when products become available
  • Price Comparison: Find the best deals across categories
  • Market Analysis: Identify pricing trends and patterns

🔍 Search Enhancements

  • Recommendation Engine: "Find similar products"
  • Smart Filters: Multi-dimensional product filtering
  • Personalized Results: User preference-based ranking

📊 Analytics Dashboards

  • Brand Analysis: Market share and pricing strategies
  • Category Insights: Popular products per category
  • Prime Impact: How Prime affects pricing and ratings
  • Review Correlation: Relationship between reviews and ratings

🤖 ML Integration

  • Price Prediction: Forecast future price trends
  • Sentiment Analysis: Analyze review text (extend scraping)
  • Product Clustering: Group similar products automatically
  • Anomaly Detection: Find unusual pricing or ratings

🐛 Troubleshooting

Elasticsearch Won't Start

# Check if containers are running
docker compose ps

# View logs
docker compose logs elasticsearch

# Restart services
docker compose down && docker compose up -d

Note: Use docker-compose (with hyphen) if you have the standalone version installed.

Script Errors

Import errors: Make sure you installed dependencies:

pip install -r requirements.txt

API errors: The script will log errors but continue scraping remaining pages. Check:

  • Internet connection
  • API rate limits (wait a few minutes and retry)
  • The logs for specific error messages

No products found: Some pages may be empty or fail to scrape. This is normal - the script handles failures gracefully and continues.

Slow Performance

  • Network speed affects scraping time
  • Reduce PAGES_PER_COMPONENT in the script for faster testing
  • Some API rate limiting is expected

📄 License

This project is provided for demonstration and educational purposes.


Built for hackathons with ❤️
Showcasing ScrapeGraphAI + Elasticsearch
