Amazon Product Scraper with Elasticsearch

A hackathon showcase demonstrating AI-powered web scraping with ScrapeGraphAI and advanced data analytics using Elasticsearch.

This project scrapes Amazon product data (PC components: CPUs, GPUs, RAM, etc.) and demonstrates powerful Elasticsearch features including full-text search, aggregations, filtering, and real-time analytics.

🎯 What This Demo Shows

ScrapeGraphAI SDK Features

  • AI-Powered Scraping: Extract structured product data from Amazon using cloud-based AI
  • Async/Parallel Execution: Scrape multiple pages concurrently for high performance (see the sketch after this list)
  • Robust Data Extraction: Automatically extract product names, prices, ratings, reviews, and availability
  • No Manual Parsing: Define what you want in plain English - AI handles the extraction
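
Below is a minimal sketch of the async/parallel pattern. It assumes scrapegraph-py exposes an AsyncClient with a smartscraper coroutine; treat the exact names, signatures, and the example URLs as assumptions and check the SDK docs.

import asyncio
from scrapegraph_py import AsyncClient  # assumed async variant of the SDK client

async def scrape_pages(urls, prompt):
    client = AsyncClient(api_key="your-api-key-here")
    # One smartscraper request per page, all awaited concurrently
    tasks = [client.smartscraper(website_url=url, user_prompt=prompt) for url in urls]
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(scrape_pages(
    [f"https://www.amazon.it/s?k=cpu&page={p}" for p in range(1, 4)],  # hypothetical search URLs
    "Extract name, price, rating, review count and Prime availability for every product.",
))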

Elasticsearch Capabilities

  • Full-Text Search: Find products using natural language queries (example after this list)
  • Advanced Filtering: Filter by price range, category, rating, Prime availability
  • Aggregations: Compute statistics (avg price, top brands, price distributions)
  • Real-Time Indexing: Store and query scraped data instantly
  • Complex Queries: Combine multiple filters and sorting criteria
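
As a taste of how these capabilities combine, here is a minimal sketch using the official elasticsearch Python client; the index and field names follow the schema described later in this README.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text search on the product name, filtered to a price ceiling
resp = es.search(
    index="marketplace_products",
    query={
        "bool": {
            "must": [{"match": {"name": "ryzen 9"}}],
            "filter": [{"range": {"price": {"lte": 500}}}],
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_source"]["price"])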

⚡ Quick Start

Prerequisites

  • Python 3.8+
  • Docker & Docker Compose (for Elasticsearch)
  • ScrapeGraphAI API Key - Already included in the script! (or set your own via SGAI_API_KEY environment variable)

1. Install Python Dependencies

pip install -r requirements.txt

Required packages:

  • scrapegraph-py - ScrapeGraphAI SDK for AI-powered scraping
  • elasticsearch - Elasticsearch Python client
  • pydantic - Data validation and modeling
  • python-dotenv - Environment configuration

2. Start Elasticsearch

# Start Elasticsearch and Kibana containers
docker compose up -d
# Or if using standalone docker-compose: docker-compose up -d

# Wait 30-60 seconds for startup, then verify:
curl http://localhost:9200/_cluster/health

Services:

  • Elasticsearch: http://localhost:9200
  • Kibana: http://localhost:5601

3. Run the Scraper

python amazon_keyboard_scraper.py

That's it! The script will:

  1. Scrape 80 pages of Amazon PC component data (8 categories × 10 pages)
  2. Extract product details using AI
  3. Store everything in Elasticsearch
  4. Run 7 analytical queries demonstrating Elasticsearch capabilities

📊 What Gets Scraped

The script scrapes 8 PC component categories from Amazon Italy:

  • 💻 CPUs (Processors)
  • 🎮 GPUs (Graphics Cards)
  • 🧠 RAM (Memory)
  • 🔌 Motherboards
  • 💾 SSDs (Storage)
  • ⚡ Power Supplies
  • 📦 PC Cases
  • ❄️ CPU Coolers

10 pages per category = 80 pages total (~800-1000 products)

Extracted Data Fields

For each product, the AI extracts the following fields (a data-model sketch follows the list):

  • Name: Product title
  • Price: Price in EUR
  • Rating: Star rating (0-5)
  • Review Count: Number of customer reviews
  • Prime: Amazon Prime availability
  • URL: Product link
  • ASIN: Amazon product ID
  • Category: Component type
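
These fields map naturally onto a Pydantic model. Here is a hypothetical sketch; the real Product model lives in src/scrapegraph_demo/models.py and may differ in detail.

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class Product(BaseModel):
    # Hypothetical shape mirroring the field list above
    name: str
    price: Optional[float] = None          # EUR
    rating: Optional[float] = None         # 0-5 stars
    review_count: Optional[int] = None
    prime: bool = False                    # Amazon Prime availability
    url: Optional[str] = None
    asin: Optional[str] = None             # Amazon product ID
    category: Optional[str] = None         # component type, e.g. "CPU"
    scraped_at: Optional[datetime] = None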

🔍 Elasticsearch Queries Demonstrated

After scraping, the script automatically runs 7 analytical queries to showcase Elasticsearch capabilities:

1. Top-Rated Products

Find highest-rated products using sorting and filtering:

{
  "query": {"range": {"rating": {"gte": 4.0}}},
  "sort": [{"rating": "desc"}, {"review_count": "desc"}]
}

2. Most-Reviewed Products

Identify popular products by review volume:

{
  "sort": [{"review_count": {"order": "desc"}}]
}

3. Price Distribution

Use aggregations to analyze price ranges:

{
  "aggs": {
    "price_stats": {"stats": {"field": "price"}},
    "price_histogram": {"histogram": {"field": "price", "interval": 25}}
  }
}

4. Prime vs Non-Prime Comparison

Compare product segments with term filters and aggregations:

{
  "query": {"term": {"availability": "Prime"}},
  "aggs": {
    "avg_price": {"avg": {"field": "price"}},
    "avg_rating": {"avg": {"field": "rating"}}
  }
}
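
With the elasticsearch Python client, the aggregation values can be read straight off the response. A sketch, using the index and field names from the schema below:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="marketplace_products",
    query={"term": {"availability": "Prime"}},
    aggs={
        "avg_price": {"avg": {"field": "price"}},
        "avg_rating": {"avg": {"field": "rating"}},
    },
    size=0,  # only the aggregations are needed, not the hits
)
print("Prime avg price:", resp["aggregations"]["avg_price"]["value"])
print("Prime avg rating:", resp["aggregations"]["avg_rating"]["value"])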

5. Products by Price Range

Categorize products using range queries:

{
  "query": {"range": {"price": {"gte": 30, "lt": 60}}}
}

6. Top Brands

Use terms aggregation with sub-aggregations:

{
  "aggs": {
    "brands": {
      "terms": {"field": "brand.keyword"},
      "aggs": {
        "avg_rating": {"avg": {"field": "rating"}},
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}

7. Best Value Products

Combine multiple range filters with bool queries:

{
  "query": {
    "bool": {
      "must": [
        {"range": {"rating": {"gte": 4.5}}},
        {"range": {"price": {"lt": 100}}}
      ]
    }
  }
}
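
Executed through the Python client, the matching hits can be iterated directly. A sketch:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="marketplace_products",
    query={
        "bool": {
            "must": [
                {"range": {"rating": {"gte": 4.5}}},
                {"range": {"price": {"lt": 100}}},
            ]
        }
    },
    sort=[{"rating": "desc"}],
)
for hit in resp["hits"]["hits"]:
    product = hit["_source"]
    print(product["name"], product["price"], product["rating"])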

📁 Project Structure

scrapegraph-elasticsearch-demo/
├── amazon_keyboard_scraper.py     # Main scraper script (async with parallel execution)
├── src/scrapegraph_demo/
│   ├── config.py                  # Configuration management
│   ├── models.py                  # Pydantic data models (Product)
│   ├── elasticsearch_client.py    # Elasticsearch operations
│   └── __init__.py                # Package exports
├── docker-compose.yml             # Elasticsearch + Kibana setup
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment configuration template
└── README.md                      # This file

Key Files

amazon_keyboard_scraper.py - Main script featuring:

  • Async/parallel page scraping for performance
  • ScrapeGraphAI API integration via scrapegraph-py SDK
  • Elasticsearch bulk indexing
  • 7 demonstration queries
  • Progress tracking and error handling

src/scrapegraph_demo/elasticsearch_client.py - Elasticsearch wrapper:

  • Index creation with optimized mapping
  • Bulk indexing operations (sketched below)
  • Search methods with filters
  • Aggregation queries
  • Statistics calculation
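
A sketch of the bulk-indexing step, using elasticsearch.helpers.bulk and the hypothetical Product model sketched earlier; the real wrapper may be structured differently.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_products(products):
    # One bulk action per product; using the ASIN as _id makes re-runs idempotent
    actions = (
        {
            "_index": "marketplace_products",
            "_id": product.asin,
            "_source": product.model_dump(mode="json"),  # pydantic v2 serialization
        }
        for product in products
    )
    success, errors = helpers.bulk(es, actions, raise_on_error=False)
    return success, errors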

src/scrapegraph_demo/models.py - Data models:

  • Product - Pydantic model for type-safe product data
  • Validation and serialization
  • Elasticsearch document conversion

🎨 Visualizing Data with Kibana

After running the scraper, explore your data visually:

  1. Open Kibana: http://localhost:5601
  2. Create Index Pattern:
    • Go to Management → Stack Management → Index Patterns
    • Create pattern: marketplace_products
    • Select timestamp field: scraped_at
  3. Explore in Discover: Browse all scraped products
  4. Create Visualizations:
    • Pie Chart: Product distribution by brand
    • Histogram: Price distribution
    • Metric Cards: Average rating, total products
    • Data Table: Top products by reviews
    • Bar Chart: Products per category

Example Kibana Queries

Find all GPUs under €500:

category: "Gpu" AND price: [* TO 500]

Prime products with 4+ star ratings:

availability: "Prime" AND rating: [4 TO *]

⚙️ Configuration

API Key

The script includes a default API key for convenience. To use your own:

export SGAI_API_KEY=your-api-key-here
# Then run: python amazon_keyboard_scraper.py

Get your API key at scrapegraphai.com

Environment Variables (Optional)

Create a .env file for custom configuration:

# ScrapeGraphAI
SGAI_API_KEY=your-api-key-here

# Elasticsearch (defaults shown)
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
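
A sketch of how src/scrapegraph_demo/config.py might consume these variables with python-dotenv (assumed structure; the real module may differ):

import os

from dotenv import load_dotenv

load_dotenv()  # pick up a .env file from the project root, if present

SGAI_API_KEY = os.getenv("SGAI_API_KEY", "your-api-key-here")
ELASTICSEARCH_HOST = os.getenv("ELASTICSEARCH_HOST", "localhost")
ELASTICSEARCH_PORT = int(os.getenv("ELASTICSEARCH_PORT", "9200"))
ELASTICSEARCH_SCHEME = os.getenv("ELASTICSEARCH_SCHEME", "http")
ELASTICSEARCH_URL = f"{ELASTICSEARCH_SCHEME}://{ELASTICSEARCH_HOST}:{ELASTICSEARCH_PORT}"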

📊 Elasticsearch Index Schema

Index: marketplace_products

| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| product_id | keyword | Unique identifier (ASIN) | B08N5WRWNW |
| name | text + keyword | Product name (searchable) | AMD Ryzen 9 5900X |
| price | float | Price for range queries | 449.99 |
| currency | keyword | Price currency | EUR |
| marketplace | keyword | Source marketplace | Amazon IT |
| category | keyword | Component type | CPU |
| brand | text + keyword | Brand (searchable + aggregatable) | AMD |
| rating | float | Star rating | 4.8 |
| review_count | integer | Number of reviews | 3521 |
| availability | keyword | Prime or Standard | Prime |
| url | keyword | Product URL | https://amazon.it/... |
| specifications | object | Additional metadata | {prime_eligible: true} |
| scraped_at | date | Timestamp | 2024-01-15T10:30:00Z |

Optimized for:

  • Full-text search on name and brand
  • Exact matching on category, marketplace, availability
  • Range queries on price, rating
  • Aggregations on brand.keyword, category
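
A mapping along these lines can be created with the Python client. This is a sketch derived from the table above; the actual mapping in elasticsearch_client.py may differ.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Text fields get a .keyword sub-field so they are both searchable and aggregatable
es.indices.create(
    index="marketplace_products",
    mappings={
        "properties": {
            "product_id": {"type": "keyword"},
            "name": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "price": {"type": "float"},
            "currency": {"type": "keyword"},
            "marketplace": {"type": "keyword"},
            "category": {"type": "keyword"},
            "brand": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "rating": {"type": "float"},
            "review_count": {"type": "integer"},
            "availability": {"type": "keyword"},
            "url": {"type": "keyword"},
            "specifications": {"type": "object"},
            "scraped_at": {"type": "date"},
        }
    },
)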

⚡ Performance

Async/Parallel Execution:

  • Pages within each component scrape in parallel
  • Components process sequentially
  • ~80 pages typically complete in 2-3 minutes

Expected Results:

  • 800-1000 total products
  • ~100-125 products per component category
  • Some pages may fail (network issues, rate limiting) - this is normal

💡 Hackathon Ideas & Extensions

This project demonstrates the foundation for many applications:

🛒 E-Commerce Applications

  • Price Monitoring: Track price changes over time
  • Stock Alerts: Notify when products become available
  • Price Comparison: Find the best deals across categories
  • Market Analysis: Identify pricing trends and patterns

🔍 Search Enhancements

  • Recommendation Engine: "Find similar products"
  • Smart Filters: Multi-dimensional product filtering
  • Personalized Results: User preference-based ranking

📊 Analytics Dashboards

  • Brand Analysis: Market share and pricing strategies
  • Category Insights: Popular products per category
  • Prime Impact: How Prime affects pricing and ratings
  • Review Correlation: Relationship between reviews and ratings

🤖 ML Integration

  • Price Prediction: Forecast future price trends
  • Sentiment Analysis: Analyze review text (extend scraping)
  • Product Clustering: Group similar products automatically
  • Anomaly Detection: Find unusual pricing or ratings

🐛 Troubleshooting

Elasticsearch Won't Start

# Check if containers are running
docker compose ps

# View logs
docker compose logs elasticsearch

# Restart services
docker compose down && docker compose up -d

Note: Use docker-compose (with hyphen) if you have the standalone version installed.

Script Errors

Import errors: Make sure you installed dependencies:

pip install -r requirements.txt

API errors: The script will log errors but continue scraping remaining pages. Check:

  • Internet connection
  • API rate limits (wait a few minutes and retry)
  • The logs for specific error messages

No products found: Some pages may be empty or fail to scrape. This is normal - the script handles failures gracefully and continues.

Slow Performance

  • Network speed affects scraping time
  • Reduce PAGES_PER_COMPONENT in the script for faster testing
  • Some API rate limiting is expected

📄 License

This project is provided for demonstration and educational purposes.


Built for hackathons with ❤️
Showcasing ScrapeGraphAI + Elasticsearch
