A hackathon showcase demonstrating AI-powered web scraping with ScrapeGraphAI and advanced data analytics using Elasticsearch.
This project scrapes Amazon product data (PC components: CPUs, GPUs, RAM, etc.) and demonstrates powerful Elasticsearch features including full-text search, aggregations, filtering, and real-time analytics.
- AI-Powered Scraping: Extract structured product data from Amazon using cloud-based AI
- Async/Parallel Execution: Scrape multiple pages concurrently for high performance
- Robust Data Extraction: Automatically extract product names, prices, ratings, reviews, and availability
- No Manual Parsing: Define what you want in plain English - AI handles the extraction
- Full-Text Search: Find products using natural language queries
- Advanced Filtering: Filter by price range, category, rating, Prime availability
- Aggregations: Compute statistics (avg price, top brands, price distributions)
- Real-Time Indexing: Store and query scraped data instantly
- Complex Queries: Combine multiple filters and sorting criteria
- Python 3.8+
- Docker & Docker Compose (for Elasticsearch)
- ScrapeGraphAI API Key - Already included in the script! (or set your own via the `SGAI_API_KEY` environment variable)
```bash
pip install -r requirements.txt
```

Required packages:
- `scrapegraph-py` - ScrapeGraphAI SDK for AI-powered scraping
- `elasticsearch` - Elasticsearch Python client
- `pydantic` - Data validation and modeling
- `python-dotenv` - Environment configuration
```bash
# Start Elasticsearch and Kibana containers
docker compose up -d
# Or if using standalone docker-compose: docker-compose up -d

# Wait 30-60 seconds for startup, then verify:
curl http://localhost:9200/_cluster/health
```

Services:
- Elasticsearch: http://localhost:9200
- Kibana (visualization): http://localhost:5601
```bash
python amazon_keyboard_scraper.py
```

That's it! The script will:
- Scrape 80 pages of Amazon PC component data (8 categories × 10 pages)
- Extract product details using AI
- Store everything in Elasticsearch
- Run 7 analytical queries demonstrating Elasticsearch capabilities
The script scrapes 8 PC component categories from Amazon Italy:
- 💻 CPUs (Processors)
- 🎮 GPUs (Graphics Cards)
- 🧠 RAM (Memory)
- 🔌 Motherboards
- 💾 SSDs (Storage)
- ⚡ Power Supplies
- 📦 PC Cases
- ❄️ CPU Coolers
10 pages per category = 80 pages total (~800-1000 products)
For each product, the AI extracts:
- Name: Product title
- Price: Price in EUR
- Rating: Star rating (0-5)
- Review Count: Number of customer reviews
- Prime: Amazon Prime availability
- URL: Product link
- ASIN: Amazon product ID
- Category: Component type
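The extracted fields map naturally onto the `Product` Pydantic model in `src/scrapegraph_demo/models.py`. The following is a minimal sketch of what that model might look like, assuming Pydantic v2 — field names mirror the list above, but defaults and the helper method are illustrative, not the project's actual code:

```python
from datetime import datetime, timezone
from typing import Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    """Type-safe container for one scraped Amazon listing (sketch)."""

    product_id: str                       # ASIN
    name: str
    price: Optional[float] = None         # EUR; None when the page omits it
    currency: str = "EUR"
    marketplace: str = "Amazon IT"
    category: str
    brand: Optional[str] = None
    rating: Optional[float] = Field(default=None, ge=0, le=5)
    review_count: int = 0
    availability: str = "Standard"        # "Prime" or "Standard"
    url: Optional[str] = None
    scraped_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    def to_document(self) -> dict:
        """Serialize to an Elasticsearch-ready dict (ISO-8601 timestamp)."""
        doc = self.model_dump()
        doc["scraped_at"] = self.scraped_at.isoformat()
        return doc
```

Validation comes for free: an out-of-range `rating` or a non-numeric `price` raises a `ValidationError` instead of silently polluting the index.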
After scraping, the script automatically runs 7 analytical queries to showcase Elasticsearch capabilities:
Find highest-rated products using sorting and filtering:

```json
{
  "query": {"range": {"rating": {"gte": 4.0}}},
  "sort": [{"rating": "desc"}, {"review_count": "desc"}]
}
```

Identify popular products by review volume:

```json
{
  "sort": [{"review_count": {"order": "desc"}}]
}
```

Use aggregations to analyze price ranges:

```json
{
  "aggs": {
    "price_stats": {"stats": {"field": "price"}},
    "price_histogram": {"histogram": {"field": "price", "interval": 25}}
  }
}
```

Compare product segments with term filters and aggregations:

```json
{
  "query": {"term": {"availability": "Prime"}},
  "aggs": {
    "avg_price": {"avg": {"field": "price"}},
    "avg_rating": {"avg": {"field": "rating"}}
  }
}
```

Categorize products using range queries:

```json
{
  "query": {"range": {"price": {"gte": 30, "lt": 60}}}
}
```

Use terms aggregation with sub-aggregations:

```json
{
  "aggs": {
    "brands": {
      "terms": {"field": "brand.keyword"},
      "aggs": {
        "avg_rating": {"avg": {"field": "rating"}},
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}
```

Combine multiple range filters with bool queries:

```json
{
  "query": {
    "bool": {
      "must": [
        {"range": {"rating": {"gte": 4.5}}},
        {"range": {"price": {"lt": 100}}}
      ]
    }
  }
}
```

```
scrapegraph-elasticsearch-demo/
├── amazon_keyboard_scraper.py   # Main scraper script (async with parallel execution)
├── src/scrapegraph_demo/
│   ├── config.py                # Configuration management
│   ├── models.py                # Pydantic data models (Product)
│   ├── elasticsearch_client.py  # Elasticsearch operations
│   └── __init__.py              # Package exports
├── docker-compose.yml           # Elasticsearch + Kibana setup
├── requirements.txt             # Python dependencies
├── .env.example                 # Environment configuration template
└── README.md                    # This file
```
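The demonstration queries can be issued from Python with the official `elasticsearch` client. Here is a minimal sketch of the last one (the "best value" bool query); the function names are illustrative, not the script's actual API, and `es` is assumed to be an `elasticsearch.Elasticsearch` instance such as `Elasticsearch("http://localhost:9200")`:

```python
INDEX = "marketplace_products"


def best_value_query(min_rating: float = 4.5, max_price: float = 100.0) -> dict:
    """Bool query combining a rating floor and a price ceiling."""
    return {
        "bool": {
            "must": [
                {"range": {"rating": {"gte": min_rating}}},
                {"range": {"price": {"lt": max_price}}},
            ]
        }
    }


def best_value_products(es, size: int = 10) -> list:
    """Run the query and return just the product documents."""
    resp = es.search(
        index=INDEX,
        query=best_value_query(),
        sort=[{"rating": "desc"}],
        size=size,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```

The other six queries follow the same pattern: build the query/aggregation dict, pass it to `es.search`, and read results from `hits` or `aggregations` in the response.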
amazon_keyboard_scraper.py - Main script featuring:
- Async/parallel page scraping for performance
- ScrapeGraphAI API integration via the `scrapegraph-py` SDK
- Elasticsearch bulk indexing
- 7 demonstration queries
- Progress tracking and error handling
src/scrapegraph_demo/elasticsearch_client.py - Elasticsearch wrapper:
- Index creation with optimized mapping
- Bulk indexing operations
- Search methods with filters
- Aggregation queries
- Statistics calculation
src/scrapegraph_demo/models.py - Data models:
- `Product` - Pydantic model for type-safe product data
- Validation and serialization
- Elasticsearch document conversion
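Bulk indexing in the wrapper presumably goes through `elasticsearch.helpers.bulk`, which consumes an iterable of actions. A sketch of such an action generator (the function name is an assumption, not the wrapper's real API):

```python
def product_actions(products, index="marketplace_products"):
    """Yield one bulk-API action per product dict.

    Feed the generator to elasticsearch.helpers.bulk(es, product_actions(items)).
    Using the ASIN as _id makes re-runs overwrite existing documents
    instead of creating duplicates.
    """
    for p in products:
        yield {
            "_index": index,
            "_id": p["product_id"],
            "_source": p,
        }
```

Because it is a generator, products stream into the bulk helper without building the whole payload in memory.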
After running the scraper, explore your data visually:
- Open Kibana: http://localhost:5601
- Create Index Pattern:
- Go to Management → Stack Management → Index Patterns
- Create pattern: `marketplace_products`
- Select timestamp field: `scraped_at`
- Explore in Discover: Browse all scraped products
- Create Visualizations:
- Pie Chart: Product distribution by brand
- Histogram: Price distribution
- Metric Cards: Average rating, total products
- Data Table: Top products by reviews
- Bar Chart: Products per category
Find all GPUs under €500:

```
category: "Gpu" AND price: [* TO 500]
```

Prime products with 4+ star ratings:

```
availability: "Prime" AND rating: [4 TO *]
```
The script includes a default API key for convenience. To use your own:
```bash
export SGAI_API_KEY=your-api-key-here
python amazon_keyboard_scraper.py
```

Get your API key at scrapegraphai.com
Create a `.env` file for custom configuration:

```bash
# ScrapeGraphAI
SGAI_API_KEY=your-api-key-here

# Elasticsearch (defaults shown)
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
```

Index: `marketplace_products`
| Field | Type | Purpose | Example |
|---|---|---|---|
| `product_id` | keyword | Unique identifier (ASIN) | B08N5WRWNW |
| `name` | text + keyword | Product name (searchable) | AMD Ryzen 9 5900X |
| `price` | float | Price for range queries | 449.99 |
| `currency` | keyword | Price currency | EUR |
| `marketplace` | keyword | Source marketplace | Amazon IT |
| `category` | keyword | Component type | CPU |
| `brand` | text + keyword | Brand (searchable + aggregatable) | AMD |
| `rating` | float | Star rating | 4.8 |
| `review_count` | integer | Number of reviews | 3521 |
| `availability` | keyword | Prime or Standard | Prime |
| `url` | keyword | Product URL | https://amazon.it/... |
| `specifications` | object | Additional metadata | {prime_eligible: true} |
| `scraped_at` | date | Timestamp | 2024-01-15T10:30:00Z |
Optimized for:
- Full-text search on
nameandbrand - Exact matching on
category,marketplace,availability - Range queries on
price,rating - Aggregations on
brand.keyword,category
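The mapping in the table translates directly into an index-creation call. A sketch, assuming the `elasticsearch` 8.x client (with `es` an `Elasticsearch` instance); the helper name is illustrative:

```python
MAPPINGS = {
    "properties": {
        "product_id":     {"type": "keyword"},
        "name":           {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "price":          {"type": "float"},
        "currency":       {"type": "keyword"},
        "marketplace":    {"type": "keyword"},
        "category":       {"type": "keyword"},
        "brand":          {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "rating":         {"type": "float"},
        "review_count":   {"type": "integer"},
        "availability":   {"type": "keyword"},
        "url":            {"type": "keyword"},
        "specifications": {"type": "object"},
        "scraped_at":     {"type": "date"},
    }
}


def ensure_index(es, index="marketplace_products"):
    # Create the index with the mapping above unless it already exists.
    if not es.indices.exists(index=index):
        es.indices.create(index=index, mappings=MAPPINGS)
```

The multi-field pattern (`text` plus a `.keyword` sub-field) is what lets `name` and `brand` serve both full-text search and exact-match aggregations.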
Async/Parallel Execution:
- Pages within each component scrape in parallel
- Components process sequentially
- ~80 pages typically complete in 2-3 minutes
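The parallel-within-category, sequential-across-categories pattern can be sketched with `asyncio.gather`. Here `scrape_page` is a stand-in for the real ScrapeGraphAI SDK call, not the script's actual implementation:

```python
import asyncio


async def scrape_page(category: str, page: int) -> list:
    # Stand-in for the real ScrapeGraphAI call; the actual script uses
    # the scrapegraph-py SDK here.
    await asyncio.sleep(0)  # simulate network I/O
    return [{"category": category, "page": page}]


async def scrape_category(category: str, pages: int = 10) -> list:
    # All pages of one category run concurrently...
    results = await asyncio.gather(
        *(scrape_page(category, p) for p in range(1, pages + 1)),
        return_exceptions=True,  # one failed page shouldn't sink the batch
    )
    products = []
    for r in results:
        if isinstance(r, Exception):
            continue  # log-and-skip, matching the script's error handling
        products.extend(r)
    return products


async def scrape_all(categories: list) -> list:
    # ...while categories are processed one after another.
    all_products = []
    for cat in categories:
        all_products.extend(await scrape_category(cat))
    return all_products
```

`return_exceptions=True` is what makes partial page failures non-fatal: a failed page yields an exception object in the results list rather than aborting the whole gather.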
Expected Results:
- 800-1000 total products
- ~100-125 products per component category
- Some pages may fail (network issues, rate limiting) - this is normal
This project demonstrates the foundation for many applications:
- Price Monitoring: Track price changes over time
- Stock Alerts: Notify when products become available
- Price Comparison: Find the best deals across categories
- Market Analysis: Identify pricing trends and patterns
- Recommendation Engine: "Find similar products"
- Smart Filters: Multi-dimensional product filtering
- Personalized Results: User preference-based ranking
- Brand Analysis: Market share and pricing strategies
- Category Insights: Popular products per category
- Prime Impact: How Prime affects pricing and ratings
- Review Correlation: Relationship between reviews and ratings
- Price Prediction: Forecast future price trends
- Sentiment Analysis: Analyze review text (extend scraping)
- Product Clustering: Group similar products automatically
- Anomaly Detection: Find unusual pricing or ratings
```bash
# Check if containers are running
docker compose ps

# View logs
docker compose logs elasticsearch

# Restart services
docker compose down && docker compose up -d
```

Note: Use `docker-compose` (with hyphen) if you have the standalone version installed.
Import errors: Make sure you installed dependencies:

```bash
pip install -r requirements.txt
```

API errors: The script will log errors but continue scraping remaining pages. Check:
- Internet connection
- API rate limits (wait a few minutes and retry)
- Logs for specific error messages
No products found: Some pages may be empty or fail to scrape. This is normal - the script handles failures gracefully and continues.
- Network speed affects scraping time
- Reduce `PAGES_PER_COMPONENT` in the script for faster testing
- Some API rate limiting is expected
- ScrapeGraphAI: scrapegraphai.com
- ScrapeGraphAI SDK: github.com/ScrapeGraphAI/scrapegraph-sdk
- Elasticsearch Docs: elastic.co/guide
- Pydantic: docs.pydantic.dev
This project is provided for demonstration and educational purposes.
Built for hackathons with ❤️
Showcasing ScrapeGraphAI + Elasticsearch