AI-powered web scraping with intelligent extraction - Cloud or Local
Transform any website into structured data using Playwright automation and LLM-powered extraction. Built for modern web applications, RAG pipelines, and data workflows. Supports both cloud (OpenAI) and local LLMs (Ollama, vLLM, etc.) for complete data privacy.
- LLM Extraction - Convert web content to structured JSON using OpenAI or local models
- Batch Processing - Process multiple URLs efficiently with controlled concurrency
- API-first - REST endpoints secured with API keys, documented with Swagger
- Browser Automation - Full Playwright support with stealth mode
- Multiple Formats - Output as HTML, Markdown, or plain text
- Download Options - Individual files, ZIP archives, or consolidated JSON
- Smart Caching - File-based caching with configurable TTL
- Job Queue - Background processing with BullMQ and Redis
- Web Crawling - Multi-page crawling with configurable strategies
- Docker Ready - One-command deployment
- Local LLM Support - Run completely offline with Ollama, vLLM, LocalAI, or LiteLLM
- Privacy First - Keep your data processing entirely on-premises
git clone https://github.com/stretchcloud/deepscrape.git
cd deepscrape
npm install
cp .env.example .env

Edit .env with your settings:
API_KEY=your-secret-key
# Option 1: Use OpenAI (cloud)
LLM_PROVIDER=openai
OPENAI_API_KEY=your-openai-key
# Option 2: Use local model (e.g., Ollama)
# LLM_PROVIDER=ollama
# LLM_MODEL=llama3:latest
REDIS_HOST=localhost
CACHE_ENABLED=true

npm run dev

Test: curl http://localhost:3000/health
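If the server is up, the health endpoint returns a small JSON status payload (the exact fields here are illustrative and may differ by version):

{ "status": "ok" }

Scrape a URL and save the content as Markdown: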
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://example.com",
"options": { "extractorFormat": "markdown" }
}' | jq -r '.content' > content.md

Extract structured data using JSON Schema:
curl -X POST http://localhost:3000/api/extract-schema \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://news.example.com/article",
"schema": {
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Article headline"
},
"author": {
"type": "string",
"description": "Author name"
},
"publishDate": {
"type": "string",
"description": "Publication date"
}
},
"required": ["title"]
}
}' | jq -r '.extractedData' > schemadata.md

Scrapes a URL and uses an LLM to generate a concise summary of its content. Works with both OpenAI and local models.
curl -X POST http://localhost:3000/api/summarize \
-H "Content-Type: application/json" \
-H "X-API-Key: test-key" \
-d '{
"url": "https://en.wikipedia.org/wiki/Large_language_model",
"maxLength": 300,
"options": {
"temperature": 0.3,
"waitForSelector": "body",
"extractorFormat": "markdown"
}
}' | jq -r '.summary' > summary-output.md

Extract key information from technical documentation:
curl -X POST http://localhost:3000/api/extract-schema \
-H "Content-Type: application/json" \
-H "X-API-Key: test-key" \
-d '{
"url": "https://docs.github.com/en/rest/overview/permissions-required-for-github-apps",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"overview": {"type": "string"},
"permissionCategories": {"type": "array", "items": {"type": "string"}},
"apiEndpoints": {
"type": "array",
"items": {
"type": "object",
"properties": {
"endpoint": {"type": "string"},
"requiredPermissions": {"type": "array", "items": {"type": "string"}}
}
}
}
},
"required": ["title", "overview"]
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md

Extract and compare methodologies from research papers:
curl -X POST http://localhost:3000/api/extract-schema \
-H "Content-Type: application/json" \
-H "X-API-Key: test-key" \
-d '{
"url": "https://arxiv.org/abs/2005.14165",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"authors": {"type": "array", "items": {"type": "string"}},
"abstract": {"type": "string"},
"methodology": {"type": "string"},
"results": {"type": "string"},
"keyContributions": {"type": "array", "items": {"type": "string"}},
"citations": {"type": "number"}
}
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md

Extract complex data structures from a Medium article:
curl -X POST http://localhost:3000/api/extract-schema \
-H "Content-Type: application/json" \
-H "X-API-Key: test-key" \
-d '{
"url": "https://johnchildseddy.medium.com/typescript-llms-lessons-learned-from-9-months-in-production-4910485e3272",
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"author": {"type": "string"},
"keyInsights": {"type": "array", "items": {"type": "string"}},
"technicalChallenges": {"type": "array", "items": {"type": "string"}},
"businessImpact": {"type": "string"}
}
},
"options": {
"extractorFormat": "markdown"
}
}' | jq -r '.extractedData' > output.md

Process multiple URLs efficiently with controlled concurrency, automatic retries, and comprehensive download options.
curl -X POST http://localhost:3000/api/batch/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"urls": [
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/deploy-vais-prompt",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/overview",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/vertex-ai-studio-express-mode-quickstart",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/express-mode/vertex-ai-express-mode-api-quickstart"
],
"concurrency": 3,
"options": {
"extractorFormat": "markdown",
"waitForTimeout": 2000,
"stealthMode": true
}
}'

Response:
{
"success": true,
"batchId": "550e8400-e29b-41d4-a716-446655440000",
"totalUrls": 5,
"estimatedTime": 50000,
"statusUrl": "http://localhost:3000/api/batch/scrape/550e8400.../status"
}

Check batch status:

curl -X GET http://localhost:3000/api/batch/scrape/{batchId}/status \
-H "X-API-Key: your-secret-key"Response:
{
"success": true,
"batchId": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"totalUrls": 5,
"completedUrls": 4,
"failedUrls": 1,
"progress": 100,
"processingTime": 45230,
"results": [...]
}

# Download all results as markdown files in a ZIP
curl -X GET "http://localhost:3000/api/batch/scrape/{batchId}/download/zip?format=markdown" \
-H "X-API-Key: your-secret-key" \
--output "batch_results.zip"
# Extract the ZIP to get individual files
unzip batch_results.zip

ZIP contents:
1_example_com_page1.md
2_example_com_page2.md
3_example_com_page3.md
4_docs_example_com_api.md
batch_summary.json
# Get job IDs from status endpoint, then download individual files
curl -X GET "http://localhost:3000/api/batch/scrape/{batchId}/download/{jobId}?format=markdown" \
-H "X-API-Key: your-secret-key" \
--output "page1.md"# All results in a single JSON file
curl -X GET "http://localhost:3000/api/batch/scrape/{batchId}/download/json" \
-H "X-API-Key: your-secret-key" \
--output "batch_results.json"curl -X POST http://localhost:3000/api/batch/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"urls": ["https://example.com", "https://example.org"],
"concurrency": 5,
"timeout": 300000,
"maxRetries": 3,
"failFast": false,
"webhook": "https://your-app.com/webhook",
"options": {
"extractorFormat": "markdown",
"useBrowser": true,
"stealthMode": true,
"waitForTimeout": 5000,
"blockAds": true,
"actions": [
{"type": "click", "selector": ".accept-cookies", "optional": true},
{"type": "wait", "timeout": 2000}
]
}
}'

Cancel a batch:

curl -X DELETE http://localhost:3000/api/batch/scrape/{batchId} \
-H "X-API-Key: your-secret-key"

Start a multi-page crawl (automatically exports markdown files):
curl -X POST http://localhost:3000/api/crawl \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://docs.example.com",
"limit": 50,
"maxDepth": 3,
"strategy": "bfs",
"includePaths": ["^/docs/.*"],
"scrapeOptions": {
"extractorFormat": "markdown"
}
}'

Response includes output directory:
{
"success": true,
"id": "abc123-def456",
"url": "http://localhost:3000/api/crawl/abc123-def456",
"message": "Crawl initiated successfully. Individual pages will be exported as markdown files.",
"outputDirectory": "./crawl-output/abc123-def456"
}

Check crawl status (includes exported file info):
curl http://localhost:3000/api/crawl/{job-id} \
-H "X-API-Key: your-secret-key"Status response shows exported files:
{
"success": true,
"status": "completed",
"crawl": {...},
"jobs": [...],
"count": 15,
"exportedFiles": {
"count": 15,
"outputDirectory": "./crawl-output/abc123-def456",
"files": ["./crawl-output/abc123-def456/2024-01-15_abc123_example.com_page1.md", ...]
}
}

| Endpoint | Method | Description |
|---|---|---|
| /api/scrape | POST | Scrape single URL |
| /api/extract-schema | POST | Extract structured data |
| /api/summarize | POST | Generate content summary |
| /api/batch/scrape | POST | Start batch processing |
| /api/batch/scrape/:id/status | GET | Get batch status |
| /api/batch/scrape/:id/download/zip | GET | Download batch as ZIP |
| /api/batch/scrape/:id/download/json | GET | Download batch as JSON |
| /api/batch/scrape/:id/download/:jobId | GET | Download individual result |
| /api/batch/scrape/:id | DELETE | Cancel batch processing |
| /api/crawl | POST | Start web crawl |
| /api/crawl/:id | GET | Get crawl status |
| /api/cache | DELETE | Clear cache |
# Core
API_KEY=your-secret-key
PORT=3000
# LLM Configuration
LLM_PROVIDER=openai # or ollama, vllm, localai, litellm
# For OpenAI
OPENAI_API_KEY=your-key
OPENAI_MODEL=gpt-4o
# For Local Models
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=llama3:latest
LLM_TEMPERATURE=0.2
# Cache
CACHE_ENABLED=true
CACHE_TTL=3600
CACHE_DIRECTORY=./cache
# Redis (for job queue)
REDIS_HOST=localhost
REDIS_PORT=6379
# Crawl file export
CRAWL_OUTPUT_DIR=./crawl-output

interface ScraperOptions {
extractorFormat?: 'html' | 'markdown' | 'text'
waitForSelector?: string
waitForTimeout?: number
actions?: BrowserAction[] // click, scroll, wait, fill
skipCache?: boolean
cacheTtl?: number
stealthMode?: boolean
proxy?: string
userAgent?: string
}
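For example, a request that bypasses the cache to force a fresh fetch, combining options from the interface above:

curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
  "url": "https://example.com",
  "options": { "extractorFormat": "markdown", "skipCache": true }
}'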
# Build and run
docker build -t deepscrape .
docker run -d -p 3000:3000 --env-file .env deepscrape
# Or use docker-compose
docker-compose up -d

DeepScrape supports local LLM models through Ollama, vLLM, LocalAI, and other OpenAI-compatible servers. This allows you to run extraction entirely on your own hardware without external API calls.
- Update your .env file:
# Switch from OpenAI to Ollama
LLM_PROVIDER=ollama
LLM_BASE_URL=http://ollama:11434/v1
LLM_MODEL=llama3:latest # or qwen:7b, mistral, etc.

- Start Ollama with Docker:
# For macOS/Linux without GPU
docker-compose -f docker-compose.yml -f docker-compose.llm.yml -f docker/llm-providers/docker-compose.ollama-mac.yml up -d
# The first run will automatically pull your model

- Verify it's working:
# Test the LLM provider by making an API call
curl -X POST http://localhost:3000/api/summarize \
-H "Content-Type: application/json" \
-H "X-API-Key: test-key" \
-d '{"url": "https://example.com", "maxLength": 300}'| Provider | Best For | Docker Command |
|---|---|---|
| Ollama | Easy setup, many models | make llm-ollama |
| vLLM | High performance (GPU) | make llm-vllm |
| LocalAI | CPU inference | make llm-localai |
| LiteLLM | Multiple providers | make llm-litellm |
# List available models
docker exec deepscrape-ollama ollama list
# Pull a new model
docker exec deepscrape-ollama ollama pull llama3:70b
# Remove a model
docker exec deepscrape-ollama ollama rm llama3:70b

- Small models (1-7B params): Good for summaries and simple extraction
- Medium models (7-13B params): Better for complex schemas
- Large models (70B+ params): Best quality but slower (see the example below)
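For example, to trade speed for quality on complex schemas, pull a larger model and point DeepScrape at it (model tags are illustrative; any model available to your Ollama instance works):

# Pull a larger model into the Ollama container
docker exec deepscrape-ollama ollama pull llama3:70b

# Then in .env:
LLM_MODEL=llama3:70b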
If extraction seems slow or hangs:
# Check container logs
docker logs deepscrape-ollama
docker logs deepscrape-app
# Monitor resource usage
docker stats
# Clear cache if needed
docker exec deepscrape-app sh -c "rm -rf /app/cache/*"

See docs/LLM_PROVIDERS.md for detailed configuration options.
Provider-specific configurations are stored in the config/ directory:
config/
├── litellm/             # LiteLLM proxy configurations
│   └── config.yaml      # Routes and provider settings
└── localai/             # LocalAI model configurations
    └── gpt4all-j.yaml   # Example model configuration
To add custom models:
- LocalAI: Create a YAML file in config/localai/ with model parameters (see the sketch below)
- LiteLLM: Edit config/litellm/config.yaml to add new routes or providers
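For reference, a minimal sketch of a LocalAI model file in the style of the gpt4all-j example above (the model filename is illustrative and must point at weights available to your LocalAI instance):

# config/localai/gpt4all-j.yaml
name: gpt4all-j
parameters:
  model: ggml-gpt4all-j.bin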
DeepScrape can run entirely on your infrastructure without any external API calls:
- Local LLMs: Process sensitive data using on-premises models
- No Data Leakage: Your scraped content never leaves your network
- Compliance Ready: Perfect for GDPR, HIPAA, or other regulatory requirements
- Air-gapped Operation: Can run completely offline once models are downloaded
# Configure for local processing
export LLM_PROVIDER=ollama
export LLM_MODEL=llama3:latest
# Scrape internal documents
curl -X POST http://localhost:3000/api/extract-schema \
-H "X-API-Key: your-key" \
-d '{
"url": "https://internal.company.com/confidential-report",
"schema": {
"type": "object",
"properties": {
"classification": {"type": "string"},
"summary": {"type": "string"},
"keyFindings": {"type": "array", "items": {"type": "string"}}
}
}
}'

Interact with dynamic content:
{
"url": "https://example.com",
"options": {
"actions": [
{ "type": "click", "selector": "#load-more" },
{ "type": "wait", "timeout": 2000 },
{ "type": "scroll", "position": 1000 }
]
}
}
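ScraperOptions also lists a fill action for form inputs; a hedged sketch, assuming a value field by analogy with the actions above:

{ "type": "fill", "selector": "#search-input", "value": "deepscrape" }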
Supported crawl strategies:

- BFS (default) - Breadth-first exploration
- DFS - Depth-first for deep content (example below)
- Best-First - Priority-based on content relevance
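For example, a depth-first crawl for deeply nested documentation (a sketch; parameters mirror the crawl example earlier):

curl -X POST http://localhost:3000/api/crawl \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
  "url": "https://docs.example.com",
  "limit": 25,
  "maxDepth": 5,
  "strategy": "dfs",
  "scrapeOptions": { "extractorFormat": "markdown" }
}'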
Schema extraction tips:

- Use clear description fields in your JSON Schema
- Start with simple schemas and iterate
- Lower temperature values for consistent results
- Include examples in descriptions for better accuracy (see the sketch below)
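A small schema applying these tips, with an example embedded in the description (field names are illustrative):

{
  "type": "object",
  "properties": {
    "price": {
      "type": "string",
      "description": "Product price with currency symbol, e.g. \"$19.99\""
    }
  },
  "required": ["price"]
}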
Each crawled page is automatically exported as a markdown file with:
- Filename format: YYYY-MM-DD_crawlId_hostname_path.md
- YAML frontmatter with metadata (URL, title, crawl date, status)
- Organized structure: ./crawl-output/{crawl-id}/
- Automatic summary: Generated when crawl completes
Example file structure:
crawl-output/
├── abc123-def456/
│   ├── 2024-01-15_abc123_docs.example.com_getting-started.md
│   ├── 2024-01-15_abc123_docs.example.com_api-reference.md
│   ├── 2024-01-15_abc123_docs.example.com_tutorials.md
│   ├── abc123-def456_summary.md
│   ├── abc123-def456_consolidated.md    # All pages in one file
│   └── abc123-def456_consolidated.json  # Structured JSON export
└── xyz789-ghi012/
    └── ...
Consolidated Export Features:
- Single Markdown: All crawled pages combined into one readable file
- JSON Export: Structured data with metadata for programmatic use
- Auto-Generated: Created automatically when crawl completes
- Rich Metadata: Preserves all page metadata and crawl statistics
File content example:
---
url: "https://docs.example.com/getting-started"
title: "Getting Started Guide"
crawled_at: "2024-01-15T10:30:00.000Z"
status: 200
content_type: "markdown"
load_time: 1250ms
browser_mode: false
---
# Getting Started Guide
Welcome to the getting started guide...

┌─────────────┐    REST     ┌──────────────────────┐
│   Client    │────────────▶│ Express API Gateway  │
└─────────────┘             └──────────┬───────────┘
                                       │ (Job Payload)
                                       ▼
                            ┌──────────────────────┐
                            │   BullMQ Job Queue   │ (Redis)
                            └──────────┬───────────┘
                                       │
                          pulls job    │    pushes result
                                       ▼
┌────────────────┐ Playwright ┌───────────────┐   LLM   ┌──────────────┐
│ Scraper Worker │───────────▶│   Extractor   │────────▶│ OpenAI/Local │
└────────────────┘            └───────────────┘         └──────────────┘
 (Headless Browser)       (HTML → MD/Text/JSON)        (Cloud or On-Prem)
                                       │
                                       ▼
                            Cache Layer (FS/Redis)
- Batch processing with controlled concurrency
- Multiple download formats (ZIP, JSON, individual files)
- Browser pooling & warm-up
- Automatic schema generation (LLM)
- Prometheus metrics & Grafana dashboard
- Cloud-native cache backends (S3/Redis)
- Local LLM support (Ollama, vLLM, LocalAI)
- Web UI playground
- Advanced webhook payloads with retry logic
- Batch processing analytics and insights
- Auto-select best LLM based on task complexity
Apache 2.0 - see LICENSE file
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Star ⭐ this repo if you find it useful!