

AnyCrawl

Fast, scalable web scraping, site crawling, and SERP extraction, with multi-threading, multi-process execution, and batch tasks.


📖 Overview

AnyCrawl is a high‑performance crawling and scraping toolkit:

  • SERP crawling: multiple search engines, batch‑friendly
  • Web scraping: single‑page content extraction
  • Site crawling: full‑site traversal and collection
  • High performance: multi‑threading / multi‑process
  • Batch tasks: reliable and efficient
  • AI extraction: LLM‑powered structured data (JSON) extraction from pages

LLM‑friendly. Easy to integrate and use.

🚀 Quick Start

📖 See full docs: Docs

Generate an API Key (self-host)

If you enable authentication (ANYCRAWL_API_AUTH_ENABLED=true), generate an API key:

pnpm --filter api key:generate
# optionally name the key
pnpm --filter api key:generate -- default

The command prints the key's uuid, the key itself, and its credits. Use the printed key as a Bearer token in the Authorization header.

Run Inside Docker

If running AnyCrawl via Docker:

  • Docker Compose:
docker compose exec api pnpm --filter api key:generate
docker compose exec api pnpm --filter api key:generate -- default
  • Single container (replace <container_name_or_id>):
docker exec -it <container_name_or_id> pnpm --filter api key:generate
docker exec -it <container_name_or_id> pnpm --filter api key:generate -- default

📚 Usage Examples

💡 Use the Playground to test APIs and generate code in your preferred language.

If self‑hosting, replace https://api.anycrawl.dev with your own server URL.

Web Scraping (Scrape)

Example

curl -X POST https://api.anycrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'

Parameters

  • url (string, required): The URL to be scraped. Must be a valid URL starting with http:// or https://
  • engine (string): Scraping engine to use. Options: cheerio (static HTML parsing, fastest), playwright (JavaScript rendering with a modern engine), puppeteer (JavaScript rendering with Chrome). Default: cheerio
  • proxy (string): Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: http://[username]:[password]@proxy:port. Default: (none)

More parameters: see Request Parameters.
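
If you call the API from Node.js/TypeScript instead of curl, a minimal sketch using the built-in fetch (Node 18+, run as an ES module) looks like the following. ANYCRAWL_API_KEY is an assumed environment variable holding your key, and the response is simply printed because its exact shape depends on the options you request.

// Minimal sketch: POST /v1/scrape from Node.js 18+ (ES module, built-in fetch).
// ANYCRAWL_API_KEY is an assumed environment variable holding your API key.
const scrapeResponse = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`,
    },
    body: JSON.stringify({
        url: "https://example.com",
        engine: "cheerio",
    }),
});
if (!scrapeResponse.ok) {
    throw new Error(`Scrape failed: ${scrapeResponse.status}`);
}
console.log(await scrapeResponse.json());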

LLM Extraction

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      }
    }
  }'
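
The same schema-based extraction can be requested from TypeScript. This is a sketch assuming the json_options body from the curl example above; the extracted result is left untyped because its fields are defined by your schema.

// Sketch: schema-based extraction via json_options (Node.js 18+, ES module).
const extractionResponse = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`, // assumed env var
    },
    body: JSON.stringify({
        url: "https://example.com",
        json_options: {
            schema: {
                type: "object",
                properties: {
                    company_mission: { type: "string" },
                    is_open_source: { type: "boolean" },
                    employee_count: { type: "number" },
                },
                required: ["company_mission"],
            },
        },
    }),
});
console.log(await extractionResponse.json());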

Site Crawling (Crawl)

Example

curl -X POST https://api.anycrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "playwright",
  "max_depth": 2,
  "limit": 10,
  "strategy": "same-domain"
}'

Parameters

  • url (string, required): Starting URL to crawl
  • engine (string): Crawling engine. Options: cheerio, playwright, puppeteer. Default: cheerio
  • max_depth (number): Maximum depth from the start URL. Default: 10
  • limit (number): Maximum number of pages to crawl. Default: 100
  • strategy (enum): Crawl scope. Options: all, same-domain, same-hostname, same-origin. Default: same-domain
  • include_paths (array): Only crawl paths matching these patterns. Default: (none)
  • exclude_paths (array): Skip paths matching these patterns. Default: (none)
  • scrape_options (object): Per-page scrape options (formats, timeout, JSON extraction, etc.), same as the Scrape options. Default: (none)

More parameters and endpoints: see Request Parameters.
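
A crawl can be started from TypeScript with the same fetch pattern. This is a minimal sketch; the response is printed as-is because the job/result payload returned by /v1/crawl is not detailed here.

// Sketch: start a site crawl via POST /v1/crawl (Node.js 18+, ES module).
const crawlResponse = await fetch("https://api.anycrawl.dev/v1/crawl", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`, // assumed env var
    },
    body: JSON.stringify({
        url: "https://example.com",
        engine: "playwright",
        max_depth: 2,
        limit: 10,
        strategy: "same-domain",
    }),
});
console.log(await crawlResponse.json());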

Search Engine Results (SERP)

Example

curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'

Parameters

  • query (string, required): Search query to be executed
  • engine (string): Search engine to use. Options: google. Default: google
  • pages (integer): Number of search result pages to retrieve. Default: 1
  • lang (string): Language code for search results (e.g., 'en', 'zh', 'all'). Default: en-US

Supported search engines

  • Google
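
A search request follows the same pattern as the other endpoints. The sketch below reuses the request body from the curl example and prints whatever result list the API returns.

// Sketch: run a SERP query via POST /v1/search (Node.js 18+, ES module).
const searchResponse = await fetch("https://api.anycrawl.dev/v1/search", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`, // assumed env var
    },
    body: JSON.stringify({
        query: "AnyCrawl",
        limit: 10,
        engine: "google",
        lang: "all",
    }),
});
console.log(await searchResponse.json());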

❓ FAQ

  1. Can I use proxies? Yes. AnyCrawl ships with a high‑quality default proxy. You can also configure your own: set the proxy request parameter (per request) or ANYCRAWL_PROXY_URL (self‑hosting); a sketch follows this list.
  2. How do I handle JavaScript‑rendered pages? Use the Playwright or Puppeteer engines.
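
For the per-request option, the proxy parameter is passed alongside url in the scrape body. A minimal sketch follows; the proxy URL and credentials are placeholders.

// Sketch: per-request proxy, passed as the `proxy` parameter of a scrape request.
// The proxy URL below is a placeholder; HTTP and SOCKS proxies are supported.
const proxiedScrape = await fetch("https://api.anycrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ANYCRAWL_API_KEY}`, // assumed env var
    },
    body: JSON.stringify({
        url: "https://example.com",
        engine: "cheerio",
        proxy: "http://user:pass@proxy.example.com:8080",
    }),
});
console.log(await proxiedScrape.json());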

🤝 Contributing

We welcome contributions! See the Contributing Guide.

📄 License

MIT License — see LICENSE.

🎯 Mission

We build simple, reliable, and scalable tools for the AI ecosystem.


Built with ❤️ by the Any4AI team
