A web crawler is a program that starts from a single webpage (called a seed URL), visits all the links it finds on that page, and then repeats the process on each newly discovered page. Crawlers are used to collect information, index websites, or analyze website structure.
This is a fast and concurrent web crawler written in Rust. It starts from a user-specified URL, visits pages up to a set depth, and extracts all valid hyperlinks. The crawler avoids revisiting the same URL, respects concurrency limits, and stores the page data efficiently.
- Asynchronous Programming (`async`/`await`, `tokio`): Handles multiple web requests concurrently without blocking threads.
- Concurrency (`Arc`, `Mutex`, `Semaphore`): Shares data safely across tasks and limits the number of parallel requests.
- Type Safety and Error Handling (`Result`, `thiserror`): Ensures reliable, predictable behavior even when requests fail.
- Ownership and Borrowing: Prevents memory bugs and ensures thread safety at compile time.
- Modules and Crates: Project is modular, with responsibilities divided into separate files.
- Concurrent Task Lifecycle with `JoinSet`: A robust architecture for managing the lifecycle of spawned crawl tasks (improved from a channel-based model for better task tracking and safety); see the sketch after this list.
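The following is a minimal sketch of how a `Semaphore` and a `JoinSet` can be combined to cap parallel work while still awaiting every spawned task. It is illustrative only and does not reproduce the project's own code; the URL list and the permit count are made up for the example.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    // Cap how many tasks may do "work" at the same time.
    let limit = Arc::new(Semaphore::new(3));
    let urls = vec!["https://example.com/a", "https://example.com/b"];

    let mut tasks = JoinSet::new();
    for url in urls {
        let limit = Arc::clone(&limit);
        tasks.spawn(async move {
            // Each task holds a permit for the duration of its "fetch".
            let _permit = limit.acquire_owned().await.expect("semaphore closed");
            println!("fetching {url}");
            // ... perform the HTTP request and parsing here ...
        });
    }

    // Drain the JoinSet so every spawned task is awaited and its
    // outcome (success or panic) is observed.
    while let Some(result) = tasks.join_next().await {
        result.expect("task panicked");
    }
}
```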
```
src/
│
├── main.rs      # CLI entry point and crawler initialization
├── crawler.rs   # Core logic for managing crawl workflow
├── fetcher.rs   # Handles HTTP requests with delay control
├── parser.rs    # Parses HTML and extracts links
└── storage.rs   # Tracks visited URLs and stores page data
```
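As a small, assumed sketch of how that layout is wired together (not the actual file contents), `main.rs` would declare the sibling files as modules so their items are reachable from the entry point:

```rust
// main.rs (sketch): declare the sibling source files as modules.
// Each `mod` line corresponds to one file in the tree above.
mod crawler;
mod fetcher;
mod parser;
mod storage;
```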
- Input: CLI arguments set start URL, depth, and concurrency.
- Initialize: Build a shared `CrawlerConfig` and necessary components.
- JoinSet Spawning: Start crawl tasks using `tokio::task::JoinSet`.
- Fetch: Request HTML using `reqwest`.
- Parse: Use `scraper` to extract valid hyperlinks (see the sketch after this list).
- Track & Store: Save page content and mark URLs as visited.
- Queue New Tasks: If within depth limit, spawn new crawl tasks for discovered links.
- Complete: Print a structured summary once the crawl finishes.
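A minimal sketch of the fetch-and-parse steps, using `reqwest` and `scraper` directly rather than the project's own fetcher and parser functions; the target URL is just an example:

```rust
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch: download the page body over HTTP.
    let body = reqwest::get("https://quotes.toscrape.com")
        .await?
        .text()
        .await?;

    // Parse: collect the target of every <a href="..."> element.
    let document = Html::parse_document(&body);
    let selector = Selector::parse("a[href]").expect("valid CSS selector");
    let links: Vec<String> = document
        .select(&selector)
        .filter_map(|a| a.value().attr("href"))
        .map(str::to_owned)
        .collect();

    println!("found {} links", links.len());
    Ok(())
}
```

In the real crawler, each extracted link would then be checked against the visited set and, if the depth limit allows, spawned as a new crawl task.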
- ✅ Asynchronous Architecture using `tokio`
- ✅ Concurrent Task Management using `tokio::task::JoinSet`
- ✅ Rate Limiting to avoid flooding target servers using `Semaphore`
- ✅ Skips duplicate links with `DashSet` (sketched below)
- ✅ Clean logging with `tracing`
- ✅ CLI interface using `clap`
- ✅ Modular code structure
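For the duplicate-link check in particular, `DashSet::insert` returns `true` only the first time a value is added, which gives a simple shared "visit each URL once" guard. The snippet below is an assumed sketch with made-up URLs, not the project's storage code:

```rust
use dashmap::DashSet;

fn main() {
    // Shared set of URLs that have already been scheduled or visited.
    let visited: DashSet<String> = DashSet::new();

    let urls = [
        "https://example.com",
        "https://example.com",       // duplicate, should be skipped
        "https://example.com/about",
    ];

    for url in urls {
        // `insert` returns false when the value was already present,
        // so repeated URLs are filtered out without explicit locking.
        if visited.insert(url.to_string()) {
            println!("crawling {url}");
        } else {
            println!("skipping duplicate {url}");
        }
    }
}
```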
```bash
git clone https://github.com/Rishi2333/rust-webcrawler.git
cd rust-webcrawler
cargo build --release
cargo run --release -- --url https://example.com --depth 2 --concurrency 20
```

- `--url`: (Required) Starting point for the crawl
- `--depth`: (Optional) Max link depth (default: 2)
- `--max-pages`: (Optional) Max pages per domain (default: 50)
- `--concurrency`: (Optional) Parallel requests limit (default: 10)
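An argument struct matching the flags above might be declared with `clap`'s derive API roughly as follows; the struct and field names are assumptions for illustration, only the flags and defaults come from the list above:

```rust
use clap::Parser;

/// Illustrative argument definition mirroring the documented flags.
#[derive(Parser, Debug)]
struct Args {
    /// Starting point for the crawl (required).
    #[arg(long)]
    url: String,

    /// Max link depth (default: 2).
    #[arg(long, default_value_t = 2)]
    depth: usize,

    /// Max pages per domain (default: 50).
    #[arg(long, default_value_t = 50)]
    max_pages: usize,

    /// Parallel requests limit (default: 10).
    #[arg(long, default_value_t = 10)]
    concurrency: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```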
```bash
cargo run --release -- --url https://quotes.toscrape.com --depth 1 --concurrency 10
```

- Support for `robots.txt`
- Save results to SQLite/PostgreSQL
- Text content analysis
- Multi-domain crawl filtering
This project is licensed under the MIT License.