crawlurls
crawlurls is a small, production-ready Rust CLI that walks HTML pages starting from one or more URLs and reports any links that match an optional pattern. It stays fast and polite by using asynchronous networking, bounded concurrency, and sensible defaults that keep the crawl inside your domain unless you opt out.
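Under the default scope this amounts to a host comparison on every discovered link. The snippet below is an illustrative sketch only, not the crate's internals: it assumes the `url` crate and a hypothetical `in_scope` helper to show how a link can be confined to the starting host, optionally admitting its subdomains.

```rust
// Illustrative sketch only (not crawlurls internals); assumes url = "2" in Cargo.toml.
use url::Url;

/// Hypothetical helper: is `candidate` on `start`'s host, or on one of its
/// subdomains when `include_subdomains` is set?
fn in_scope(start: &Url, candidate: &Url, include_subdomains: bool) -> bool {
    match (start.host_str(), candidate.host_str()) {
        (Some(start_host), Some(cand_host)) => {
            cand_host == start_host
                || (include_subdomains && cand_host.ends_with(&format!(".{start_host}")))
        }
        _ => false,
    }
}

fn main() {
    let start = Url::parse("https://example.com").unwrap();
    let blog = Url::parse("https://blog.example.com/post").unwrap();
    let other = Url::parse("https://other.test/").unwrap();

    assert!(in_scope(&start, &blog, true));   // subdomain kept when opted in
    assert!(!in_scope(&start, &blog, false)); // dropped under the default scope
    assert!(!in_scope(&start, &other, true)); // foreign host is always out of scope
    println!("scope checks passed");
}
```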
Features
- Async crawler built on `reqwest` + `tokio` with configurable concurrency (see the sketch after this list)
- Link extraction accelerated with `rayon` so large pages are processed in parallel
- Optional regex filtering so you only keep URLs that matter
- Domain scoping helpers: stick to the starting host, include its subdomains, or allow all outbound links
- Depth limiting, redirect and timeout guards to keep runaway jobs in check
- Output to stdout or an additional file with automatic de-duplication
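As a rough picture of how the `reqwest` + `tokio` concurrency and the regex filter fit together, here is a minimal, self-contained sketch, assuming the `futures`, `regex`, `reqwest`, and `tokio` crates; the seed URLs and pattern are placeholders, and none of this is the crate's actual source.

```rust
// Illustrative sketch only (not crawlurls internals).
// Assumed Cargo.toml entries: futures = "0.3", regex = "1",
// reqwest = "0.12", tokio = { version = "1", features = ["full"] }.
use futures::stream::{self, StreamExt};
use regex::Regex;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let pattern = Regex::new("/products/")?; // placeholder for --pattern
    let seeds = vec![
        "https://example.com/".to_string(),
        "https://example.com/catalog".to_string(),
    ];

    // Fetch at most 16 pages at a time, mirroring the default --concurrency.
    let pages: Vec<(String, String)> = stream::iter(seeds)
        .map(|url| {
            let client = client.clone();
            async move {
                let body = client.get(url.as_str()).send().await?.text().await?;
                Ok::<_, reqwest::Error>((url, body))
            }
        })
        .buffer_unordered(16)
        .filter_map(|result| async move { result.ok() })
        .collect()
        .await;

    // Emit only URLs that match the pattern, as --pattern would.
    for (url, _body) in pages.iter().filter(|(url, _)| pattern.is_match(url)) {
        println!("{url}");
    }
    Ok(())
}
```

The `buffer_unordered` bound is what keeps the number of in-flight requests fixed no matter how many URLs are queued.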
Installation
You need Rust 1.74 or newer. From a checkout of the repository, install with:
cargo install --path .
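The crate also has published releases, so installing the packaged binary straight from crates.io should work as well:
cargo install crawlurls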
Usage
crawlurls [OPTIONS] <URL>...
Key flags:
- `--pattern <REGEX>` – only emit URLs that match the regex
- `--max-depth <N>` – stop enqueuing new pages deeper than `N`
- `--concurrency <N>` – number of pages resolved in parallel (default 16)
- `--allow-external` – follow and report links that leave the starting host(s)
- `--include-subdomains` – treat subdomains of the starting host(s) as in scope
- `--connect-timeout <SECS>` / `--request-timeout <SECS>` – tame slow responses
- `--max-redirects <N>` – cap redirect chains
- `--output <FILE>` – write matched URLs to a file as well as stdout
- `--exclude <REGEX>` – skip URLs matching the regex (repeatable)
- `--exclude-file <FILE>` – load newline-delimited exclude patterns from a file
- `--user-agent <STRING>` – set a custom user agent
Examples
Print every link inside a domain:
crawlurls https://example.com
Find all product URLs containing /products/ while staying on the origin host:
crawlurls https://example.com --pattern '/products/'
Discover sitemap entries across subdomains, with deeper crawl depth and a custom agent string:
crawlurls https://example.com \
https://blog.example.com \
--include-subdomains \
--max-depth 3 \
--user-agent "crawlurls-batch/1.0"
Write the matching URLs to an artifacts file for post-processing:
crawlurls https://example.com --pattern 'example\.com/s' --output urls.txt
Avoid admin dashboards while still printing marketing pages:
crawlurls https://example.com --pattern '/promo/' --exclude '/admin'
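Keep a long crawl in check by combining the concurrency and timeout guards from the flag list above (the values here are only illustrative):
crawlurls https://example.com \
--concurrency 8 \
--connect-timeout 5 \
--request-timeout 15 \
--max-redirects 5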
Notes
- Only `http` and `https` schemes are crawled.
- Non-HTML responses are skipped automatically.
- Each matching URL is emitted once even if encountered multiple times.
- Be respectful of the sites you crawl; honor published guidance and consider adding delays if required.