crawlurls
crawlurls is a small, production-ready Rust CLI that walks HTML pages starting from one or more URLs and reports any links that match an optional pattern. It stays fast and polite by using asynchronous networking, bounded concurrency, and sensible defaults that keep the crawl inside your domain unless you opt out.
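Under the default scope this amounts to a host comparison on every discovered link. The snippet below is an illustrative sketch only, not the crate's internals: it assumes the `url` crate and a hypothetical `in_scope` helper to show how a link can be confined to the starting host, optionally admitting its subdomains.

```rust
// Illustrative sketch only (not crawlurls internals); assumes url = "2" in Cargo.toml.
use url::Url;

/// Hypothetical helper: is `candidate` on `start`'s host, or on one of its
/// subdomains when `include_subdomains` is set?
fn in_scope(start: &Url, candidate: &Url, include_subdomains: bool) -> bool {
    match (start.host_str(), candidate.host_str()) {
        (Some(start_host), Some(cand_host)) => {
            cand_host == start_host
                || (include_subdomains && cand_host.ends_with(&format!(".{start_host}")))
        }
        _ => false,
    }
}

fn main() {
    let start = Url::parse("https://example.com").unwrap();
    let blog = Url::parse("https://blog.example.com/post").unwrap();
    let other = Url::parse("https://other.test/").unwrap();

    assert!(in_scope(&start, &blog, true));   // subdomain kept when opted in
    assert!(!in_scope(&start, &blog, false)); // dropped under the default scope
    assert!(!in_scope(&start, &other, true)); // foreign host is always out of scope
    println!("scope checks passed");
}
```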
Features
- Async crawler built on `reqwest` + `tokio` with configurable concurrency (see the sketch after this list)
- Link extraction accelerated with `rayon` so large pages are processed in parallel
- Optional regex filtering so you only keep URLs that matter
- Domain scoping helpers: stick to the starting host, include its subdomains, or allow all outbound links
- Depth limiting, redirect and timeout guards to keep runaway jobs in check
- Output to stdout or an additional file with automatic de-duplication
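As a rough picture of how the `reqwest` + `tokio` concurrency and the regex filter fit together, here is a minimal, self-contained sketch, assuming the `futures`, `regex`, `reqwest`, and `tokio` crates; the seed URLs and pattern are placeholders, and none of this is the crate's actual source.

```rust
// Illustrative sketch only (not crawlurls internals).
// Assumed Cargo.toml entries: futures = "0.3", regex = "1",
// reqwest = "0.12", tokio = { version = "1", features = ["full"] }.
use futures::stream::{self, StreamExt};
use regex::Regex;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let pattern = Regex::new("/products/")?; // placeholder for --pattern
    let seeds = vec![
        "https://example.com/".to_string(),
        "https://example.com/catalog".to_string(),
    ];

    // Fetch at most 16 pages at a time, mirroring the default --concurrency.
    let pages: Vec<(String, String)> = stream::iter(seeds)
        .map(|url| {
            let client = client.clone();
            async move {
                let body = client.get(url.as_str()).send().await?.text().await?;
                Ok::<_, reqwest::Error>((url, body))
            }
        })
        .buffer_unordered(16)
        .filter_map(|result| async move { result.ok() })
        .collect()
        .await;

    // Emit only URLs that match the pattern, as --pattern would.
    for (url, _body) in pages.iter().filter(|(url, _)| pattern.is_match(url)) {
        println!("{url}");
    }
    Ok(())
}
```

The `buffer_unordered` bound is what keeps the number of in-flight requests fixed no matter how many URLs are queued.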
Installation
You need Rust 1.74 or newer. From a checkout of the repository, install with:
cargo install --path .
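The crate also has published releases, so installing the packaged binary straight from crates.io should work as well:
cargo install crawlurls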
Usage
crawlurls [OPTIONS] <URL>...
Key flags:
- `--pattern <REGEX>` – only emit URLs that match the regex
- `--max-depth <N>` – stop enqueuing new pages deeper than `N`
- `--concurrency <N>` – number of pages resolved in parallel (default 16)
- `--allow-external` – follow and report links that leave the starting host(s)
- `--include-subdomains` – treat subdomains of the starting host(s) as in scope
- `--connect-timeout <SECS>` / `--request-timeout <SECS>` – tame slow responses
- `--max-redirects <N>` – cap redirect chains
- `--output <FILE>` – write matched URLs to a file as well as stdout
- `--exclude <REGEX>` – skip URLs matching the regex (repeatable)
- `--exclude-file <FILE>` – load newline-delimited exclude patterns from a file
- `--user-agent <STRING>` – set a custom user agent
Examples
Print every link inside a domain:
crawlurls https://example.com
Find all product URLs containing /products/ while staying on the origin host:
crawlurls https://example.com --pattern '/products/'
Discover sitemap entries across subdomains, with deeper crawl depth and a custom agent string:
crawlurls https://example.com \
https://blog.example.com \
--include-subdomains \
--max-depth 3 \
--user-agent "crawlurls-batch/1.0"
Write the matching URLs to an artifacts file for post-processing:
crawlurls https://example.com --pattern 'example\.com/s' --output urls.txt
Avoid admin dashboards while still printing marketing pages:
crawlurls https://example.com --pattern '/promo/' --exclude '/admin'
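Keep a long crawl in check by combining the concurrency and timeout guards from the flag list above (the values here are only illustrative):
crawlurls https://example.com \
--concurrency 8 \
--connect-timeout 5 \
--request-timeout 15 \
--max-redirects 5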
Notes
- Only `http` and `https` schemes are crawled.
- Non-HTML responses are skipped automatically.
- Each matching URL is emitted once even if encountered multiple times.
- Be respectful of the sites you crawl; honor published guidance and consider adding delays if required.