A fast and efficient web crawler CLI tool for discovering and mapping URLs within a website. Built with Go for high performance and concurrent crawling.
- Recursive Link Discovery: Automatically discover all links within a website
- Same-Domain Filtering: Focus crawling on a specific domain to avoid external links
- Concurrent Processing: High-performance crawling with configurable worker pools
- Depth Limiting: Control crawl depth to prevent infinite recursion
- Progress Indicators: Real-time progress reporting during crawling operations
- Rate Limiting: Respectful crawling with configurable request rates
- Graceful Shutdown: Interrupt-safe with proper cleanup on termination
- Structured Logging: Comprehensive logging with verbose mode support
- Multiple Output Formats: URLs output to stdout, logs to stderr
- Custom User Agent: Configurable user agent strings for identification
Download the latest binary from the releases page:
# Linux (AMD64)
curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-linux-amd64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

# Linux (ARM64)
curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-linux-arm64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

# macOS (Intel)
curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-darwin-amd64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

# macOS (Apple Silicon)
curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-darwin-arm64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

Windows: Download urlmap-windows-amd64.zip from the releases page and extract the executable.
Run with Docker without installation:
# Pull from GitHub Container Registry
docker pull ghcr.io/aoshimash/urlmap:latest
# Basic usage
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com

Requirements: Go 1.21 or higher
# Clone the repository
git clone https://github.com/aoshimash/urlmap.git
cd urlmap
# Build the application
go build -o urlmap ./cmd/urlmap
# Install globally (optional)
sudo mv urlmap /usr/local/bin/

# Crawl a website with default settings
urlmap https://example.com
# Check version
urlmap version
# Get help
urlmap --help

# Limit crawl depth to 3 levels
urlmap --depth 3 https://example.com
# Use 20 concurrent workers for faster crawling
urlmap --concurrent 20 https://example.com
# Enable verbose logging
urlmap --verbose https://example.com
# Custom user agent
urlmap --user-agent "MyBot/1.0" https://example.com
# Rate limiting (5 requests per second)
urlmap --rate-limit 5 https://example.com
# Disable progress indicators
urlmap --progress=false https://example.com
# Combined options
urlmap --depth 5 --concurrent 15 --verbose --rate-limit 2 https://example.com

# Basic crawling
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com
# With options
docker run --rm ghcr.io/aoshimash/urlmap:latest --depth 3 --concurrent 20 https://example.com
# Save output to file
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com > urls.txt
# Interactive mode with shell access
docker run -it --rm ghcr.io/aoshimash/urlmap:latest /bin/sh

| Flag | Short | Default | Description |
|---|---|---|---|
| --depth | -d | -1 (unlimited) | Maximum crawl depth |
| --concurrent | -c | 10 | Number of concurrent workers |
| --verbose | -v | false | Enable verbose logging |
| --user-agent | -u | urlmap/1.0.0 | Custom User-Agent string |
| --progress | -p | true | Show progress indicators |
| --rate-limit | -r | 0 (no limit) | Rate limit (requests per second) |
| --help | -h | - | Show help message |
# Crawl a simple website
urlmap https://example.com

Output:
https://example.com
https://example.com/about
https://example.com/contact
https://example.com/products
# Only crawl up to 2 levels deep
urlmap --depth 2 https://blog.example.com

# Use 50 concurrent workers for large sites
urlmap --concurrent 50 --verbose https://large-site.example.com

# Limit to 1 request per second with custom user agent
urlmap --rate-limit 1 --user-agent "Research Bot 1.0 ([email protected])" https://example.com

# Save URLs to a file
urlmap https://example.com > discovered_urls.txt
# Save with timestamps and logs
urlmap --verbose https://example.com > urls.txt 2> crawl.log

# Optimized for large sites with progress tracking
urlmap --depth 5 --concurrent 30 --rate-limit 10 --verbose https://large-site.com

urlmap follows a modular architecture for maintainability and extensibility:
urlmap/
├── cmd/urlmap/   # CLI application entry point
├── internal/
│   ├── client/   # HTTP client with retry logic
│   ├── config/   # Configuration and logging setup
│   ├── crawler/  # Core crawling engine
│   ├── output/   # Output formatting and handling
│   ├── parser/   # HTML parsing and link extraction
│   ├── progress/ # Progress reporting and statistics
│   └── url/      # URL validation and normalization
└── pkg/utils/    # Public utilities
- Crawler Engine: Concurrent crawler with worker pool architecture (see the sketch after this list)
- HTTP Client: Resilient HTTP client with timeout and retry logic
- Link Parser: HTML parser using goquery for reliable link extraction
- URL Manager: URL validation, normalization, and domain filtering
- Progress Reporter: Real-time crawling statistics and progress tracking
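To make the worker-pool idea concrete, here is a minimal, self-contained sketch of a breadth-first crawl in Go. It illustrates the general technique rather than urlmap's actual code: a bounded pool of goroutines fetches each depth level, extracts links with goquery, keeps only same-host http/https URLs, and feeds the next level. The start URL, worker count, and depth limit are placeholder values.

```go
// Sketch of a level-by-level (BFS) crawl with a bounded worker pool.
// Simplified illustration only; not urlmap's implementation.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync"

	"github.com/PuerkitoBio/goquery"
)

// fetchLinks downloads one page and returns its absolute same-host links.
func fetchLinks(client *http.Client, page *url.URL) []*url.URL {
	resp, err := client.Get(page.String())
	if err != nil {
		return nil
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil
	}

	var links []*url.URL
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		ref, err := url.Parse(href)
		if err != nil {
			return
		}
		abs := page.ResolveReference(ref)
		abs.Fragment = "" // normalize: drop #fragments
		if abs.Host == page.Host && (abs.Scheme == "http" || abs.Scheme == "https") {
			links = append(links, abs)
		}
	})
	return links
}

func main() {
	start, _ := url.Parse("https://example.com") // placeholder start URL
	const workers, maxDepth = 10, 3              // placeholder limits

	client := &http.Client{}
	visited := map[string]bool{start.String(): true}
	frontier := []*url.URL{start}

	for depth := 0; depth <= maxDepth && len(frontier) > 0; depth++ {
		var (
			mu   sync.Mutex
			next []*url.URL
			wg   sync.WaitGroup
			sem  = make(chan struct{}, workers) // bounds concurrency
		)
		for _, page := range frontier {
			wg.Add(1)
			sem <- struct{}{}
			go func(page *url.URL) {
				defer wg.Done()
				defer func() { <-sem }()
				fmt.Println(page) // discovered URL to stdout
				for _, link := range fetchLinks(client, page) {
					mu.Lock()
					if !visited[link.String()] {
						visited[link.String()] = true
						next = append(next, link)
					}
					mu.Unlock()
				}
			}(page)
		}
		wg.Wait()
		frontier = next
	}
}
```

Processing one depth level at a time keeps the concurrency bound simple and avoids coordinating an open-ended work queue.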
urlmap is optimized for performance with the following characteristics:
- Small sites (< 100 pages): ~50-100 URLs/second
- Medium sites (100-1000 pages): ~30-50 URLs/second
- Large sites (> 1000 pages): ~20-30 URLs/second
Performance varies based on:
- Network latency and bandwidth
- Target server response times
- Number of concurrent workers
- Page complexity and size
- Concurrent Workers: Increase `--concurrent` for I/O-bound crawling
- Rate Limiting: Use `--rate-limit` to avoid overwhelming servers (see the rate-limiter sketch after this list)
- Depth Control: Set an appropriate `--depth` to avoid infinite crawling
- Progress Tracking: Disable with `--progress=false` for a slight performance gain
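One common way to implement a per-second request cap across many workers (conceptually what `--rate-limit` does; urlmap's internals may differ) is a single shared token-bucket limiter that every worker waits on before sending a request. A small sketch using golang.org/x/time/rate:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

func main() {
	// 5 requests per second shared by all workers, burst of 1 (placeholder values).
	limiter := rate.NewLimiter(rate.Limit(5), 1)
	urls := []string{"https://example.com", "https://example.com/about"}

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			// Block until the shared limiter grants a token.
			if err := limiter.Wait(context.Background()); err != nil {
				return
			}
			resp, err := http.Get(u)
			if err != nil {
				return
			}
			resp.Body.Close()
			fmt.Println(resp.Status, u)
		}(u)
	}
	wg.Wait()
}
```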
Typical memory usage:
- Base memory: ~10-20 MB
- Per worker: ~1-2 MB
- URL storage: ~100 bytes per URL
- For 10,000 URLs: typically ~50-100 MB total
# Error: permission denied
sudo chmod +x urlmap
# Or install to user directory
mv urlmap ~/.local/bin/

# Test URL accessibility first
curl -I https://example.com
# Check DNS resolution
nslookup example.com
# Use verbose mode for debugging
urlmap --verbose https://example.com

# Reduce concurrent workers and add rate limiting
urlmap --concurrent 5 --rate-limit 1 https://example.com

# Reduce concurrent workers
urlmap --concurrent 5 --depth 3 https://large-site.com
# Monitor memory usage
urlmap --verbose https://example.com 2>&1 | grep -i memory

# Check certificate validity
curl -I https://example.com
# For development/testing only (not recommended for production)
# Currently not configurable - urlmap validates all certificates

Respect robots.txt rules and crawl delays:
# Enable robots.txt respect (follows Disallow/Allow rules and Crawl-delay)
urlmap --respect-robots https://example.com
# Combined with other options
urlmap --respect-robots --verbose --depth 5 https://example.com
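Conceptually, honoring robots.txt means fetching /robots.txt once per host, checking each candidate path against its Allow/Disallow rules, and pausing between requests when a Crawl-delay is declared. The sketch below shows that flow using the third-party github.com/temoto/robotstxt parser; this is an assumed illustration, not necessarily the library or logic urlmap uses.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/temoto/robotstxt"
)

func main() {
	agent := "urlmap/1.0.0"

	// Fetch and parse the host's robots.txt once.
	resp, err := http.Get("https://example.com/robots.txt")
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()

	robots, err := robotstxt.FromBytes(body)
	if err != nil {
		panic(err)
	}

	// Check a path against the Disallow/Allow rules before crawling it.
	if !robots.TestAgent("/private/", agent) {
		fmt.Println("skipping /private/: disallowed by robots.txt")
	}

	// Respect Crawl-delay for our user-agent group, if one is declared.
	if group := robots.FindGroup(agent); group != nil && group.CrawlDelay > 0 {
		fmt.Println("sleeping between requests:", group.CrawlDelay)
		time.Sleep(group.CrawlDelay)
	}
}
```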
Choose from multiple output formats:

# JSON output
urlmap --output-format json https://example.com
# CSV output
urlmap --output-format csv https://example.com
# XML output
urlmap --output-format xml https://example.com
# Default text output (one URL per line)
urlmap --output-format text https://example.com
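The structured formats are handy when piping results into other tools. As a rough illustration of how such output could be produced (the Result fields and column names here are hypothetical, not urlmap's exact schema), the standard library is enough to emit both JSON and CSV:

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"os"
	"strconv"
)

// Result is a hypothetical record for one discovered URL.
type Result struct {
	URL   string `json:"url"`
	Depth int    `json:"depth"`
}

func main() {
	results := []Result{
		{URL: "https://example.com", Depth: 0},
		{URL: "https://example.com/about", Depth: 1},
	}

	// JSON: one pretty-printed array on stdout.
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	_ = enc.Encode(results)

	// CSV: header row followed by one line per URL.
	w := csv.NewWriter(os.Stdout)
	_ = w.Write([]string{"url", "depth"})
	for _, r := range results {
		_ = w.Write([]string{r.URL, strconv.Itoa(r.Depth)})
	}
	w.Flush()
}
```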
For websites that load content dynamically with JavaScript:

# Enable JavaScript rendering
urlmap --js-render https://spa-website.com
# Configure browser and timeout
urlmap --js-render --js-browser firefox --js-timeout 60s https://example.com

Enable verbose logging to troubleshoot issues:
urlmap --verbose https://example.com 2> debug.log

Log levels include:
- INFO: General crawling progress
- DEBUG: Detailed URL processing
- WARN: Non-fatal issues (failed URLs, timeouts)
- ERROR: Fatal errors that stop crawling
If crawling is slow:
- Check Network: Test direct access to the target site
- Adjust Workers: Try different `--concurrent` values (5-50)
- Monitor Rate Limits: Ensure you're not being throttled
- Use Rate Limiting: Add `--rate-limit` to be more respectful
# Performance testing command
time urlmap --depth 2 --concurrent 20 https://example.com > /dev/null

We welcome contributions! Please see our Contributing Guidelines for details.
# Clone the repository
git clone https://github.com/aoshimash/urlmap.git
cd urlmap
# Install dependencies
go mod download
# Run tests
go test ./...
# Run linting
go vet ./...
golangci-lint run
# Build for development
go build -o urlmap ./cmd/urlmap

See Architecture Documentation for detailed information about the codebase structure and design decisions.
urlmap is built on a small set of high-quality Go libraries, including goquery for HTML parsing and link extraction.
urlmap provides detailed statistics during and after crawling:
# Example output with statistics
urlmap --verbose https://example.com

Statistics include:
- Total URLs discovered
- Successfully crawled URLs
- Failed URLs with reasons
- Maximum depth reached
- Total crawling time
- Average response time
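Because many workers update these counters at the same time, they need to be concurrency-safe. A hypothetical sketch of such a tracker using sync/atomic (not urlmap's actual progress reporter):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Stats collects crawl counters that many goroutines update concurrently.
type Stats struct {
	Discovered atomic.Int64
	Crawled    atomic.Int64
	Failed     atomic.Int64
	start      time.Time
}

func main() {
	s := &Stats{start: time.Now()}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) { // simulated workers
			defer wg.Done()
			s.Discovered.Add(3)
			if i%4 == 0 {
				s.Failed.Add(1)
			} else {
				s.Crawled.Add(1)
			}
		}(i)
	}
	wg.Wait()

	fmt.Printf("discovered=%d crawled=%d failed=%d elapsed=%s\n",
		s.Discovered.Load(), s.Crawled.Load(), s.Failed.Load(), time.Since(s.start))
}
```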
- urlmap honors robots.txt (Disallow/Allow rules and Crawl-delay) when the --respect-robots flag is enabled
- Uses safe HTML parsing to prevent XSS in link extraction
- Validates all URLs to prevent malicious redirects (see the sketch after this list)
- Implements proper timeout handling to prevent hanging requests
- Rate limiting capabilities help prevent accidental DoS
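As a rough sketch of the kind of URL validation involved (illustrative only, not urlmap's exact rules), each extracted href can be resolved against the page it came from, restricted to http/https schemes, stripped of fragments, and checked against the crawl domain:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize resolves href against base and returns a cleaned absolute URL,
// or ok=false if the link should not be followed.
func normalize(base *url.URL, href string) (string, bool) {
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref)

	// Only plain web links; rejects javascript:, mailto:, data:, etc.
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	// Stay on the crawl domain.
	if !strings.EqualFold(abs.Host, base.Host) {
		return "", false
	}
	abs.Fragment = "" // #section anchors point to the same document
	return abs.String(), true
}

func main() {
	base, _ := url.Parse("https://example.com/products/")
	for _, href := range []string{"../about", "javascript:alert(1)", "https://evil.example.net/"} {
		u, ok := normalize(base, href)
		fmt.Printf("%-30q -> ok=%-5v %s\n", href, ok, u)
	}
}
```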
This project serves as a practical experiment in AI-driven software development. As part of this exploration, the entire codebase was implemented using Cursor AI agent, including:
- Project design and architecture
- Issue creation and project management
- Pull request creation and code reviews
- Implementation of all features and functionality
- Documentation and README creation
Important Note: There is not a single line of code written by a human in this repository. Everything was generated and managed by AI tools, demonstrating the current capabilities of AI-assisted development.
This project is licensed under the MIT License. See the LICENSE file for details.
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Discussions
- Security Issues: Please report them privately by email
Future enhancements planned:
- ✅ Robots.txt respect configuration (v0.4.0+)
- ✅ Custom output formats (JSON, CSV, XML) (Available now!)
- Plugin system for custom processing
- Distributed crawling support
- Web UI for monitoring large crawls
- Integration with popular data analysis tools