
urlmap


A fast and efficient web crawler CLI tool for discovering and mapping URLs within a website. Built with Go for high performance and concurrent crawling.

🚀 Features

  • Recursive Link Discovery: Automatically discover all links within a website
  • Same-Domain Filtering: Focus crawling on a specific domain to avoid external links
  • Concurrent Processing: High-performance crawling with configurable worker pools
  • Depth Limiting: Control crawl depth to prevent infinite recursion
  • Progress Indicators: Real-time progress reporting during crawling operations
  • Rate Limiting: Respectful crawling with configurable request rates
  • Graceful Shutdown: Interrupt-safe with proper cleanup on termination
  • Structured Logging: Comprehensive logging with verbose mode support
  • Multiple Output Formats: text, JSON, CSV, and XML output; URLs go to stdout, logs to stderr
  • Custom User Agent: Configurable user agent strings for identification

📦 Installation

Binary Download

Download the latest binary from the releases page:

Linux (x86_64)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-linux-amd64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

Linux (ARM64)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-linux-arm64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

macOS (Intel)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-darwin-amd64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

macOS (Apple Silicon)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-darwin-arm64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

Windows

Download urlmap-windows-amd64.zip from the releases page and extract the executable.

Docker

Run with Docker without installation:

# Pull from GitHub Container Registry
docker pull ghcr.io/aoshimash/urlmap:latest

# Basic usage
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com

From Source

Requirements: Go 1.21 or higher

# Clone the repository
git clone https://github.com/aoshimash/urlmap.git
cd urlmap

# Build the application
go build -o urlmap ./cmd/urlmap

# Install globally (optional)
sudo mv urlmap /usr/local/bin/

🎯 Usage

Basic Usage

# Crawl a website with default settings
urlmap https://example.com

# Check version
urlmap version

# Get help
urlmap --help

Advanced Options

# Limit crawl depth to 3 levels
urlmap --depth 3 https://example.com

# Use 20 concurrent workers for faster crawling
urlmap --concurrent 20 https://example.com

# Enable verbose logging
urlmap --verbose https://example.com

# Custom user agent
urlmap --user-agent "MyBot/1.0" https://example.com

# Rate limiting (5 requests per second)
urlmap --rate-limit 5 https://example.com

# Disable progress indicators
urlmap --progress=false https://example.com

# Combined options
urlmap --depth 5 --concurrent 15 --verbose --rate-limit 2 https://example.com

Docker Usage

# Basic crawling
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com

# With options
docker run --rm ghcr.io/aoshimash/urlmap:latest --depth 3 --concurrent 20 https://example.com

# Save output to file
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com > urls.txt

# Interactive mode with shell access
docker run -it --rm ghcr.io/aoshimash/urlmap:latest /bin/sh

🔧 Command Line Options

Flag          Short  Default          Description
--depth       -d     -1 (unlimited)   Maximum crawl depth
--concurrent  -c     10               Number of concurrent workers
--verbose     -v     false            Enable verbose logging
--user-agent  -u     urlmap/1.0.0     Custom User-Agent string
--progress    -p     true             Show progress indicators
--rate-limit  -r     0 (no limit)     Rate limit (requests per second)
--help        -h     -                Show help message

📋 Examples

Basic Website Crawling

# Crawl a simple website
urlmap https://example.com

Output:

https://example.com
https://example.com/about
https://example.com/contact
https://example.com/products

Depth-Limited Crawling

# Only crawl up to 2 levels deep
urlmap --depth 2 https://blog.example.com

High-Performance Crawling

# Use 50 concurrent workers for large sites
urlmap --concurrent 50 --verbose https://large-site.example.com

Respectful Crawling

# Limit to 1 request per second with custom user agent
urlmap --rate-limit 1 --user-agent "Research Bot 1.0 ([email protected])" https://example.com

Save Results to File

# Save URLs to a file
urlmap https://example.com > discovered_urls.txt

# Save with timestamps and logs
urlmap --verbose https://example.com > urls.txt 2> crawl.log

Processing Large Sites

# Optimized for large sites with progress tracking
urlmap --depth 5 --concurrent 30 --rate-limit 10 --verbose https://large-site.com

πŸ— Architecture

urlmap follows a modular architecture for maintainability and extensibility:

urlmap/
├── cmd/urlmap/          # CLI application entry point
├── internal/
│   ├── client/          # HTTP client with retry logic
│   ├── config/          # Configuration and logging setup
│   ├── crawler/         # Core crawling engine
│   ├── output/          # Output formatting and handling
│   ├── parser/          # HTML parsing and link extraction
│   ├── progress/        # Progress reporting and statistics
│   └── url/             # URL validation and normalization
└── pkg/utils/           # Public utilities

Core Components

  • Crawler Engine: Concurrent crawler with worker pool architecture
  • HTTP Client: Resilient HTTP client with timeout and retry logic
  • Link Parser: HTML parser using goquery for reliable link extraction
  • URL Manager: URL validation, normalization, and domain filtering
  • Progress Reporter: Real-time crawling statistics and progress tracking
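
The sketch below is a minimal, hypothetical illustration of how these components fit together: a breadth-first crawl processed one depth level at a time, a bounded worker pool, same-domain filtering and normalization with net/url, and link extraction with goquery. The crawl and extractLinks names are invented for illustration and are not urlmap's internal API.

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

// crawl walks start breadth-first, keeping only URLs on the same host.
// maxDepth < 0 means unlimited; workers caps the number of concurrent fetches.
func crawl(start string, maxDepth, workers int) []string {
    base, err := url.Parse(start)
    if err != nil {
        return nil
    }
    seen := map[string]bool{base.String(): true}
    order := []string{base.String()}
    frontier := []string{base.String()}

    for depth := 0; len(frontier) > 0 && (maxDepth < 0 || depth < maxDepth); depth++ {
        var (
            mu   sync.Mutex
            next []string
            wg   sync.WaitGroup
        )
        sem := make(chan struct{}, workers) // bounded worker pool
        for _, page := range frontier {
            wg.Add(1)
            sem <- struct{}{}
            go func(page string) {
                defer wg.Done()
                defer func() { <-sem }()
                for _, href := range extractLinks(page) {
                    u, err := base.Parse(href) // resolve relative links
                    if err != nil || u.Host != base.Host {
                        continue // same-domain filtering
                    }
                    u.Fragment = "" // normalize: drop #fragments
                    abs := u.String()
                    mu.Lock()
                    if !seen[abs] {
                        seen[abs] = true
                        order = append(order, abs)
                        next = append(next, abs)
                    }
                    mu.Unlock()
                }
            }(page)
        }
        wg.Wait()
        frontier = next
    }
    return order
}

// extractLinks fetches one page and collects every href with goquery.
func extractLinks(page string) []string {
    resp, err := http.Get(page)
    if err != nil {
        return nil
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil
    }
    var links []string
    doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            links = append(links, href)
        }
    })
    return links
}

func main() {
    for _, u := range crawl("https://example.com", 2, 10) {
        fmt.Println(u)
    }
}

Processing the frontier level by level keeps the depth accounting simple, which is one straightforward way to enforce a --depth style cutoff.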

⚡ Performance

urlmap is optimized for performance with the following characteristics:

Benchmarks

  • Small sites (< 100 pages): ~50-100 URLs/second
  • Medium sites (100-1000 pages): ~30-50 URLs/second
  • Large sites (> 1000 pages): ~20-30 URLs/second

Performance varies based on:

  • Network latency and bandwidth
  • Target server response times
  • Number of concurrent workers
  • Page complexity and size

Optimization Tips

  1. Concurrent Workers: Increase --concurrent for I/O-bound crawling
  2. Rate Limiting: Use --rate-limit to avoid overwhelming servers (see the sketch below)
  3. Depth Control: Set an appropriate --depth to avoid unbounded crawling
  4. Progress Tracking: Pass --progress=false for a slight performance gain
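
Tip 2 maps naturally onto a shared token-bucket limiter. The sketch below assumes golang.org/x/time/rate purely for illustration; whether urlmap uses that package internally is not documented here.

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // --rate-limit 5  =>  allow at most 5 requests per second overall
    limiter := rate.NewLimiter(rate.Limit(5), 1)

    for i := 0; i < 10; i++ {
        // Each worker waits for a token before issuing its next request.
        if err := limiter.Wait(context.Background()); err != nil {
            return
        }
        fmt.Println(time.Now().Format("15:04:05.000"), "request", i)
    }
}

With a limiter like this, --concurrent controls how many requests can be in flight at once, while --rate-limit caps how often new requests may start.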

Memory Usage

  • Base memory: ~10-20 MB
  • Per worker: ~1-2 MB
  • URL storage: ~100 bytes per URL
  • For 10,000 URLs: typically ~50-100 MB total
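
As a rough cross-check with the defaults above (10 workers): ~10-20 MB base + 10 × ~1-2 MB per worker + 10,000 × ~100 bytes comes to roughly 21-41 MB of accounted-for memory; the quoted 50-100 MB total presumably leaves headroom for Go runtime and garbage-collector overhead plus response bodies held by in-flight requests.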

πŸ” Troubleshooting

Common Issues

Permission Denied

# Error: permission denied
sudo chmod +x urlmap
# Or install to user directory
mv urlmap ~/.local/bin/

DNS Resolution Failures

# Test URL accessibility first
curl -I https://example.com

# Check DNS resolution
nslookup example.com

# Use verbose mode for debugging
urlmap --verbose https://example.com

Rate Limiting / 429 Errors

# Reduce concurrent workers and add rate limiting
urlmap --concurrent 5 --rate-limit 1 https://example.com

Memory Issues with Large Sites

# Reduce concurrent workers
urlmap --concurrent 5 --depth 3 https://large-site.com

# Monitor memory usage
urlmap --verbose https://example.com 2>&1 | grep -i memory

SSL/TLS Certificate Errors

# Check certificate validity
curl -I https://example.com

# For development/testing only (not recommended for production)
# Currently not configurable - urlmap validates all certificates

Advanced Features

Robots.txt Compliance

Respect robots.txt rules and crawl delays:

# Enable robots.txt respect (follows Disallow/Allow rules and Crawl-delay)
urlmap --respect-robots https://example.com

# Combined with other options
urlmap --respect-robots --verbose --depth 5 https://example.com

Output Formats

Choose from multiple output formats:

# JSON output
urlmap --output-format json https://example.com

# CSV output
urlmap --output-format csv https://example.com

# XML output
urlmap --output-format xml https://example.com

# Default text output (one URL per line)
urlmap --output-format text https://example.com

JavaScript Rendering

For websites that load content dynamically with JavaScript:

# Enable JavaScript rendering
urlmap --js-render https://spa-website.com

# Configure browser and timeout
urlmap --js-render --js-browser firefox --js-timeout 60s https://example.com

Debugging

Enable verbose logging to troubleshoot issues:

urlmap --verbose https://example.com 2> debug.log

Log levels include:

  • INFO: General crawling progress
  • DEBUG: Detailed URL processing
  • WARN: Non-fatal issues (failed URLs, timeouts)
  • ERROR: Fatal errors that stop crawling
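
As a rough idea of how the stdout/stderr split and the verbose (DEBUG) level can be wired up with Go 1.21's standard log/slog package; this is an illustrative sketch, not necessarily urlmap's actual logging setup:

package main

import (
    "fmt"
    "log/slog"
    "os"
)

func main() {
    verbose := true // corresponds to the --verbose flag

    level := slog.LevelInfo
    if verbose {
        level = slog.LevelDebug
    }
    // All log records go to stderr so that stdout carries only URLs.
    logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level}))

    logger.Info("starting crawl", "url", "https://example.com")
    logger.Debug("processing URL", "url", "https://example.com/about", "depth", 1)
    logger.Warn("request timed out", "url", "https://example.com/slow")

    // Discovered URLs are the only output printed to stdout.
    fmt.Println("https://example.com/about")
}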

Performance Issues

If crawling is slow:

  1. Check Network: Test direct access to the target site
  2. Adjust Workers: Try different --concurrent values (5-50)
  3. Monitor Rate Limits: Ensure you're not being throttled
  4. Use Rate Limiting: Add --rate-limit to be more respectful

# Performance testing command
time urlmap --depth 2 --concurrent 20 https://example.com > /dev/null

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Clone the repository
git clone https://github.com/aoshimash/urlmap.git
cd urlmap

# Install dependencies
go mod download

# Run tests
go test ./...

# Run linting
go vet ./...
golangci-lint run

# Build for development
go build -o urlmap ./cmd/urlmap

Project Structure

See Architecture Documentation for detailed information about the codebase structure and design decisions.

📚 Dependencies

urlmap uses the following high-quality Go libraries:

  • Cobra - Modern CLI framework
  • Resty - HTTP client library
  • goquery - jQuery-like HTML parsing
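
For illustration, a resilient client with the timeout and retry behavior described in the HTTP Client component can be configured with Resty roughly as below; the exact values urlmap uses are not documented here.

package main

import (
    "fmt"
    "time"

    "github.com/go-resty/resty/v2"
)

func main() {
    client := resty.New().
        SetTimeout(10 * time.Second).              // fail fast on hanging requests
        SetRetryCount(3).                          // retry transient failures
        SetRetryWaitTime(500 * time.Millisecond).  // initial backoff between retries
        SetRetryMaxWaitTime(5 * time.Second).      // cap on the backoff
        SetHeader("User-Agent", "urlmap/1.0.0")    // default user agent from the options table

    resp, err := client.R().Get("https://example.com")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    fmt.Println("status:", resp.StatusCode())
}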

📊 Monitoring and Statistics

urlmap provides detailed statistics during and after crawling:

# Example output with statistics
urlmap --verbose https://example.com

Statistics include:

  • Total URLs discovered
  • Successfully crawled URLs
  • Failed URLs with reasons
  • Maximum depth reached
  • Total crawling time
  • Average response time
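
A hypothetical sketch of what such a tracker can look like in Go; the actual internal/progress package may differ:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// Stats is a hypothetical tracker for the metrics listed above.
type Stats struct {
    Discovered atomic.Int64 // total URLs discovered
    Crawled    atomic.Int64 // successfully crawled URLs
    Failed     atomic.Int64 // failed URLs
    MaxDepth   atomic.Int64 // maximum depth reached
    start      time.Time
}

func NewStats() *Stats { return &Stats{start: time.Now()} }

// Elapsed reports total crawling time so far.
func (s *Stats) Elapsed() time.Duration { return time.Since(s.start) }

func main() {
    s := NewStats()
    s.Discovered.Add(4)
    s.Crawled.Add(3)
    s.Failed.Add(1)
    fmt.Printf("discovered=%d crawled=%d failed=%d elapsed=%s\n",
        s.Discovered.Load(), s.Crawled.Load(), s.Failed.Load(), s.Elapsed())
}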

🔒 Security Considerations

  • Optional robots.txt compliance via the --respect-robots flag (follows Disallow/Allow rules and Crawl-delay)
  • Uses safe HTML parsing to prevent XSS in link extraction
  • Validates all URLs to prevent malicious redirects
  • Implements proper timeout handling to prevent hanging requests
  • Rate limiting capabilities help prevent accidental DoS
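
As an illustration of the URL-validation point above, a hypothetical check that accepts only absolute http or https URLs with a host might look like the following; urlmap's actual rules may be stricter.

package main

import (
    "fmt"
    "net/url"
)

// isValidTarget rejects non-web schemes (javascript:, file:, mailto:, ...)
// and URLs without a host, which covers most malicious-redirect shapes.
func isValidTarget(raw string) bool {
    u, err := url.Parse(raw)
    if err != nil {
        return false
    }
    return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
    fmt.Println(isValidTarget("https://example.com/about")) // true
    fmt.Println(isValidTarget("javascript:alert(1)"))       // false
}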

🤖 AI-Driven Development

This project serves as a practical experiment in AI-driven software development. As part of this exploration, the entire codebase was implemented using the Cursor AI agent, including:

  • Project design and architecture
  • Issue creation and project management
  • Pull request creation and code reviews
  • Implementation of all features and functionality
  • Documentation and README creation

Important Note: There is not a single line of code written by a human in this repository. Everything was generated and managed by AI tools, demonstrating the current capabilities of AI-assisted development.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙋‍♀️ Support

🗺 Roadmap

Future enhancements planned:

  • ✅ Robots.txt respect configuration (v0.4.0+)
  • ✅ Custom output formats (JSON, CSV, XML) (Available now!)
  • Plugin system for custom processing
  • Distributed crawling support
  • Web UI for monitoring large crawls
  • Integration with popular data analysis tools
