
urlmap


A fast and efficient web crawler CLI tool for discovering and mapping URLs within a website. Built with Go for high performance and concurrent crawling.

🚀 Features

  • Recursive Link Discovery: Automatically discover all links within a website
  • Same-Domain Filtering: Focus crawling on a specific domain to avoid external links
  • Concurrent Processing: High-performance crawling with configurable worker pools
  • Depth Limiting: Control crawl depth to prevent infinite recursion
  • Progress Indicators: Real-time progress reporting during crawling operations
  • Rate Limiting: Respectful crawling with configurable request rates
  • Graceful Shutdown: Interrupt-safe with proper cleanup on termination
  • Structured Logging: Comprehensive logging with verbose mode support
  • Multiple Output Formats: text, JSON, CSV, and XML output; URLs go to stdout, logs to stderr
  • Custom User Agent: Configurable user agent strings for identification

📦 Installation

Binary Download

Download the latest binary from the releases page:

Linux (x86_64)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-linux-amd64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

Linux (ARM64)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-linux-arm64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

macOS (Intel)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-darwin-amd64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

macOS (Apple Silicon)

curl -L -o urlmap.tar.gz https://github.com/aoshimash/urlmap/releases/latest/download/urlmap-darwin-arm64.tar.gz
tar -xzf urlmap.tar.gz
chmod +x urlmap
sudo mv urlmap /usr/local/bin/

Windows

Download urlmap-windows-amd64.zip from the releases page and extract the executable.

Docker

Run with Docker without installation:

# Pull from GitHub Container Registry
docker pull ghcr.io/aoshimash/urlmap:latest

# Basic usage
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com

From Source

Requirements: Go 1.21 or higher

# Clone the repository
git clone https://github.com/aoshimash/urlmap.git
cd urlmap

# Build the application
go build -o urlmap ./cmd/urlmap

# Install globally (optional)
sudo mv urlmap /usr/local/bin/

🎯 Usage

Basic Usage

# Crawl a website with default settings
urlmap https://example.com

# Check version
urlmap version

# Get help
urlmap --help

Advanced Options

# Limit crawl depth to 3 levels
urlmap --depth 3 https://example.com

# Use 20 concurrent workers for faster crawling
urlmap --concurrent 20 https://example.com

# Enable verbose logging
urlmap --verbose https://example.com

# Custom user agent
urlmap --user-agent "MyBot/1.0" https://example.com

# Rate limiting (5 requests per second)
urlmap --rate-limit 5 https://example.com

# Disable progress indicators
urlmap --progress=false https://example.com

# Combined options
urlmap --depth 5 --concurrent 15 --verbose --rate-limit 2 https://example.com

Docker Usage

# Basic crawling
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com

# With options
docker run --rm ghcr.io/aoshimash/urlmap:latest --depth 3 --concurrent 20 https://example.com

# Save output to file
docker run --rm ghcr.io/aoshimash/urlmap:latest https://example.com > urls.txt

# Interactive mode with shell access
docker run -it --rm ghcr.io/aoshimash/urlmap:latest /bin/sh

🔧 Command Line Options

Flag          Short  Default          Description
--depth       -d     -1 (unlimited)   Maximum crawl depth
--concurrent  -c     10               Number of concurrent workers
--verbose     -v     false            Enable verbose logging
--user-agent  -u     urlmap/1.0.0     Custom User-Agent string
--progress    -p     true             Show progress indicators
--rate-limit  -r     0 (no limit)     Rate limit (requests per second)
--help        -h     -                Show help message

📋 Examples

Basic Website Crawling

# Crawl a simple website
urlmap https://example.com

Output:

https://example.com
https://example.com/about
https://example.com/contact
https://example.com/products

Depth-Limited Crawling

# Only crawl up to 2 levels deep
urlmap --depth 2 https://blog.example.com

High-Performance Crawling

# Use 50 concurrent workers for large sites
urlmap --concurrent 50 --verbose https://large-site.example.com

Respectful Crawling

# Limit to 1 request per second with custom user agent
urlmap --rate-limit 1 --user-agent "Research Bot 1.0 ([email protected])" https://example.com

Save Results to File

# Save URLs to a file
urlmap https://example.com > discovered_urls.txt

# Save with timestamps and logs
urlmap --verbose https://example.com > urls.txt 2> crawl.log

Processing Large Sites

# Optimized for large sites with progress tracking
urlmap --depth 5 --concurrent 30 --rate-limit 10 --verbose https://large-site.com

πŸ— Architecture

urlmap follows a modular architecture for maintainability and extensibility:

urlmap/
├── cmd/urlmap/          # CLI application entry point
├── internal/
│   ├── client/          # HTTP client with retry logic
│   ├── config/          # Configuration and logging setup
│   ├── crawler/         # Core crawling engine
│   ├── output/          # Output formatting and handling
│   ├── parser/          # HTML parsing and link extraction
│   ├── progress/        # Progress reporting and statistics
│   └── url/             # URL validation and normalization
└── pkg/utils/           # Public utilities

Core Components

  • Crawler Engine: Concurrent crawler with worker pool architecture
  • HTTP Client: Resilient HTTP client with timeout and retry logic
  • Link Parser: HTML parser using goquery for reliable link extraction
  • URL Manager: URL validation, normalization, and domain filtering
  • Progress Reporter: Real-time crawling statistics and progress tracking
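
The sketch below is a minimal, hypothetical illustration of how these components fit together: a breadth-first crawl processed one depth level at a time, a bounded worker pool, same-domain filtering and normalization with net/url, and link extraction with goquery. The crawl and extractLinks names are invented for illustration and are not urlmap's internal API.

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

// crawl walks start breadth-first, keeping only URLs on the same host.
// maxDepth < 0 means unlimited; workers caps the number of concurrent fetches.
func crawl(start string, maxDepth, workers int) []string {
    base, err := url.Parse(start)
    if err != nil {
        return nil
    }
    seen := map[string]bool{base.String(): true}
    order := []string{base.String()}
    frontier := []string{base.String()}

    for depth := 0; len(frontier) > 0 && (maxDepth < 0 || depth < maxDepth); depth++ {
        var (
            mu   sync.Mutex
            next []string
            wg   sync.WaitGroup
        )
        sem := make(chan struct{}, workers) // bounded worker pool
        for _, page := range frontier {
            wg.Add(1)
            sem <- struct{}{}
            go func(page string) {
                defer wg.Done()
                defer func() { <-sem }()
                for _, href := range extractLinks(page) {
                    u, err := base.Parse(href) // resolve relative links
                    if err != nil || u.Host != base.Host {
                        continue // same-domain filtering
                    }
                    u.Fragment = "" // normalize: drop #fragments
                    abs := u.String()
                    mu.Lock()
                    if !seen[abs] {
                        seen[abs] = true
                        order = append(order, abs)
                        next = append(next, abs)
                    }
                    mu.Unlock()
                }
            }(page)
        }
        wg.Wait()
        frontier = next
    }
    return order
}

// extractLinks fetches one page and collects every href with goquery.
func extractLinks(page string) []string {
    resp, err := http.Get(page)
    if err != nil {
        return nil
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil
    }
    var links []string
    doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            links = append(links, href)
        }
    })
    return links
}

func main() {
    for _, u := range crawl("https://example.com", 2, 10) {
        fmt.Println(u)
    }
}

Processing the frontier level by level keeps the depth accounting simple, which is one straightforward way to enforce a --depth style cutoff.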

⚡ Performance

urlmap is optimized for performance with the following characteristics:

Benchmarks

  • Small sites (< 100 pages): ~50-100 URLs/second
  • Medium sites (100-1000 pages): ~30-50 URLs/second
  • Large sites (> 1000 pages): ~20-30 URLs/second

Performance varies based on:

  • Network latency and bandwidth
  • Target server response times
  • Number of concurrent workers
  • Page complexity and size

Optimization Tips

  1. Concurrent Workers: Increase --concurrent for I/O-bound crawling
  2. Rate Limiting: Use --rate-limit to avoid overwhelming servers (see the sketch below)
  3. Depth Control: Set an appropriate --depth to avoid unbounded crawling
  4. Progress Tracking: Pass --progress=false for a slight performance gain
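
Tip 2 maps naturally onto a shared token-bucket limiter. The sketch below assumes golang.org/x/time/rate purely for illustration; whether urlmap uses that package internally is not documented here.

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // --rate-limit 5  =>  allow at most 5 requests per second overall
    limiter := rate.NewLimiter(rate.Limit(5), 1)

    for i := 0; i < 10; i++ {
        // Each worker waits for a token before issuing its next request.
        if err := limiter.Wait(context.Background()); err != nil {
            return
        }
        fmt.Println(time.Now().Format("15:04:05.000"), "request", i)
    }
}

With a limiter like this, --concurrent controls how many requests can be in flight at once, while --rate-limit caps how often new requests may start.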

Memory Usage

  • Base memory: ~10-20 MB
  • Per worker: ~1-2 MB
  • URL storage: ~100 bytes per URL
  • For 10,000 URLs: typically ~50-100 MB total
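
As a rough cross-check with the defaults above (10 workers): ~10-20 MB base + 10 × ~1-2 MB per worker + 10,000 × ~100 bytes comes to roughly 21-41 MB of accounted-for memory; the quoted 50-100 MB total presumably leaves headroom for Go runtime and garbage-collector overhead plus response bodies held by in-flight requests.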

πŸ” Troubleshooting

Common Issues

Permission Denied

# Error: permission denied
sudo chmod +x urlmap
# Or install to user directory
mv urlmap ~/.local/bin/

DNS Resolution Failures

# Test URL accessibility first
curl -I https://example.com

# Check DNS resolution
nslookup example.com

# Use verbose mode for debugging
urlmap --verbose https://example.com

Rate Limiting / 429 Errors

# Reduce concurrent workers and add rate limiting
urlmap --concurrent 5 --rate-limit 1 https://example.com

Memory Issues with Large Sites

# Reduce concurrent workers
urlmap --concurrent 5 --depth 3 https://large-site.com

# Monitor memory usage
urlmap --verbose https://example.com 2>&1 | grep -i memory

SSL/TLS Certificate Errors

# Check certificate validity
curl -I https://example.com

# For development/testing only (not recommended for production)
# Currently not configurable - urlmap validates all certificates

Advanced Features

Robots.txt Compliance

Respect robots.txt rules and crawl delays:

# Enable robots.txt respect (follows Disallow/Allow rules and Crawl-delay)
urlmap --respect-robots https://example.com

# Combined with other options
urlmap --respect-robots --verbose --depth 5 https://example.com

Output Formats

Choose from multiple output formats:

# JSON output
urlmap --output-format json https://example.com

# CSV output
urlmap --output-format csv https://example.com

# XML output
urlmap --output-format xml https://example.com

# Default text output (one URL per line)
urlmap --output-format text https://example.com

JavaScript Rendering

For websites that load content dynamically with JavaScript:

# Enable JavaScript rendering
urlmap --js-render https://spa-website.com

# Configure browser and timeout
urlmap --js-render --js-browser firefox --js-timeout 60s https://example.com

Debugging

Enable verbose logging to troubleshoot issues:

urlmap --verbose https://example.com 2> debug.log

Log levels include:

  • INFO: General crawling progress
  • DEBUG: Detailed URL processing
  • WARN: Non-fatal issues (failed URLs, timeouts)
  • ERROR: Fatal errors that stop crawling
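
As a rough idea of how the stdout/stderr split and the verbose (DEBUG) level can be wired up with Go 1.21's standard log/slog package; this is an illustrative sketch, not necessarily urlmap's actual logging setup:

package main

import (
    "fmt"
    "log/slog"
    "os"
)

func main() {
    verbose := true // corresponds to the --verbose flag

    level := slog.LevelInfo
    if verbose {
        level = slog.LevelDebug
    }
    // All log records go to stderr so that stdout carries only URLs.
    logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level}))

    logger.Info("starting crawl", "url", "https://example.com")
    logger.Debug("processing URL", "url", "https://example.com/about", "depth", 1)
    logger.Warn("request timed out", "url", "https://example.com/slow")

    // Discovered URLs are the only output printed to stdout.
    fmt.Println("https://example.com/about")
}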

Performance Issues

If crawling is slow:

  1. Check Network: Test direct access to the target site
  2. Adjust Workers: Try different --concurrent values (5-50)
  3. Monitor Rate Limits: Ensure you're not being throttled
  4. Use Rate Limiting: Add --rate-limit to be more respectful

# Performance testing command
time urlmap --depth 2 --concurrent 20 https://example.com > /dev/null

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Clone the repository
git clone https://github.com/aoshimash/urlmap.git
cd urlmap

# Install dependencies
go mod download

# Run tests
go test ./...

# Run linting
go vet ./...
golangci-lint run

# Build for development
go build -o urlmap ./cmd/urlmap

Project Structure

See Architecture Documentation for detailed information about the codebase structure and design decisions.

📚 Dependencies

urlmap uses the following high-quality Go libraries:

  • Cobra - Modern CLI framework
  • Resty - HTTP client library
  • goquery - jQuery-like HTML parsing
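
For illustration, a resilient client with the timeout and retry behavior described in the HTTP Client component can be configured with Resty roughly as below; the exact values urlmap uses are not documented here.

package main

import (
    "fmt"
    "time"

    "github.com/go-resty/resty/v2"
)

func main() {
    client := resty.New().
        SetTimeout(10 * time.Second).              // fail fast on hanging requests
        SetRetryCount(3).                          // retry transient failures
        SetRetryWaitTime(500 * time.Millisecond).  // initial backoff between retries
        SetRetryMaxWaitTime(5 * time.Second).      // cap on the backoff
        SetHeader("User-Agent", "urlmap/1.0.0")    // default user agent from the options table

    resp, err := client.R().Get("https://example.com")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    fmt.Println("status:", resp.StatusCode())
}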

📊 Monitoring and Statistics

urlmap provides detailed statistics during and after crawling:

# Example output with statistics
urlmap --verbose https://example.com

Statistics include:

  • Total URLs discovered
  • Successfully crawled URLs
  • Failed URLs with reasons
  • Maximum depth reached
  • Total crawling time
  • Average response time
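
A hypothetical sketch of what such a tracker can look like in Go; the actual internal/progress package may differ:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// Stats is a hypothetical tracker for the metrics listed above.
type Stats struct {
    Discovered atomic.Int64 // total URLs discovered
    Crawled    atomic.Int64 // successfully crawled URLs
    Failed     atomic.Int64 // failed URLs
    MaxDepth   atomic.Int64 // maximum depth reached
    start      time.Time
}

func NewStats() *Stats { return &Stats{start: time.Now()} }

// Elapsed reports total crawling time so far.
func (s *Stats) Elapsed() time.Duration { return time.Since(s.start) }

func main() {
    s := NewStats()
    s.Discovered.Add(4)
    s.Crawled.Add(3)
    s.Failed.Add(1)
    fmt.Printf("discovered=%d crawled=%d failed=%d elapsed=%s\n",
        s.Discovered.Load(), s.Crawled.Load(), s.Failed.Load(), s.Elapsed())
}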

🔒 Security Considerations

  • Optional robots.txt compliance via the --respect-robots flag (follows Disallow/Allow rules and Crawl-delay)
  • Uses safe HTML parsing to prevent XSS in link extraction
  • Validates all URLs to prevent malicious redirects
  • Implements proper timeout handling to prevent hanging requests
  • Rate limiting capabilities help prevent accidental DoS
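
As an illustration of the URL-validation point above, a hypothetical check that accepts only absolute http or https URLs with a host might look like the following; urlmap's actual rules may be stricter.

package main

import (
    "fmt"
    "net/url"
)

// isValidTarget rejects non-web schemes (javascript:, file:, mailto:, ...)
// and URLs without a host, which covers most malicious-redirect shapes.
func isValidTarget(raw string) bool {
    u, err := url.Parse(raw)
    if err != nil {
        return false
    }
    return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
    fmt.Println(isValidTarget("https://example.com/about")) // true
    fmt.Println(isValidTarget("javascript:alert(1)"))       // false
}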

🤖 AI-Driven Development

This project serves as a practical experiment in AI-driven software development. As part of this exploration, the entire codebase was implemented using the Cursor AI agent, including:

  • Project design and architecture
  • Issue creation and project management
  • Pull request creation and code reviews
  • Implementation of all features and functionality
  • Documentation and README creation

Important Note: There is not a single line of code written by a human in this repository. Everything was generated and managed by AI tools, demonstrating the current capabilities of AI-assisted development.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙋‍♀️ Support

🗺 Roadmap

Future enhancements planned:

  • ✅ Robots.txt respect configuration (v0.4.0+)
  • ✅ Custom output formats (JSON, CSV, XML) (Available now!)
  • Plugin system for custom processing
  • Distributed crawling support
  • Web UI for monitoring large crawls
  • Integration with popular data analysis tools
