A comprehensive tool for discovering and comparing URLs across two websites. Perfect for content parity checks, broken link detection, SEO audits, and migration validation.
This tool discovers all public URLs on two sites (via sitemaps and/or crawling), probes each URL for HTTP status and metadata, and outputs a detailed CSV comparison showing status codes, redirects, response times, and more.
Primary use cases:
- Content parity audits: Find pages that exist on one site but not the other
- Migration validation: Verify redirects and ensure no content is lost
- SEO checks: Identify 404s, broken links, and redirect chains
- QA testing: Validate site health and response times
**✅ Comprehensive Discovery**
- Parse sitemaps (including nested and gzipped)
- Intelligent web crawling with depth control
- Respects robots.txt and nofollow
**✅ Detailed Probing**
- HTTP status codes (initial and final after redirects)
- Redirect chain tracking
- Response time measurement
- Content-type detection
- HTML metadata extraction (title, canonical URL; see the sketch below)
**✅ Smart Comparison**
- URL normalization for accurate matching
- Classification by comparison type (same, mismatch, redirect, error, etc.)
- Path-based comparison logic
**✅ Production Ready**
- Configurable concurrency and rate limiting
- Retry logic with exponential backoff
- Progress bars and detailed summaries
- Polite crawling practices
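The Detailed Probing feature above includes HTML metadata extraction. As a minimal illustration of how title and canonical-URL extraction can work using only the standard library's `html.parser` (the tool's actual parser may differ), consider:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Pull <title> text and <link rel="canonical"> href from an HTML page."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()

parser = MetaExtractor()
parser.feed('<html><head><title>Home</title>'
            '<link rel="canonical" href="https://example.com/"></head></html>')
print(parser.title, parser.canonical)  # Home https://example.com/
```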
```bash
# Clone the repository
cd url-compare

# Install dependencies
pip install -r requirements.txt
```

Requirements:

- Python 3.8+
- See `requirements.txt` for dependencies
```bash
python url-compare.py \
  --site-a=https://ansiblepilot.com \
  --site-b=https://ansiblebyexample.com
```

This will:

- Discover URLs from both sitemaps and crawling
- Probe all URLs
- Generate `urls-compare.csv` with comparison results
Create a `config.yaml` (see the provided template):

```yaml
site_a: "https://ansiblepilot.com"
site_b: "https://ansiblebyexample.com"
discovery: "both"
concurrency: 8
rate_limit_rps: 2
output: "urls-compare.csv"
```

Then run:

```bash
python url-compare.py --config=config.yaml
```

| Option | Default | Description |
|---|---|---|
| `--site-a` | required | URL of first site |
| `--site-b` | required | URL of second site |
| `--config` | `config.yaml` | Path to YAML config file |
| `--output` | `urls-compare.csv` | Output CSV file path |
| Option | Default | Description |
|---|---|---|
| `--discovery` | `both` | Discovery method: `sitemap`, `crawl`, or `both` |
| `--crawl-max-depth` | `2` | Maximum crawl depth from homepage |
| `--sitemaps` | auto | Override sitemap URLs (space-separated list) |
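Nested sitemap indexes and gzipped sitemaps are both handled during discovery. As a rough sketch of the technique, not the tool's exact internals, a recursive fetch might look like this (assuming the `requests` library):

```python
import gzip
import xml.etree.ElementTree as ET

import requests  # assumed HTTP dependency for this sketch

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def fetch_sitemap_urls(sitemap_url):
    """Recursively collect page URLs from a sitemap or sitemap index."""
    body = requests.get(sitemap_url, timeout=10).content
    if body[:2] == b"\x1f\x8b":          # gzip magic bytes => .xml.gz sitemap
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    urls = set()
    if root.tag == NS + "sitemapindex":  # nested index: recurse into children
        for loc in root.iter(NS + "loc"):
            urls |= fetch_sitemap_urls(loc.text.strip())
    else:                                # plain urlset: collect page URLs
        for loc in root.iter(NS + "loc"):
            urls.add(loc.text.strip())
    return urls
```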
| Option | Default | Description |
|---|---|---|
| `--concurrency` | `8` | Number of concurrent requests |
| `--rate-limit-rps` | `2.0` | Rate limit (requests/sec per host) |
| `--timeout-ms` | `10000` | Request timeout in milliseconds |
| `--max-redirects` | `5` | Maximum redirects to follow |
| `--retry` | `2` | Number of retries on network errors |
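To see how these options interact, here is a minimal sketch of concurrency-capped probing with retries and exponential backoff. It assumes `aiohttp` and omits the per-host rate limiting and redirect tracking the real prober performs:

```python
import asyncio
import aiohttp  # assumed dependency; the tool's actual HTTP stack may differ

async def probe(session, sem, url, retries=2, timeout_ms=10000):
    """Fetch one URL, retrying network errors with exponential backoff."""
    timeout = aiohttp.ClientTimeout(total=timeout_ms / 1000)
    for attempt in range(retries + 1):
        try:
            async with sem:  # --concurrency caps simultaneous requests
                async with session.get(url, timeout=timeout) as resp:
                    return url, resp.status
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == retries:
                return url, None           # retries exhausted
            await asyncio.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...

async def probe_all(urls, concurrency=8):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(probe(session, sem, u) for u in urls))
```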
| Option | Default | Description |
|---|---|---|
| `--include-query` | `false` | Include query strings in path keys |
| `--include-fragment` | `false` | Include fragments (`#`) in path keys |
| Option | Default | Description |
|---|---|---|
| `--follow-robots` | `true` | Respect robots.txt rules |
| `--user-agent` | `URLCompareBot/1.0` | User-agent string |
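Respecting robots.txt needs nothing beyond the standard library. A minimal sketch of the kind of check a crawler performs before fetching a URL (the tool's implementation may differ):

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent="URLCompareBot/1.0"):
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)
```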
Compare using sitemaps only:

```bash
python url-compare.py \
  --site-a=https://ansiblepilot.com \
  --site-b=https://ansiblebyexample.com \
  --discovery=sitemap \
  --output=sitemap-compare.csv
```

Crawl-only comparison with a deeper crawl and higher throughput:

```bash
python url-compare.py \
  --site-a=https://example.com \
  --site-b=https://example.org \
  --discovery=crawl \
  --crawl-max-depth=3 \
  --concurrency=16 \
  --rate-limit-rps=5
```

Treat query strings as part of the path keys:

```bash
python url-compare.py \
  --site-a=https://ansiblepilot.com \
  --site-b=https://ansiblebyexample.com \
  --include-query
```

The output CSV contains one row per unique path, with detailed comparison data:
- `path_key` - Normalized path used for comparison
- `present_on_a` / `present_on_b` - Whether the URL exists on each site
- `source_a` / `source_b` - Discovery source (sitemap, crawl, both)
- `initial_status_a` / `initial_status_b` - Initial HTTP status
- `final_status_a` / `final_status_b` - Final status after redirects
- `redirect_hops_a` / `redirect_hops_b` - Number of redirects
- `final_url_a` / `final_url_b` - Final URL after redirects
- `response_time_ms_a` / `response_time_ms_b` - Response time in milliseconds
- `canonical_url_a` / `canonical_url_b` - Canonical URL from the HTML
- `title_a` / `title_b` - Page title (if HTML)
- `comparison_class` - Classification of the comparison result
- `notes` - Error messages and warnings
The `comparison_class` column takes one of the following values:

- `same_status` - Both sites return the same status
- `a_only` - URL exists only on site A
- `b_only` - URL exists only on site B
- `status_mismatch` - Different status codes
- `redirect_both` - Both sites redirect
- `redirect_mismatch` - One site redirects, the other doesn't
- `error_a` - Site A returns a 5xx error
- `error_b` - Site B returns a 5xx error
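The classification can be thought of as a simple decision chain over the probed results. A hypothetical sketch, not the tool's exact code:

```python
def classify(present_a, present_b, final_a, final_b, hops_a, hops_b):
    """Map per-site probe results to a comparison_class (illustrative only)."""
    if present_a and not present_b:
        return "a_only"
    if present_b and not present_a:
        return "b_only"
    if final_a and final_a >= 500:
        return "error_a"
    if final_b and final_b >= 500:
        return "error_b"
    if hops_a > 0 and hops_b > 0:
        return "redirect_both"
    if (hops_a > 0) != (hops_b > 0):
        return "redirect_mismatch"
    if final_a == final_b:
        return "same_status"
    return "status_mismatch"
```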
Find pages unique to one site:

```bash
# Pages on A but not B
grep ",true,false," urls-compare.csv

# Pages on B but not A
grep ",false,true," urls-compare.csv
```

Find status mismatches:

```bash
# All mismatches
grep "status_mismatch" urls-compare.csv

# Pages that are 200 on A but 404 on B
awk -F',' '$7==200 && $14==404' urls-compare.csv
```

Find redirect mismatches:

```bash
# All redirect mismatches
grep "redirect_mismatch" urls-compare.csv
```

For spreadsheet analysis:

- Open the CSV in Excel or Google Sheets
- Apply filters to the header row
- Use pivot tables for aggregate analysis
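Because `grep` can also match commas inside quoted CSV fields, a short Python pass is more robust for aggregates; for example, counting rows per `comparison_class`:

```python
import csv
from collections import Counter

with open("urls-compare.csv", newline="") as f:
    counts = Counter(row["comparison_class"] for row in csv.DictReader(f))

for cls, n in counts.most_common():
    print(f"{cls}: {n}")
```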
The tool normalizes URLs to match paths across domains:
- ✅ Lowercase scheme and host
- ✅ Remove default ports (`:80`, `:443`)
- ✅ Strip trailing slashes (except the root `/`)
- ✅ Remove fragments (`#...`) by default
- ✅ Remove or sort query parameters
- ✅ Remove tracking parameters (`utm_*`, `fbclid`, etc.)
- ✅ Collapse duplicate slashes

This ensures `/page/` and `/page` are treated as the same path.
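As an illustration of these rules, here is a simplified sketch of a path-key function built on `urllib.parse`. It is not the actual `URLNormalizer` implementation, and the tracking-parameter list is assumed:

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode

TRACKING = re.compile(r"^(utm_|fbclid$|gclid$)")  # assumed tracking-param list

def path_key(url, include_query=False, include_fragment=False):
    """Reduce a URL to a normalized path key (simplified illustration)."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path) or "/"  # collapse duplicate slashes
    if len(path) > 1:
        path = path.rstrip("/")                      # strip trailing slash, keep root "/"
    key = path
    if include_query and parts.query:
        # Drop tracking parameters, then sort the rest for stable matching.
        params = sorted((k, v) for k, v in parse_qsl(parts.query)
                        if not TRACKING.match(k))
        if params:
            key += "?" + urlencode(params)
    if include_fragment and parts.fragment:
        key += "#" + parts.fragment
    return key
```

With the defaults, `/page/`, `//page`, and `/page?utm_source=x` all collapse to the same key, `/page`.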
Exit codes:

- `0` - Success, no issues found
- `1` - Partial success; some errors or mismatches were detected
- `2` - Fatal error (configuration, I/O, etc.)
Not supported:

- Authenticated/logged-in pages
- JavaScript-rendered content (no headless browser)
- Content diffing beyond title comparison
- Binary content comparison
Tips:

- Start with `--discovery=sitemap` for large sites
- Use `--rate-limit-rps=1` for small or slow servers
- Increase `--timeout-ms` for slow-responding sites
- Test with small crawl depths first
Troubleshooting:

- **Blocked by robots.txt**: the robots.txt file blocks access. Use `--follow-robots=false` to override (use responsibly).
- **Timeouts**: increase `--timeout-ms` or check site availability.
- **Too many redirects**: the site has a redirect loop (a browser would also fail).
- **Missing URLs**:
  - Check whether a sitemap exists at `/sitemap.xml`
  - Increase `--crawl-max-depth` for deeper crawling
  - Verify the sites are accessible
Project layout:

```
url-compare/
├── url-compare.py      # Main CLI entry point
├── url_normalizer.py   # URL normalization logic
├── discovery.py        # Sitemap & crawling
├── prober.py           # HTTP probing
├── comparator.py       # Comparison logic
├── csv_writer.py       # CSV output
├── config.yaml         # Configuration template
├── requirements.txt    # Python dependencies
└── README.md           # This file
```
The modules can also be used programmatically:

```python
import asyncio

from url_normalizer import URLNormalizer
from discovery import URLDiscoverer
from prober import URLProber
from comparator import URLComparator
from csv_writer import CSVWriter

config = {
    'site_a': 'https://example.com',
    'site_b': 'https://example.org',
    'discovery': 'both',
    'concurrency': 8,
    'rate_limit_rps': 2,
}

# Top-level await is not allowed in a script, so the pipeline runs
# inside an async main() driven by asyncio.run().
async def main():
    # Discovery
    discoverer = URLDiscoverer(config)
    urls_a = await discoverer.discover(config['site_a'], 'both')
    urls_b = await discoverer.discover(config['site_b'], 'both')

    # Probing
    prober = URLProber(config)
    results_a = await prober.probe_urls(urls_a.keys())
    results_b = await prober.probe_urls(urls_b.keys())

    # Comparison
    comparator = URLComparator(config)
    comparisons = comparator.compare(urls_a, results_a, urls_b, results_b)

    # Output
    CSVWriter.write_csv(comparisons, 'output.csv')
    CSVWriter.print_summary(comparisons)

asyncio.run(main())
```

Contributions welcome! Areas for improvement:
- Headless browser support for JS-rendered links
- Content diffing (HTML structure, text)
- Authentication support
- Database output option
- REST API interface
- Docker container
MIT License - See LICENSE file for details
For issues, questions, or feature requests, please open an issue on the repository.
Built according to PRD specifications for comparing ansiblepilot.com and ansiblebyexample.com.
Happy comparing! 🔍