DocuScout is a documentation image scanner that helps you find and track documentation screenshots that need updating. It scans your documentation website, runs OCR on each image, caches pages and images between runs, and identifies images by their content, URLs, and alt text.
- Scans XML sitemaps for documentation pages
- Performs OCR on images to find specific content
- Searches image URLs and alt text
- Smart caching system:
  - Caches downloaded images for faster subsequent runs
  - Caches page HTML with 24-hour expiry
  - Shows cache status for both pages and images
- Real-time progress tracking:
  - Overall progress bar with completion estimates
  - Current page being processed
  - Current image being processed
  - Processing speed in pages/second
- Handles various image formats:
  - PNG, JPG, RGBA images
  - Converts palette images with transparency
  - Skips SVG images automatically
- Dockerized for easy setup and use
- Supports multiple search terms
- Test mode for quick validation:
  - Randomly samples 5 pages
  - Outputs results directly to screen
- Outputs results to CSV (in non-test mode)
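The URL and alt-text search listed above can be pictured as a simple substring check. The sketch below is only an illustration under stated assumptions: the helper names `match_terms` and `is_svg`, the case-insensitive comparison, and the extension-based SVG check are hypothetical and not taken from DocuScout's source.

```python
def match_terms(image_url, alt_text, search_terms):
    """Return the search terms that occur in the image URL or alt text.

    Hypothetical helper: case-insensitive substring matching is an
    assumption, not confirmed DocuScout behaviour.
    """
    haystack = f"{image_url} {alt_text or ''}".lower()
    return [term for term in search_terms if term.lower() in haystack]


def is_svg(image_url):
    """Detect SVG images by file extension so they can be skipped.

    Assumption: the query string and fragment are ignored and only the
    path suffix is inspected.
    """
    path = image_url.split("?", 1)[0].split("#", 1)[0]
    return path.lower().endswith(".svg")
```

An image whose URL or alt text contains none of the terms would still go through OCR, since the text may only appear inside the picture itself.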
- Docker
- Docker Compose
That's all you need! Everything else is handled by the Docker container.
- Clone this repository:
git clone [repository-url]
cd docuscout
- Make the script executable:
chmod +x docuscout

You can make DocuScout accessible from anywhere on your system by following these steps:
- First, get the absolute path to your DocuScout installation:
cd /path/to/docuscout # Navigate to where you cloned the repository
DOCUSCOUT_PATH=$(pwd)/docuscout
- Create a symbolic link in /usr/local/bin:
sudo ln -s "$DOCUSCOUT_PATH" /usr/local/bin/docuscout

Now you can run DocuScout from any directory using just:
docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Term1" "Term2"

Notes:
- Make sure to use absolute paths when creating the symbolic link
- If you move the DocuScout directory after creating the symbolic link, you'll need to update the link
- You can check where the symlink points to with:
ls -l /usr/local/bin/docuscout
Run with required arguments:
./docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Term1" "Term2"

Or if installed globally:
docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Term1" "Term2"

Test mode (randomly samples 5 pages and outputs results to screen):
docuscout --sitemap "https://docs.example.com/sitemap.xml" --search-terms "Metabox" "Settings" --test

Example test output:
Test Results:
================================================================================
page_url,image_url,matched_term
docs/setup-guide,https://example.com/images/settings.png,Settings
docs/metabox,https://example.com/images/meta-options.jpg,Metabox
================================================================================
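The random sampling that test mode performs can be sketched in a few lines. This is a minimal illustration, not DocuScout's actual code: the function name `sample_pages` and the optional `seed` parameter are hypothetical, and only the sample size of 5 comes from the description above.

```python
import random


def sample_pages(page_urls, sample_size=5, seed=None):
    """Pick up to `sample_size` distinct pages at random.

    Hypothetical helper; `seed` exists only to make a run repeatable.
    """
    pages = list(page_urls)
    rng = random.Random(seed)
    # min() guards against sitemaps with fewer pages than the sample size.
    return rng.sample(pages, min(sample_size, len(pages)))
```

Sampling without replacement means no page is scanned twice in a test run, and a small sitemap is simply scanned in full.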
Multiple search terms:
docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Download Button" "Settings Panel" "Configuration"

With cache clearing:
docuscout --sitemap "https://docs.example.com/sitemap.xml" --search-terms "Metabox" "Settings" --clear-cache

With custom output file:
docuscout --sitemap "https://docs.example.com/sitemap.xml" --search-terms "Feature" "Setup" --output "my_results.csv"

Full example with all options:
docuscout \
--sitemap "https://docs.example.com/sitemap.xml" \
--search-terms "Setup Guide" "Configuration" "API Settings" \
--clear-cache \
--output "documentation_audit.csv" \
--test

The tool generates a CSV file with the following columns:
| Column | Description |
|---|---|
| page_url | URL of the page containing the image |
| image_url | Direct URL to the image |
| matched_term | The search term that matched |
Example output:
page_url,image_url,matched_term
https://example.com/docs/setup,https://example.com/images/settings.png,Settings Panel
https://example.com/docs/config,https://example.com/images/config.jpg,Configuration
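Because the output is plain CSV with a header row, it is easy to post-process. The following sketch groups matches by page using Python's standard `csv` module; the function name `images_by_page` is hypothetical, and the only things taken from the README are the three column names.

```python
import csv
from collections import defaultdict


def images_by_page(csv_path):
    """Group (image_url, matched_term) pairs by the page they appear on."""
    grouped = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader uses the header row (page_url,image_url,matched_term) as keys.
        for row in csv.DictReader(handle):
            grouped[row["page_url"]].append((row["image_url"], row["matched_term"]))
    return dict(grouped)
```

For example, `images_by_page("my_results.csv")` would return a dictionary keyed by page URL, which is a convenient shape for building a per-page checklist of screenshots to retake.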
The tool uses a two-level caching system:
- Image cache:
  - Location: cache/images/
  - Format: PNG files
  - Naming: MD5 hash of image URL
  - Persistence: Until manually cleared
- Page cache:
  - Location: cache/pages/
  - Format: JSON files with content and timestamp
  - Expiry: 24 hours
  - Naming: MD5 hash of page URL
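The cache layout described above can be sketched as follows. This is a sketch under assumptions rather than DocuScout's implementation: the function names, the `.png` and `.json` file extensions, and the JSON keys `content` and `timestamp` are hypothetical. Only the directories, the MD5-of-URL naming, and the 24-hour page expiry come from the README.

```python
import hashlib
import json
import time
from pathlib import Path

PAGE_CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour expiry for cached pages


def cache_key(url):
    """MD5 hex digest of the URL, used as the cache file name."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def image_cache_path(url, cache_root="cache"):
    # Assumed extension: images are stored as PNG files.
    return Path(cache_root) / "images" / f"{cache_key(url)}.png"


def page_cache_path(url, cache_root="cache"):
    return Path(cache_root) / "pages" / f"{cache_key(url)}.json"


def store_page(url, content, cache_root="cache", now=None):
    """Write page HTML together with the time it was fetched."""
    path = page_cache_path(url, cache_root)
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"content": content, "timestamp": time.time() if now is None else now}
    path.write_text(json.dumps(entry), encoding="utf-8")


def load_cached_page(url, cache_root="cache", now=None):
    """Return cached HTML, or None if it is missing or older than 24 hours."""
    path = page_cache_path(url, cache_root)
    if not path.exists():
        return None
    entry = json.loads(path.read_text(encoding="utf-8"))
    now = time.time() if now is None else now
    if now - entry["timestamp"] > PAGE_CACHE_TTL_SECONDS:
        return None
    return entry["content"]
```

Hashing the URL keeps file names short and filesystem-safe regardless of what characters the URL contains. Images have no expiry check here because the image cache persists until it is cleared manually.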
Clear all caches:
docuscout --clear-cache

The tool shows real-time progress with three components:
Scanning X pages for matching images...
Progress: 45%|████████ | 45/541 [00:30<00:37, 1.47 pages/s]
Page: docs/setup-guide (✓)
Image: wp-content/uploads/screenshot.png (✓)
- Progress bar shows completion percentage and timing estimates
- Page status shows current page being processed (✓ = cached, ↓ = downloading)
- Image status shows current image being processed (✓ = cached, ↓ = downloading)
- Cloudflare Protection
  - Symptom: Receiving HTML instead of XML from sitemap
  - Solutions:
    - Reduce request frequency
    - Contact site administrator for API access
    - Try accessing through a different endpoint
- OCR Quality
  - Symptom: Missing expected matches
  - Solutions:
    - Ensure images are clear and readable
    - Try variations of search terms
    - Check image cache for corrupted files
- Permission Issues
  - Symptom: Cannot write to cache or output
  - Solutions:
    - Check folder permissions
    - Ensure run.sh is executable
    - Run with sudo if needed
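To confirm the "HTML instead of XML" symptom yourself, you can check whether a downloaded sitemap body actually parses as a sitemap. The sketch below uses only Python's standard library; the function name `parse_sitemap` is hypothetical and this is not how DocuScout itself necessarily parses sitemaps. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(body):
    """Return the <loc> URLs from a sitemap, or raise ValueError.

    A ValueError usually means the server returned something other than
    a sitemap, for example an HTML challenge page.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        raise ValueError("response is not well-formed XML") from exc
    if root.tag not in (f"{SITEMAP_NS}urlset", f"{SITEMAP_NS}sitemapindex"):
        raise ValueError(f"unexpected root element {root.tag!r}; expected a sitemap")
    # Collect every <loc> entry, trimming surrounding whitespace.
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Checking the root element as well as well-formedness matters because some HTML pages happen to be valid XML and would otherwise yield an empty URL list without any error.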
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
MIT License - feel free to use this tool for any purpose.
- Uses Tesseract OCR for image text recognition
- Built with Python and Docker
- Inspired by the need to maintain up-to-date documentation screenshots