DocuScout

DocuScout is a documentation image scanner that helps you find and track documentation screenshots that need updating. It reads the pages listed in your documentation site's XML sitemap, runs Tesseract OCR on the images it finds, and matches your search terms against each image's text content, URL, and alt text. Downloaded pages and images are cached so that repeat runs are faster.

Features

Scans XML sitemaps for documentation pages
Performs OCR on images to find specific content
Searches image URLs and alt text
Smart caching system:
- Caches downloaded images for faster subsequent runs
- Caches page HTML with 24-hour expiry
- Shows cache status for both pages and images
Real-time progress tracking:
- Overall progress bar with completion estimates
- Current page being processed
- Current image being processed
- Processing speed in pages/second
Handles various image formats:
- PNG and JPG images, including RGBA (alpha-channel) images
- Converts palette images with transparency
- Skips SVG images automatically
Dockerized for easy setup and use
Supports multiple search terms
Test mode for quick validation:
- Randomly samples 5 pages
- Outputs results directly to screen
Outputs results to CSV (in non-test mode)

Prerequisites

Docker
Docker Compose

That's all you need! Everything else is handled by the Docker container.

Setup

Clone this repository:

git clone [repository-url]
cd docuscout

Make the script executable:

chmod +x docuscout

Optional: Make DocuScout Globally Accessible

You can make DocuScout accessible from anywhere on your system by following these steps:

First, get the absolute path to your DocuScout installation:

cd /path/to/docuscout  # Navigate to where you cloned the repository
DOCUSCOUT_PATH=$(pwd)/docuscout

Create a symbolic link in /usr/local/bin:

sudo ln -s "$DOCUSCOUT_PATH" /usr/local/bin/docuscout

Now you can run DocuScout from any directory using just:

docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Term1" "Term2"

Notes:

Make sure to use absolute paths when creating the symbolic link
If you move the DocuScout directory after creating the symbolic link, you'll need to update the link
You can check where the symlink points to with: ls -l /usr/local/bin/docuscout

Usage

Basic Usage

Run with required arguments:

./docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Term1" "Term2"

Or if installed globally:

docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Term1" "Term2"

Quick Testing

Test mode (randomly samples 5 pages and outputs results to screen):

docuscout --sitemap "https://docs.example.com/sitemap.xml" --search-terms "Metabox" "Settings" --test

Example test output:

Test Results:
================================================================================
page_url,image_url,matched_term
docs/setup-guide,https://example.com/images/settings.png,Settings
docs/metabox,https://example.com/images/meta-options.jpg,Metabox
================================================================================
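
The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DocuScout's actual code: the function names `page_urls_from_sitemap` and `sample_pages` are hypothetical, and the sketch assumes a standard sitemaps.org `<urlset>` document.

```python
import random
import xml.etree.ElementTree as ET
from typing import List, Optional

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def page_urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def sample_pages(urls: List[str], sample_size: int = 5, seed: Optional[int] = None) -> List[str]:
    """Pick up to sample_size distinct pages at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(urls, min(sample_size, len(urls)))


if __name__ == "__main__":
    example = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        + "".join(
            f"<url><loc>https://docs.example.com/page-{i}</loc></url>" for i in range(20)
        )
        + "</urlset>"
    )
    for url in sample_pages(page_urls_from_sitemap(example)):
        print(url)
```

Capping the sample at `len(urls)` keeps the sketch from failing on sitemaps with fewer than 5 pages.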

Advanced Usage

Multiple search terms:

docuscout --sitemap "https://example.com/sitemap.xml" --search-terms "Download Button" "Settings Panel" "Configuration"

With cache clearing:

docuscout --sitemap "https://docs.example.com/sitemap.xml" --search-terms "Metabox" "Settings" --clear-cache

With custom output file:

docuscout --sitemap "https://docs.example.com/sitemap.xml" --search-terms "Feature" "Setup" --output "my_results.csv"

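Full example with all options (note that test mode prints results to the screen, so the CSV named by --output is only written when --test is omitted):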

docuscout \
  --sitemap "https://docs.example.com/sitemap.xml" \
  --search-terms "Setup Guide" "Configuration" "API Settings" \
  --clear-cache \
  --output "documentation_audit.csv" \
  --test

Output Format

The tool generates a CSV file with the following columns:

Column	Description
page_url	URL of the page containing the image
image_url	Direct URL to the image
matched_term	The search term that matched

Example output:

page_url,image_url,matched_term
https://example.com/docs/setup,https://example.com/images/settings.png,Settings Panel
https://example.com/docs/config,https://example.com/images/config.jpg,Configuration
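
Because the output is plain CSV with a header row, it is easy to post-process. The following is a minimal sketch that groups matches by page using Python's standard `csv` module. The helper name `images_by_page` is hypothetical; only the three column names come from the format described above.

```python
import csv
import io
from collections import defaultdict


def images_by_page(csv_file):
    """Map each page_url to a list of (image_url, matched_term) pairs."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["page_url"]].append((row["image_url"], row["matched_term"]))
    return dict(grouped)


if __name__ == "__main__":
    # Inline sample data; in practice, pass an open results file instead.
    sample = io.StringIO(
        "page_url,image_url,matched_term\n"
        "https://example.com/docs/setup,https://example.com/images/settings.png,Settings Panel\n"
        "https://example.com/docs/config,https://example.com/images/config.jpg,Configuration\n"
    )
    for page, hits in images_by_page(sample).items():
        print(page, hits)
```

To read a real results file, open it with `open("my_results.csv", newline="")` and pass the file object to the function.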

Cache System

The tool maintains two separate caches, one for images and one for page HTML:

Image Cache

Location: cache/images/
Format: PNG files
Naming: MD5 hash of image URL
Persistence: Until manually cleared
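
To locate the cached copy of a particular image by hand, you can compute its expected file name. This sketch assumes the name is the hexadecimal MD5 digest of the UTF-8 encoded URL with a `.png` extension; the exact encoding and extension are assumptions, and `image_cache_path` is a hypothetical helper, not part of DocuScout.

```python
import hashlib
from pathlib import Path

# Cache directory as documented above.
IMAGE_CACHE_DIR = Path("cache/images")


def image_cache_path(image_url: str, cache_dir: Path = IMAGE_CACHE_DIR) -> Path:
    """Return the assumed cache file path for an image URL."""
    # Assumption: hex digest of the UTF-8 bytes of the URL, stored as .png.
    digest = hashlib.md5(image_url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.png"


if __name__ == "__main__":
    print(image_cache_path("https://example.com/images/settings.png"))
```

MD5 serves here only as a stable file-naming scheme, not as a security measure.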

Page Cache

Location: cache/pages/
Format: JSON files with content and timestamp
Expiry: 24 hours
Naming: MD5 hash of page URL
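
The 24-hour expiry can be illustrated with a short sketch. The JSON key names `content` and `timestamp`, the use of Unix timestamps, and the helper names are all assumptions made for illustration; the actual file layout may differ.

```python
import json
import time

# 24-hour expiry, as documented above.
PAGE_CACHE_TTL_SECONDS = 24 * 60 * 60


def make_entry(html, now=None):
    """Serialize page HTML with the time it was fetched (assumed key names)."""
    fetched_at = time.time() if now is None else now
    return json.dumps({"content": html, "timestamp": fetched_at})


def read_entry(raw, now=None, ttl=PAGE_CACHE_TTL_SECONDS):
    """Return the cached HTML, or None if the entry is older than ttl seconds."""
    entry = json.loads(raw)
    current = time.time() if now is None else now
    if current - entry["timestamp"] > ttl:
        return None  # Expired: the caller should download the page again.
    return entry["content"]


if __name__ == "__main__":
    raw = make_entry("<html>cached page</html>")
    print(read_entry(raw))
```

Passing `now` explicitly makes the expiry logic easy to test without waiting a day.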

Clear all caches:

docuscout --clear-cache

Progress Display

The tool shows real-time progress with three components:

Scanning X pages for matching images...
Progress: 45%|████████               | 45/541 [00:30<00:37,  1.47 pages/s]
Page:  docs/setup-guide (✓)
Image: wp-content/uploads/screenshot.png (✓)

Progress bar shows completion percentage and timing estimates
Page status shows current page being processed (✓ = cached, ↓ = downloading)
Image status shows current image being processed (✓ = cached, ↓ = downloading)

Troubleshooting

Common Issues

Cloudflare Protection
- Symptom: Receiving HTML instead of XML from sitemap
- Solutions:
  - Reduce request frequency
  - Contact site administrator for API access
  - Try accessing through a different endpoint
OCR Quality
- Symptom: Missing expected matches
- Solutions:
  - Ensure images are clear and readable
  - Try variations of search terms
  - Check image cache for corrupted files
Permission Issues
- Symptom: Cannot write to cache or output
- Solutions:
  - Check folder permissions
  - Ensure the docuscout script is executable (chmod +x docuscout)
  - Run with sudo if needed
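
To confirm the Cloudflare symptom described above, you can check whether a downloaded sitemap body is really sitemap XML or an HTML challenge page. This is a rough sketch using only the Python standard library; `looks_like_sitemap` is a hypothetical helper and not a DocuScout function.

```python
import xml.etree.ElementTree as ET


def looks_like_sitemap(body: str) -> bool:
    """True if body parses as XML with a <urlset> or <sitemapindex> root."""
    try:
        root = ET.fromstring(body.strip())
    except ET.ParseError:
        return False  # Not well-formed XML (typical for HTML pages).
    # ElementTree prefixes tags with "{namespace}"; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in {"urlset", "sitemapindex"}


if __name__ == "__main__":
    html = "<!DOCTYPE html><html><head><title>Just a moment...</title></head><body></body></html>"
    print(looks_like_sitemap(html))
```

Checking the root element, rather than only whether parsing succeeds, also catches HTML pages that happen to be well-formed XML.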

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

MIT License - feel free to use this tool for any purpose.

Acknowledgments

Uses Tesseract OCR for image text recognition
Built with Python and Docker
Inspired by the need to maintain up-to-date documentation screenshots


