btbytes/MyScanDocs

MyScanDocs

Automates the workflow of collecting scanned PDFs from Gmail, performing OCR, generating descriptive titles, and saving both the PDFs and Markdown transcripts locally.


🧩 Features

  • Monitors Gmail for scanner emails with a configurable subject (default "Scan from HP")
  • Downloads PDF attachments and stores them locally
  • Runs pdftotext, falling back to tesseract OCR when needed
  • Uses OpenAI (or compatible) models to generate clean filenames
  • Saves the PDF alongside a Markdown transcript of the extracted text
  • Applies a Gmail label (default SCANS_PROCESSED) to avoid reprocessing
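The README does not show how a model-suggested title becomes a filename, so the following is only a minimal sketch of one way to do it. The function name `safe_filename`, the `fallback` value, and the 80-character limit are illustrative assumptions, not taken from main.py.

```python
import re
import unicodedata


def safe_filename(title: str, fallback: str = "scan", max_len: int = 80) -> str:
    """Turn a free-form title into a filesystem-friendly stem (hypothetical helper)."""
    # Map accented characters to ASCII where possible and drop the rest.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of characters other than letters and digits into one hyphen.
    stem = re.sub(r"[^A-Za-z0-9]+", "-", ascii_title).strip("-")
    # Trim long titles without leaving a trailing hyphen.
    stem = stem[:max_len].rstrip("-")
    # An empty result (for example, a blank model reply) falls back to a generic title.
    return stem or fallback


if __name__ == "__main__":
    print(safe_filename("Water Bill: March 2025 / Final"))
```

Appending `.pdf` and `.md` to the same stem would keep the PDF and its transcript side by side in the save directory.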

📦 Prerequisites

  • Python 3.10+

  • uv for dependency and runtime management

  • Google Cloud project with the Gmail API enabled and OAuth client credentials (Desktop)

  • OpenAI API key if you want automatic naming (the script falls back to generic titles otherwise)

  • OCR tooling:

    brew install poppler tesseract
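
To confirm the OCR tooling is reachable before the first run, a quick check along these lines may help. It is a sketch only; `missing_tools` is a hypothetical helper and is not part of the project.

```python
import shutil


def missing_tools(tools=("pdftotext", "tesseract")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing OCR tools:", ", ".join(missing))
    else:
        print("All OCR tools found.")
```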

⚙️ Setup

  1. Clone the repository:

    git clone https://github.com/yourname/myscandocs.git
    cd myscandocs
  2. Create a .env file next to main.py with any overrides you need:

    cat <<'EOF' > .env
    SCAN_DOCS_EMAIL_SUBJECT="Scanned Document"
    SCAN_DOCS_SAVE_DIR=~/Documents/Scans
    SCAN_DOCS_PROCESSED_LABEL=SCANS_PROCESSED
    SCAN_DOCS_LLM_MODEL=gpt-4.1-mini
    OPENAI_API_KEY=sk-your-openai-key
    GOOGLE_CREDENTIALS_FILE=credentials.json
    GOOGLE_TOKEN_FILE=token.json
    EOF
    • Leave OPENAI_API_KEY empty to skip smart filenames.
    • Override SCAN_DOCS_GMAIL_QUERY if you need advanced filtering.
  3. Download your Google OAuth credentials and save them as credentials.json in the project root (or at the path set in GOOGLE_CREDENTIALS_FILE).
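
For orientation, the settings above could be read roughly as follows. This is a sketch under assumptions: the `Settings` class and `load_settings` function are hypothetical, it reads variables that are already in the environment (how main.py parses .env is not shown here), and the gpt-4.1-mini default is borrowed from the example file rather than a documented default.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    email_subject: str
    save_dir: Path
    processed_label: str
    llm_model: str
    openai_api_key: str


def load_settings(env=None) -> Settings:
    """Build settings from a mapping of environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    return Settings(
        # Defaults for subject, save directory, and label are the ones this README states.
        email_subject=env.get("SCAN_DOCS_EMAIL_SUBJECT", "Scan from HP"),
        # expanduser() resolves a leading "~", which a .env loader may leave as-is.
        save_dir=Path(env.get("SCAN_DOCS_SAVE_DIR", "~/Documents/Scans")).expanduser(),
        processed_label=env.get("SCAN_DOCS_PROCESSED_LABEL", "SCANS_PROCESSED"),
        # Assumed default, taken from the example .env above.
        llm_model=env.get("SCAN_DOCS_LLM_MODEL", "gpt-4.1-mini"),
        # An empty key means smart filenames are skipped.
        openai_api_key=env.get("OPENAI_API_KEY", ""),
    )
```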


🚀 Usage

  1. Authenticate with Google (first run only):

    uv run python main.py
    • A browser window will open for OAuth consent.
    • The refresh token is stored in token.json.
  2. Subsequent runs reuse the saved token:

    uv run python main.py
    • Output files land in SCAN_DOCS_SAVE_DIR (default ~/Documents/Scans).

🧠 How It Works

  • Queries Gmail for messages matching SCAN_DOCS_GMAIL_QUERY (derived from SCAN_DOCS_EMAIL_SUBJECT and SCAN_DOCS_PROCESSED_LABEL by default).
  • Downloads every PDF attachment and stores it temporarily.
  • Extracts text with pdftotext; if no text layer exists, uses tesseract.
  • Sends the extracted text to the configured OpenAI model to mint a descriptive filename.
  • Writes the PDF and a Markdown transcript (<title>.md) to the save directory.
  • Labels the original email so it is not processed again.
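
The text-extraction step above can be sketched as a simple fallback. This is an illustration, not the code in main.py: the function names, the `min_chars` threshold of 20, and the use of pdftoppm at 300 DPI to rasterize pages for tesseract are all assumptions.

```python
import subprocess
import tempfile
from pathlib import Path


def run_pdftotext(pdf_path):
    """Read the PDF's text layer; "-" sends pdftotext output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_tesseract(pdf_path):
    """OCR each page. Tesseract reads images, so pages are rasterized first."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # pdftoppm ships with Poppler and writes page-1.png, page-2.png, ...
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        for image in sorted(Path(tmp).glob("page*.png")):
            out = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(out.stdout)
    return "\n".join(pages)


def extract_text(pdf_path, text_layer=run_pdftotext, ocr=run_tesseract, min_chars=20):
    """Prefer the embedded text layer; fall back to OCR when it is (nearly) empty."""
    text = text_layer(pdf_path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr(pdf_path)
```

Passing the two extractors as parameters keeps the fallback decision testable without either binary installed.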

🪄 Automation Ideas

  • Cron job every 10 minutes:

    */10 * * * * cd /path/to/myscandocs && uv run python main.py
  • Combine with Hazel or macOS Shortcuts to watch the output folder and trigger downstream workflows.


🧰 Troubleshooting

  • No Gmail results: Confirm the scanner is sending with the subject configured in SCAN_DOCS_EMAIL_SUBJECT or adjust SCAN_DOCS_GMAIL_QUERY.
  • OCR slow or blank: Make sure tesseract is installed and that tesseract --list-langs includes eng; brew install tesseract-lang adds further language packs.
  • LLM not naming files: Verify OPENAI_API_KEY is set and the chosen model is available to your account.
  • Permission errors: Ensure the save directory exists and is writable.
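
When debugging the "No Gmail results" case, it can help to paste a candidate query into the Gmail search box. The exact default query is not spelled out in this README, so the sketch below is only a guess at its shape, built from standard Gmail search operators; `build_gmail_query` is a hypothetical name.

```python
def build_gmail_query(subject: str, processed_label: str) -> str:
    """Compose a Gmail search string for unprocessed scanner emails with PDFs."""
    # subject:, has:attachment, filename: and -label: are standard Gmail search operators.
    # Assumes the label name contains no spaces.
    return f'subject:"{subject}" has:attachment filename:pdf -label:{processed_label}'


if __name__ == "__main__":
    print(build_gmail_query("Scan from HP", "SCANS_PROCESSED"))
```

If the query returns the expected messages in the Gmail web UI, setting SCAN_DOCS_GMAIL_QUERY to the same string is a reasonable next step.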

📜 License

MIT License © 2025 Pradeep Gowda

