Automates the workflow of collecting scanned PDFs from Gmail, performing OCR, generating descriptive titles, and saving both the PDFs and Markdown transcripts locally.
- Monitors Gmail for scanner emails with a configurable subject (default "Scan from HP")
- Downloads PDF attachments and stores them locally
- Runs `pdftotext`, falling back to `tesseract` OCR when needed
- Uses OpenAI (or compatible) models to generate clean filenames
- Saves the PDF alongside a Markdown transcript of the extracted text
- Applies a Gmail label (default `SCANS_PROCESSED`) to avoid reprocessing
- Python 3.10+
- uv for dependency and runtime management
- Google Cloud project with the Gmail API enabled and OAuth client credentials (Desktop)
- OpenAI API key if you want automatic naming (the script falls back to generic titles otherwise)
- OCR tooling: `brew install poppler tesseract`
- Clone the repository:

      git clone https://github.com/yourname/myscandocs.git
      cd myscandocs

- Create a `.env` file next to `main.py` with any overrides you need:

      cat <<'EOF' > .env
      SCAN_DOCS_EMAIL_SUBJECT="Scanned Document"
      SCAN_DOCS_SAVE_DIR=~/Documents/Scans
      SCAN_DOCS_PROCESSED_LABEL=SCANS_PROCESSED
      SCAN_DOCS_LLM_MODEL=gpt-4.1-mini
      OPENAI_API_KEY=sk-your-openai-key
      GOOGLE_CREDENTIALS_FILE=credentials.json
      GOOGLE_TOKEN_FILE=token.json
      EOF

  - Leave `OPENAI_API_KEY` empty to skip smart filenames.
  - Override `SCAN_DOCS_GMAIL_QUERY` if you need advanced filtering.
- Download your Google OAuth credentials and save them as `credentials.json` in the project root.
- Authenticate with Google (first run only): `uv run python main.py`
  - A browser window will open for OAuth consent.
  - The refresh token is stored in `token.json`.
- Subsequent runs reuse the saved token: `uv run python main.py`
  - Output files land in `SCAN_DOCS_SAVE_DIR` (default `~/Documents/Scans`).
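
To make the `.env` overrides from the setup steps concrete, here is a minimal sketch of how the settings could be resolved. It is not taken from `main.py`: the names `Settings` and `load_settings` are hypothetical, the default query shape is an assumption built from standard Gmail search operators, and the variables are assumed to already be present in the process environment.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the resolved configuration."""
    email_subject: str
    save_dir: Path
    processed_label: str
    gmail_query: str
    openai_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the SCAN_DOCS_* variables, applying the documented defaults."""
    subject = env.get("SCAN_DOCS_EMAIL_SUBJECT", "Scan from HP")
    label = env.get("SCAN_DOCS_PROCESSED_LABEL", "SCANS_PROCESSED")
    save_dir = Path(env.get("SCAN_DOCS_SAVE_DIR", "~/Documents/Scans")).expanduser()

    # Assumed query shape; the real default derived by main.py may differ.
    default_query = f'subject:"{subject}" has:attachment filename:pdf -label:{label}'
    query = env.get("SCAN_DOCS_GMAIL_QUERY") or default_query

    # An empty key is treated the same as a missing one: skip smart filenames.
    api_key = env.get("OPENAI_API_KEY") or None

    return Settings(subject, save_dir, label, query, api_key)
```

Passing a plain dict instead of `os.environ` makes the defaults easy to inspect, for example `load_settings({"SCAN_DOCS_EMAIL_SUBJECT": "Scanned Document"}).gmail_query`.
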
- Queries Gmail for messages matching `SCAN_DOCS_GMAIL_QUERY` (derived from `SCAN_DOCS_EMAIL_SUBJECT` and `SCAN_DOCS_PROCESSED_LABEL` by default).
- Downloads every PDF attachment and stores it temporarily.
- Extracts text with `pdftotext`; if no text layer exists, uses `tesseract`.
- Sends the extracted text to the configured OpenAI model to mint a descriptive filename.
- Writes the PDF and a Markdown transcript (`<title>.md`) to the save directory.
- Labels the original email so it is not processed again.
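
The extraction and save steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `main.py`: all function names are hypothetical, rasterising pages with poppler's `pdftoppm` before OCR is an assumed approach (`tesseract` does not read PDFs directly), the filename rule in `safe_title` is made up for the example, and the title is expected to come from the caller (the model call is left out).

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_pdftotext(pdf: Path) -> str:
    """Return the embedded text layer; '-' sends pdftotext output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_tesseract(pdf: Path) -> str:
    """Rasterise pages with pdftoppm (assumed), then OCR each page image."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", str(pdf), f"{tmp}/page"],
            check=True,
        )
        pages = []
        for image in sorted(Path(tmp).glob("page*.png")):
            ocr = subprocess.run(
                ["tesseract", str(image), "stdout", "-l", "eng"],
                capture_output=True, text=True, check=True,
            )
            pages.append(ocr.stdout)
    return "\n".join(pages)


def extract_text(pdf: Path, text_layer=run_pdftotext, ocr=run_tesseract) -> str:
    """Prefer the text layer; fall back to OCR when it is empty."""
    text = text_layer(pdf)
    # Scanned PDFs often yield only whitespace or form feeds here.
    if text.strip():
        return text
    return ocr(pdf)


def safe_title(title: str, fallback: str = "scan") -> str:
    """Reduce a suggested title to a filesystem-friendly stem (assumed rule)."""
    stem = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    return stem or fallback


def save_outputs(pdf: Path, text: str, title: str, save_dir: Path) -> tuple[Path, Path]:
    """Copy the PDF and write a Markdown transcript under the same stem."""
    save_dir.mkdir(parents=True, exist_ok=True)
    stem = safe_title(title)
    pdf_out = save_dir / f"{stem}.pdf"
    md_out = save_dir / f"{stem}.md"
    shutil.copyfile(pdf, pdf_out)
    md_out.write_text(f"# {title}\n\n{text.strip()}\n", encoding="utf-8")
    return pdf_out, md_out
```

Taking the two extractors as parameters keeps the fallback decision testable without `pdftotext` or `tesseract` installed.
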
- Cron job every 10 minutes: `*/10 * * * * cd /path/to/myscandocs && uv run python main.py`
- Combine with Hazel or macOS Shortcuts to watch the output folder and trigger downstream workflows.
- No Gmail results: Confirm the scanner is sending with the subject configured in `SCAN_DOCS_EMAIL_SUBJECT` or adjust `SCAN_DOCS_GMAIL_QUERY`.
- OCR slow or blank: Make sure `tesseract` is installed with the `eng` language pack (`brew install tesseract-lang`).
- LLM not naming files: Verify `OPENAI_API_KEY` is set and the chosen model is available to your account.
- Permission errors: Ensure the save directory exists and is writable.
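
Several of the checks above can be run in one pass before a scheduled job starts. The sketch below is a hypothetical helper, not part of the project: `preflight` and its messages are invented for illustration, and it only inspects the local environment (it does not contact Gmail or OpenAI).

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Sequence


def preflight(
    save_dir: Path,
    tools: Sequence[str] = ("pdftotext", "tesseract"),
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return human-readable problems; an empty list means ready to run."""
    problems = []

    # OCR tooling must be discoverable on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # Without a key the script falls back to generic titles.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is empty; files will get generic titles")

    # The save directory must exist and accept writes.
    if not save_dir.is_dir():
        problems.append(f"{save_dir} does not exist")
    elif not os.access(save_dir, os.W_OK):
        problems.append(f"{save_dir} is not writable")

    return problems
```

Calling `preflight(Path("~/Documents/Scans").expanduser())` and printing each returned string gives a quick report before digging into logs.
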
MIT License © 2025 Pradeep Gowda