MyScanDocs

Automates the workflow of collecting scanned PDFs from Gmail, performing OCR, generating descriptive titles, and saving both the PDFs and Markdown transcripts locally.

🧩 Features

- Monitors Gmail for scanner emails with a configurable subject (default "Scan from HP")
- Downloads PDF attachments and stores them locally
- Runs pdftotext, falling back to tesseract OCR when needed
- Uses OpenAI (or compatible) models to generate clean filenames
- Saves the PDF alongside a Markdown transcript of the extracted text
- Applies a Gmail label (default SCANS_PROCESSED) to avoid reprocessing

📦 Prerequisites

- Python 3.10+
- uv for dependency and runtime management
- Google Cloud project with the Gmail API enabled and OAuth client credentials (Desktop)
- OpenAI API key if you want automatic naming (the script falls back to generic titles otherwise)
- OCR tooling (poppler provides pdftotext; the command below assumes macOS with Homebrew):
```
brew install poppler tesseract
```

⚙️ Setup

Clone the repository:

```
git clone https://github.com/yourname/myscandocs.git
cd myscandocs
```

Create a .env file next to main.py with any overrides you need:

```
cat <<'EOF' > .env
SCAN_DOCS_EMAIL_SUBJECT="Scanned Document"
SCAN_DOCS_SAVE_DIR=~/Documents/Scans
SCAN_DOCS_PROCESSED_LABEL=SCANS_PROCESSED
SCAN_DOCS_LLM_MODEL=gpt-4.1-mini
OPENAI_API_KEY=sk-your-openai-key
GOOGLE_CREDENTIALS_FILE=credentials.json
GOOGLE_TOKEN_FILE=token.json
EOF
```

Leave OPENAI_API_KEY empty to skip smart filenames.
Override SCAN_DOCS_GMAIL_QUERY if you need advanced filtering.
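
As a rough illustration of how these settings could fit together, the sketch below reads the variables from a mapping and derives a default query when SCAN_DOCS_GMAIL_QUERY is unset. The names `ScanConfig`, `load_config`, and `default_gmail_query` are hypothetical, and the exact query string that main.py builds is an assumption; only the Gmail search operators themselves (`subject:`, `has:attachment`, `filename:`, `-label:`) are standard.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ScanConfig:
    """Hypothetical container for the settings described above."""
    email_subject: str
    processed_label: str
    save_dir: Path
    gmail_query: str


def default_gmail_query(subject: str, processed_label: str) -> str:
    """Build a search string: matching subject, PDF attached, not yet labeled."""
    # Assumed shape of the default query; the real script may differ.
    return f'subject:"{subject}" has:attachment filename:pdf -label:{processed_label}'


def load_config(env=None) -> ScanConfig:
    """Read settings from an environment mapping, applying the documented defaults."""
    env = os.environ if env is None else env
    subject = env.get("SCAN_DOCS_EMAIL_SUBJECT") or "Scan from HP"
    label = env.get("SCAN_DOCS_PROCESSED_LABEL") or "SCANS_PROCESSED"
    save_dir = Path(env.get("SCAN_DOCS_SAVE_DIR") or "~/Documents/Scans").expanduser()
    # An explicit SCAN_DOCS_GMAIL_QUERY wins over the derived default.
    query = env.get("SCAN_DOCS_GMAIL_QUERY") or default_gmail_query(subject, label)
    return ScanConfig(subject, label, save_dir, query)
```

Passing a plain dictionary instead of `os.environ` makes it easy to check what a given set of overrides would produce before touching a real mailbox.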

Download your Google OAuth credentials and save them as credentials.json in the project root.

🚀 Usage

Authenticate with Google (first run only):
```
uv run python main.py
```
- A browser window will open for OAuth consent.
- The refresh token is stored in token.json (or the path set in GOOGLE_TOKEN_FILE).

Subsequent runs reuse the saved token:
```
uv run python main.py
```
- Output files land in SCAN_DOCS_SAVE_DIR (default ~/Documents/Scans).

🧠 How It Works

1. Queries Gmail for messages matching SCAN_DOCS_GMAIL_QUERY (derived from SCAN_DOCS_EMAIL_SUBJECT and SCAN_DOCS_PROCESSED_LABEL by default).
2. Downloads every PDF attachment and stores it temporarily.
3. Extracts text with pdftotext; if no text layer exists, uses tesseract.
4. Sends the extracted text to the configured OpenAI model to mint a descriptive filename.
5. Writes the PDF and a Markdown transcript (<title>.md) to the save directory.
6. Labels the original email so it is not processed again.
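
The extraction step (pdftotext first, tesseract as a fallback) can be sketched roughly as follows. This is a minimal sketch, not the code in main.py: the function names, the `min_chars` threshold for deciding that a text layer is missing, and the use of pdftoppm (also part of poppler) to rasterize pages before OCR are all assumptions.

```python
import subprocess
import tempfile
from pathlib import Path


def run_capture(cmd: list[str]) -> str:
    """Run a command and return its stdout as text, raising on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def extract_text(pdf_path: Path, run=run_capture, min_chars: int = 20) -> str:
    """Return the PDF's text layer, or OCR output when the layer looks empty."""
    # "-" tells pdftotext to write to stdout instead of a .txt file.
    text = run(["pdftotext", "-layout", str(pdf_path), "-"])
    if len(text.strip()) >= min_chars:
        return text

    # No usable text layer: rasterize each page, then OCR the images.
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        run(["pdftoppm", "-r", "300", "-png", str(pdf_path), prefix])
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(Path(tmp).glob("page*.png"))
        # "stdout" as the output base makes tesseract print the recognized text.
        return "\n".join(run(["tesseract", str(p), "stdout", "-l", "eng"]) for p in pages)
```

The `run` parameter exists only so the command runner can be swapped out, for example to dry-run the logic on a machine without poppler or tesseract installed.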

🪄 Automation Ideas

Cron job every 10 minutes:

```
*/10 * * * * cd /path/to/myscandocs && uv run python main.py
```

Combine with Hazel or macOS Shortcuts to watch the output folder and trigger downstream workflows.

🧰 Troubleshooting

- No Gmail results: Confirm the scanner is sending with the subject configured in SCAN_DOCS_EMAIL_SUBJECT, or adjust SCAN_DOCS_GMAIL_QUERY.
- OCR slow or blank: Make sure tesseract is installed and can find its English data (tesseract --list-langs should list eng). For scans in other languages, install the extra language packs (brew install tesseract-lang).
- LLM not naming files: Verify OPENAI_API_KEY is set and the chosen model is available to your account.
- Permission errors: Ensure the save directory exists and is writable.
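
For the last two items, it may help to see how a title could be turned into a pair of output files. The following is a sketch under stated assumptions rather than the script's actual logic: `safe_title` and `save_outputs` are hypothetical names, the `scan-<timestamp>` fallback is only a guess at what a "generic title" looks like, and the numeric suffix for duplicate titles is an illustrative choice.

```python
import re
from datetime import datetime
from pathlib import Path


def safe_title(suggested, now=None) -> str:
    """Turn a suggested title into a filesystem-safe stem, or fall back to a generic one."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "-", suggested or "").strip("-")[:80].strip("-")
    if cleaned:
        return cleaned
    # Assumed generic fallback used when no LLM title is available.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"scan-{stamp}"


def save_outputs(pdf_bytes: bytes, text: str, title: str, save_dir: Path) -> tuple[Path, Path]:
    """Write <title>.pdf and <title>.md side by side, creating the directory if needed."""
    save_dir = save_dir.expanduser()
    # Raises PermissionError here if the location is not writable.
    save_dir.mkdir(parents=True, exist_ok=True)
    stem, n = title, 1
    # Avoid overwriting an earlier scan that received the same title.
    while (save_dir / f"{stem}.pdf").exists():
        n += 1
        stem = f"{title}-{n}"
    pdf_path = save_dir / f"{stem}.pdf"
    md_path = save_dir / f"{stem}.md"
    pdf_path.write_bytes(pdf_bytes)
    md_path.write_text(f"# {title}\n\n{text}\n", encoding="utf-8")
    return pdf_path, md_path
```

Creating the directory with `parents=True, exist_ok=True` means a missing save directory is not an error in this sketch, so any remaining failure points to file permissions rather than a missing path.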
