Command-line tool to clean and organize Chrome bookmarks. It normalizes URLs, removes duplicates, assigns tags via rules, generates a change plan, and exports “clean” bookmarks to standard HTML with HTML/Markdown reports.
- Import from Chrome profile JSON or exported HTML (Netscape format).
- URL normalization: drop tracking params, strip “www”, remove fragments, collapse double slashes (see the sketch after this list).
- Deduplication: hard duplicates by normalized URL and soft duplicates by title similarity.
- Categorization:
  - Rule-based: domains and keywords map to tag lists.
  - AI (optional):
    - Embeddings with `sentence-transformers` to assign tags by semantic similarity.
    - External LLM (OpenAI-compatible) to label bookmarks using a prompt-guided schema.
- Export: generate `bookmarks.cleaned.html` suitable for Chrome import (never writes back to profile files).
- Reports: `report.html` and `report.md` with a summary and planned changes.
Note: link liveness checks and embeddings/LLM modes are architected but simplified/stubbed in this version.
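For illustration, here is a minimal sketch of the URL normalization rules listed above, using only the Python standard library. The function name and exact parameter handling are assumptions for this example, not the actual `cbclean` API.

```python
from fnmatch import fnmatch
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative defaults; they mirror the normalize settings in the config example below.
STRIP_PARAMS = ["utm_*", "gclid", "yclid", "fbclid", "ref", "ref_src"]

def normalize_url(url: str) -> str:
    """Drop tracking params, strip "www.", remove the fragment, collapse double slashes."""
    parts = urlsplit(url)
    # Keep only query parameters that do not match a strip pattern (e.g. utm_source).
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not any(fnmatch(k, pattern) for pattern in STRIP_PARAMS)]
    host = parts.netloc.lower().removeprefix("www.")
    # Collapse repeated slashes in the path while preserving a trailing slash.
    path = "/" + "/".join(segment for segment in parts.path.split("/") if segment)
    if parts.path.endswith("/") and path != "/":
        path += "/"
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))

print(normalize_url("https://www.Example.com//a//b/?utm_source=x&q=1#top"))
# -> https://example.com/a/b/?q=1
```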
- Prerequisites: Python 3.11+
- Install:

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e '.[dev]'
```

Using zsh? Quote the extras as shown: `pip install -e '.[dev]'`.
Fast alternative with uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv .venv
uv pip install -p .venv/bin/python -e '.[dev]'
```

- Run (demo config):

```bash
cbclean --config configs/config.local.yaml
```

Outputs are written to `out/`: `bookmarks.cleaned.html`, `report.html`, `report.md`.
- Example report (HTML): open `out/report.html` after a run.
- To include a static image in the repo, save a capture as `docs/screenshot-report.png` and GitHub will render it in this section.
Primary example: `configs/config.example.yaml`. Key fields:

```yaml
input:
  bookmarks_path: "~/.config/google-chrome/Default/Bookmarks"  # or data/samples/bookmarks.sample.json
  import_html: ""  # path to exported bookmarks HTML (if not reading JSON)
output:
  export_dir: "./out"
  report_formats: ["html", "md"]
normalize:
  strip_query_params: ["utm_*", "gclid", "yclid", "fbclid", "ref", "ref_src"]
  strip_fragments: true
  strip_www: true
dedup:
  title_similarity_threshold: 0.90
categorize:
  mode: "rules"  # rules | embeddings | llm
  rules_file: "./configs/rules.example.yaml"
apply:
  mode: "export_html"  # export_html | dry_run
network:
  enabled: false  # liveness is simplified and performs no requests
```

Rules example: `configs/rules.example.yaml` (domains/keywords → tags).
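To make the `dedup.title_similarity_threshold` setting concrete, here is a rough sketch of soft-duplicate detection by title similarity using the standard library's `difflib`. The function shape is an assumption; the shipped `dedup.py` may use a different similarity measure.

```python
from difflib import SequenceMatcher

TITLE_SIMILARITY_THRESHOLD = 0.90  # mirrors dedup.title_similarity_threshold above

def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] over case- and whitespace-normalized titles."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio()

def soft_duplicates(titles: list[str]) -> list[tuple[int, int]]:
    """Return index pairs whose titles are similar enough to flag for review."""
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if title_similarity(titles[i], titles[j]) >= TITLE_SIMILARITY_THRESHOLD:
                pairs.append((i, j))
    return pairs

print(soft_duplicates(["Python docs", "Python  Docs", "Rust book"]))  # [(0, 1)]
```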
- Import from JSON: set `input.bookmarks_path` to your Chrome profile `Bookmarks` file. The app only exports; it never edits profile files.
- Import from HTML: set `input.import_html` and `apply.mode: export_html`.
- Quick demo: use `configs/config.local.yaml` (points to `data/samples/bookmarks.sample.json`).
- AI categorization: install extras with `pip install -e '.[embed]'` (or `uv pip install -p .venv/bin/python -e '.[embed]'`), set `categorize.mode: embeddings`, and set `apply.group_by: tag`. The first run will download the model (e.g., `all-MiniLM-L6-v2`). See the sketch after this list.
- LLM categorization (OpenAI-compatible):
  - Install extras: `pip install -e '.[llm]'` (or with uv).
  - Set API key: `export OPENAI_API_KEY=sk-...` (or configure `categorize.llm.api_key_env`).
  - In config: set `categorize.mode: llm`, optionally provide `categorize.llm.labels` to guide categories, and set `apply.group_by: tag-all`.
  - Controls: `categorize.llm.batch_size` (default 30), `temperature` (default 0.0), `only_uncertain` (classify only items without tags).
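The embeddings mode boils down to cosine similarity between bookmark text and tag descriptions. Here is a minimal sketch with `sentence-transformers`; the tag descriptions, threshold, and function shape are illustrative assumptions, not the real `classify_embed.py` interface.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloaded on first use

# Illustrative tag descriptions; the real tags come from your rules/config.
TAGS = {
    "dev": "programming, software development, code, APIs",
    "news": "news articles, journalism, current events",
    "docs": "reference manuals, official documentation, tutorials",
}

def assign_tags(title: str, url: str, threshold: float = 0.35) -> list[str]:
    """Return tags whose description is semantically close to the bookmark text."""
    text_emb = model.encode(f"{title} {url}", convert_to_tensor=True)
    tag_embs = model.encode(list(TAGS.values()), convert_to_tensor=True)
    scores = util.cos_sim(text_emb, tag_embs)[0]
    return [tag for tag, score in zip(TAGS, scores) if float(score) >= threshold]

print(assign_tags("Requests: HTTP for Humans", "https://docs.python-requests.org/"))
```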
Pipeline: import → normalize → dedup → classify → plan → apply → report.
- `chrome_reader.py` — read from JSON/HTML.
- `normalize.py` — URL normalization.
- `dedup.py` — hard and soft duplicates.
- `classify_rules.py` — tag assignment via YAML rules.
- `classify_embed.py` — AI tagging via sentence-transformers (optional dependency). Falls back to a no-op if not installed.
- `fetch.py` — liveness stub for the MVP.
- `propose.py` — change plan generation.
- `apply.py` — HTML export (no writes to the Chrome profile).
- `report.py` — reports via Jinja2 templates (`templates/`).
- `storage.py` — SQLite cache for future network checks.
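A rough sketch of how these modules might be wired together; the module names come from the list above, but the function names, signatures, and config shape are assumptions, not the real interfaces.

```python
# Hypothetical orchestration of the documented pipeline:
# import -> normalize -> dedup -> classify -> plan -> apply -> report.
from cbclean import apply, chrome_reader, classify_rules, dedup, normalize, propose, report

def run(config: dict) -> None:
    bookmarks = chrome_reader.read(config["input"])                 # JSON or HTML import
    bookmarks = [normalize.normalize(b, config["normalize"]) for b in bookmarks]
    bookmarks, duplicates = dedup.find_duplicates(bookmarks, config["dedup"])
    tagged = classify_rules.tag(bookmarks, config["categorize"]["rules_file"])
    plan = propose.build_plan(tagged, duplicates)                   # change plan, not yet applied
    apply.export_html(plan, config["output"]["export_dir"])         # never touches the Chrome profile
    report.write(plan, config["output"])                            # report.html / report.md
```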
Repository Layout
- `src/cbclean/` — package and CLI (`cbclean`).
- `configs/` — example configs and tagging rules.
- `data/samples/` — sample inputs.
- `templates/` — report templates.
- `tests/` — pytest suite.
- Tests: `pytest -q` (or `uv run pytest -q`); coverage: `pytest --cov=cbclean --cov-report=term-missing`.
- Lint/format: `ruff check .`, `black .` (or `uv run ...`).
- Types: `mypy src/cbclean` (or `uv run mypy src/cbclean`).
- Dev install: `pip install -e '.[dev]'`.

Contributor guidelines: see `AGENTS.md`.
- The tool never modifies Chrome profile files; changes are exported to `out/`.
- The current version does not perform network liveness checks.
- Fully offline operation is supported (rules, normalization, deduplication, reporting).
- LLM mode sends bookmark titles/URLs to the configured provider; review provider policies and redact sensitive data as needed.
- zsh “no matches found: .[dev]”: quote the extras — `pip install -e '.[dev]'`.
- IDE can’t import `cbclean`: mark `src/` as Sources Root or set `PYTHONPATH=src`. This repo includes `tests/conftest.py` and `.pylintrc` to help IDEs (see the sketch below).
- Empty JSON import: verify `input.bookmarks_path` and file permissions.
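If you need to reproduce the `src/` import shim yourself (the repo already ships one), a typical `tests/conftest.py` looks roughly like this:

```python
# tests/conftest.py: make the src/ layout importable without installing the package.
import sys
from pathlib import Path

SRC = Path(__file__).resolve().parent.parent / "src"
if str(SRC) not in sys.path:
    sys.path.insert(0, str(SRC))
```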
- Async liveness checks (aiohttp) with SQLite cache.
- Embeddings-based and LLM-assisted categorization.
- Rich dead-link handling (Wayback, move/delete strategies).
- Additional performance metrics and detailed reports.
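As a preview of the planned liveness checks, here is a rough `aiohttp` sketch; the request strategy, timeouts, and caching are assumptions, and the shipped `fetch.py` remains a stub.

```python
import asyncio
import aiohttp

async def check_alive(session: aiohttp.ClientSession, url: str) -> tuple[str, int | None]:
    """Return (url, HTTP status) or (url, None) if the request failed or timed out."""
    try:
        async with session.head(url, allow_redirects=True) as resp:
            return url, resp.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, None

async def check_all(urls: list[str]) -> list[tuple[str, int | None]]:
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(check_alive(session, u) for u in urls))

print(asyncio.run(check_all(["https://example.com", "https://example.com/missing"])))
```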
Note about badges:
- Replace `USER_OR_ORG` in the CI and Codecov badge URLs with your actual GitHub org/user once the repo is pushed to GitHub. The Codecov badge requires the Codecov app or a `CODECOV_TOKEN` secret in the repo settings.
- Chrome Bookmark Manager:
  - Pros: built-in UI, quick manual edits.
  - Cons: no URL normalization, limited dedup, no reproducible rules or reports.
- Browser cleanup/dedup extensions:
  - Pros: convenient, visual.
  - Cons: require browser permissions, may modify the profile directly, limited offline reproducibility.
- Cloud services (e.g., online bookmark managers):
  - Pros: sharing, search, rich metadata.
  - Cons: require an account/network, may not preserve Chrome HTML import/export fidelity, not deterministic offline.
- This tool:
  - Pros: offline-first, reproducible configs, export-only (safe), CLI-friendly automation, HTML/MD reports.
  - Cons: CLI-based; advanced features (liveness, embeddings/LLM) are stubs in the MVP.