A fast, in-memory finder for duplicate and similar images, built on perceptual-hash bucketing and Numba-accelerated comparison.
- Zero on-disk output: everything runs in RAM
- Exact & “similar” mode (custom MSE threshold)
- Perceptual-hash + histogram pre-bucketing to prune comparisons
- Numba-JIT mean-squared-error with early bailout
- Thread-pooled image loading & feature extraction
- CLI and Python API
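To make the "MSE with early bailout" idea concrete, here is a minimal sketch in plain Python. It is not the project's actual implementation: the function name `mse_with_bailout` and its signature are hypothetical, and it works on flat sequences of pixel values rather than image arrays. The explicit loop is the shape one would typically hand to Numba's `@njit` (with NumPy arrays as inputs), but that decorator is left out here so the example runs without extra dependencies.

```python
from typing import Sequence


def mse_with_bailout(a: Sequence[float], b: Sequence[float], threshold: float) -> float:
    """Return the MSE of two equal-length pixel sequences.

    Hypothetical helper: returns float("inf") as soon as the running sum of
    squared differences proves the final MSE must exceed `threshold`.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    if n == 0:
        return 0.0

    # MSE <= threshold is equivalent to: sum of squared diffs <= threshold * n.
    budget = threshold * n
    total = 0.0
    for x, y in zip(a, b):
        d = x - y
        total += d * d
        if total > budget:
            # The remaining pixels can only add to the sum, so stop early.
            return float("inf")
    return total / n


print(mse_with_bailout([0, 0, 0, 0], [0, 0, 0, 0], threshold=0.0))  # 0.0
print(mse_with_bailout([0, 0, 0, 0], [2, 0, 0, 0], threshold=1.0))  # 1.0
print(mse_with_bailout([0, 0, 0, 0], [3, 0, 0, 0], threshold=1.0))  # inf
```

With a threshold of 0.0 the loop stops at the first differing pixel, which is why an exact-duplicate check can reject most non-matching pairs after reading only a few values.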
Installation
pip install difpy2
Requires Python ≥ 3.12.
Quickstart
CLI
difpy2 \
  -D /path/to/images \
  --px_size 50 \
  --bins 8 \
  --sim 0.0  # exact duplicates only; use >0 for “similar” mode
Options
-D, --dirs … one or more image directories
-r, --recursive … recurse into subfolders
-px, --px_size … resize images to px×px for comparison
-b, --bins … per-channel histogram buckets
-s, --sim … MSE threshold (0.0 = exact only)
-t, --threads … number of worker threads
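The role of `--bins` is easier to see with a small sketch of histogram pre-bucketing. What follows is an illustration under stated assumptions, not the tool's real code: `bucket_key` and `candidate_pairs` are hypothetical names, images are represented as plain lists of `(R, G, B)` tuples (as if already resized to px×px), and the bucket key is simply the most populated histogram bin per channel. The actual signature used by difpy2 may differ.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def bucket_key(pixels: Sequence[Pixel], bins: int = 8) -> Tuple[int, ...]:
    """Coarse signature: index of the most populated histogram bin per channel."""
    width = 256 / bins
    key = []
    for channel in range(3):
        counts = [0] * bins
        for p in pixels:
            counts[min(int(p[channel] / width), bins - 1)] += 1
        key.append(counts.index(max(counts)))
    return tuple(key)


def candidate_pairs(images: Dict[str, Sequence[Pixel]], bins: int = 8) -> List[Tuple[str, str]]:
    """Group images by bucket key and only pair up images within a bucket."""
    buckets = defaultdict(list)
    for name, pixels in images.items():
        buckets[bucket_key(pixels, bins)].append(name)

    pairs: List[Tuple[str, str]] = []
    for names in buckets.values():
        pairs.extend(combinations(sorted(names), 2))
    return pairs


# Hypothetical 2x2 images: two dark ones and one bright one.
images = {
    "a.png": [(10, 10, 10)] * 4,
    "b.png": [(12, 11, 9)] * 4,
    "c.png": [(250, 250, 250)] * 4,
}
print(candidate_pairs(images))  # [('a.png', 'b.png')]
```

Only pairs that share a bucket go on to the more expensive MSE comparison. In this sketch, more bins means smaller buckets and fewer comparisons, at the cost that two similar images near a bin boundary can land in different buckets and never be compared.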
Python API
from difpy2 import DuplicateFinder
finder = DuplicateFinder(
    directories=["/path/to/images"],
    px_size=50,
    hist_bins=8,
    similarity=0.0,  # exact duplicates
    threads=4,
)
results, lower_quality, stats = finder.run()
# results: { primary_image_path: [[duplicate_path, mse], …], … }
# lower_quality: [all duplicate/similar image paths]
# stats: { total_files, featurized, groups, duration_s }
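As a rough illustration of how the `results` mapping might be consumed, the sketch below summarizes each group. It relies only on the shape described in the comments above (`{primary_image_path: [[duplicate_path, mse], …]}`). The helper `summarize` and the sample paths are hypothetical, and nothing here calls difpy2 itself.

```python
from typing import Dict, List, Tuple, Union

# Shape taken from the comments above: primary path -> [[duplicate_path, mse], ...]
Results = Dict[str, List[List[Union[str, float]]]]


def summarize(results: Results) -> List[Tuple[str, int, float]]:
    """Return (primary, match_count, worst_mse) per group, largest groups first."""
    rows = []
    for primary, matches in results.items():
        worst = max((float(mse) for _path, mse in matches), default=0.0)
        rows.append((primary, len(matches), worst))
    # Sort by descending match count, then by path for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical data in the documented shape.
sample: Results = {
    "/imgs/b.png": [["/imgs/b_copy.png", 0.0]],
    "/imgs/a.jpg": [["/imgs/a_copy.jpg", 0.0], ["/imgs/a_small.jpg", 12.5]],
}

for primary, count, worst in summarize(sample):
    print(f"{primary}: {count} match(es), worst MSE {worst}")
```

A summary like this can help decide which groups to review by hand before deleting anything listed in `lower_quality`.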
Project Layout
difpy2/
├── difpy_opt.py # core implementation
├── README.md
├── LICENSE.txt
├── pyproject.toml
└── …
Contributing
1. Fork the repo.
2. Create a feature branch.
3. Run the tests and linters.
4. Submit a pull request.
License
This project is licensed under the MIT License. See LICENSE.txt.