Some Python to help find (and delete) duplicate files and similar images. Customised particularly to help me sort through many backups of photos collected over the years - some cropped and lower-quality copies and such.
Code for image duplicate detection is pinched from https://github.com/philipbl/duplicate-images - I didn't want the MongoDB store, but the pHashing was super useful.
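Roughly, the pHash trick looks like this - a sketch using the `imagehash` library the upstream project is built on, not this repo's exact code:

```python
from PIL import Image
import imagehash

def perceptual_hash(path):
    """A hash of what the image looks like, so resized or re-encoded
    copies still hash close to the original."""
    return imagehash.phash(Image.open(path))

# Subtracting two ImageHash values gives their Hamming distance:
# 0 means visually identical, small values mean "probably the same photo".
distance = perceptual_hash("a.jpg") - perceptual_hash("b.jpg")
print(f"Hamming distance: {distance}")
```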
Uses Poetry, so `poetry install` and then:

- `poetry run deduplicator --dir /path/to/search --output /file/to/save/results`
- `poetry run parse --input /path/to/image-results --output /path/to/save/output` - utility script to ask questions of the output
- For development, `poetry run pytest`
e.g.

```
poetry run deduplicator --dir mocks/ --output output/out.txt --images
```

Setting `--delete` will delete any EXACT MATCH (based on file hash) files it finds after the first. Using this option means how you organise the directory determines which original is kept - see the data setup below.
This option currently only applies to MD5 hash matching of files. The image fuzziness is still a bit too experimental for me to be confident in deleting :)
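For the curious, the exact-match detection boils down to grouping files by digest - a minimal sketch of the idea (illustrative names, not the repo's actual functions):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large photos aren't read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(root):
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[md5_of(path)].append(path)
    # Any digest shared by more than one path is a set of exact duplicates.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```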
Adding the `--images` flag will ignore non-image files, and attempt to compare images based on size, content, etc. It will still detect exact duplicates (as they should be a perfect match!).
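One way to do the "is this an image?" filter, if you're wondering - a sketch, not necessarily the check this repo uses:

```python
from PIL import Image

def is_image(path):
    """True if Pillow recognises the file as an image."""
    try:
        with Image.open(path) as img:
            img.verify()  # cheap integrity check, no full decode
        return True
    except Exception:
        return False
```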
If you run this against a large directory of images it will thrash your CPU for a while.
If you're getting lots of false matches, you can adjust the "fuzziness" of the image search via the `FUZZINESS` variable, currently hard-coded at the top of `./deduplicator/images.py`. I found 90% was letting too much through, and experimented before settling on 80.
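One plausible reading of that threshold, assuming fuzziness means percentage bit-similarity between two 64-bit pHashes (a sketch, not the exact code in `images.py`):

```python
FUZZINESS = 80  # percent; 90 let too many false matches through

def is_fuzzy_match(hash_a, hash_b, hash_bits=64):
    """hash_a and hash_b are imagehash.ImageHash values; subtracting
    them gives the Hamming distance between the two hashes."""
    similarity = 100 * (1 - (hash_a - hash_b) / hash_bits)
    return similarity >= FUZZINESS
```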
How you arrange the files in the one directory it searches affects how the results are displayed - they are listed in the order they are found, which is effectively alphabetical. To illustrate:
```
+-- backups
| +-- duplicate-1.jpg
+-- for-web
| +-- duplicate-2.jpg
+-- originals
| +-- duplicate.jpg
```

Because `backups` comes ahead of `for-web` and `originals`, the results will be listed as `backups/duplicate-1.jpg`, `for-web/duplicate-2.jpg`, `originals/duplicate.jpg`, and only the backup will be kept if you run with `--delete` set.
In this example, `originals` might hold the highest quality / correctly organised image that you want to keep, so some renaming would be a good idea:
```
+-- 01-originals
| +-- duplicate.jpg
+-- 02-backups
| +-- duplicate-1.jpg
+-- 03-for-web
| +-- duplicate-2.jpg
```

Now `01-originals/duplicate.jpg` will be kept, and the other two duplicates deleted. Happy days!
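To make the keep-first rule concrete, here's a sketch of it (illustrative only - not necessarily the repo's actual delete routine):

```python
from pathlib import Path

def keep_first(duplicate_groups, delete=False):
    """duplicate_groups maps a hash to the list of paths that share it."""
    for paths in duplicate_groups.values():
        ordered = sorted(str(p) for p in paths)  # 01-originals/... sorts first
        keeper, extras = ordered[0], ordered[1:]
        print(f"keeping {keeper}")
        for extra in extras:
            print(f"  duplicate: {extra}")
            if delete:
                Path(extra).unlink()
```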
Still to do:

- Fix the dockerised version and add instructions
- Finish test coverage
- Refactor once tests are in place - yield, list comprehensions, data structures
- Allow multiple paths?
- I regret ditching the DB to store things - on large directories, it takes a long time! Upserting would be better (see the sketch below the table)
| State | # Files | Execution Time |
|---|---|---|
| Before file dupe detection | 66,497 | N/A |
| After file dupe detection | 27,153 | ~12m30s |
| After delete 100% matches | 23,031 | |
| After delete 95% matches | 15,252 | (needed multiple runs; likely a bug) |
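On that upserting note, a rough sketch of what an SQLite-backed hash cache might look like - hypothetical schema, nothing like this exists in the repo yet:

```python
import sqlite3

conn = sqlite3.connect("hash-cache.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS hashes (
           path  TEXT PRIMARY KEY,
           mtime REAL,
           md5   TEXT
       )"""
)

def upsert_hash(path, mtime, md5):
    # ON CONFLICT upsert (SQLite 3.24+): re-runs update the existing row,
    # so unchanged files could be skipped instead of re-hashed.
    conn.execute(
        """INSERT INTO hashes (path, mtime, md5) VALUES (?, ?, ?)
           ON CONFLICT(path) DO UPDATE SET mtime = excluded.mtime,
                                           md5   = excluded.md5""",
        (path, mtime, md5),
    )
    conn.commit()
```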