Using file system for deduplication #21
liquidcarbon started this conversation in Show & Tell
You're processing messy incoming data into several tables. The data is bundled in a way that often includes some experiments you've already seen.
Sometimes, for any number of reasons, you're only able to write a partial set of tables.
Across millions of possible experiments, how do we most efficiently ensure that each one is processed completely, and only once? Skipping that check upfront is certain to cause issues down the line.
You'd normally reach for some kind of database, which usually means extra setup and upkeep.
A concise solution, no matter where your files live: use the file system as the database. Maintain a catalog of composite primary keys as empty objects, and simply check whether a file/object exists. That's one GET request, easy to pythonify with libraries like cloudpathlib. If the object is absent, do your processing, overwriting partial results as you go. If everything runs successfully, mark the experiment as complete by creating the object.
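Here's a minimal sketch of the pattern, assuming a hypothetical bucket and key layout (`s3://my-catalog/done/...`) and a placeholder `process_experiment` function standing in for the real pipeline; cloudpathlib's pathlib-style `exists()` and `touch()` do the rest:

```python
from cloudpathlib import CloudPath


def process_experiment(experiment_id: str, batch_id: str) -> None:
    """Placeholder for the actual pipeline that writes the tables."""
    ...


def process_if_new(experiment_id: str, batch_id: str) -> None:
    # Composite primary key encoded as an object path; the object itself is empty.
    # Bucket name and key layout here are made up for illustration.
    marker = CloudPath(f"s3://my-catalog/done/{experiment_id}/{batch_id}")

    if marker.exists():  # one GET request
        return  # already fully processed; skip

    # Absent marker: (re)process, overwriting any partial results from earlier runs.
    process_experiment(experiment_id, batch_id)

    # Only reached if processing succeeded: mark complete by creating the empty object.
    marker.touch()
```

Note that the marker is created only after everything succeeds, so a crash mid-run leaves no marker and the next run simply retries from scratch.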
Maybe this is a "DUH" sort of revelation, maybe you'll find it useful.
LinkedIn post