Using file system for deduplication #21
liquidcarbon started this conversation in Show & Tell
You're processing messy incoming data into several tables. The data is bundled in a way that often includes some experiments you've already seen.
Sometimes, for any number of reasons, you're only able to write a partial set of tables.
Across millions of possible experiments, how do we most efficiently ensure that each one is processed completely, and only once? Skipping that check upfront is certain to cause issues down the line.
You'd normally reach for some kind of database, which usually means extra setup and upkeep.
A concise solution, no matter where your files live: use the file system as the database. Maintain a catalog of composite primary keys as empty objects, and simply check whether a file/object exists. That's one GET request, easy to pythonify with libraries like cloudpathlib. If the object is absent, do your processing, overwriting partial results as you go. If everything runs successfully, mark the experiment as complete by creating the object.
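Here's a minimal sketch of the pattern, assuming a hypothetical bucket and key layout (`s3://my-catalog/done/...`) and a placeholder `process_experiment` function standing in for the real pipeline; cloudpathlib's pathlib-style `exists()` and `touch()` do the rest:

```python
from cloudpathlib import CloudPath


def process_experiment(experiment_id: str, batch_id: str) -> None:
    """Placeholder for the actual pipeline that writes the tables."""
    ...


def process_if_new(experiment_id: str, batch_id: str) -> None:
    # Composite primary key encoded as an object path; the object itself is empty.
    # Bucket name and key layout here are made up for illustration.
    marker = CloudPath(f"s3://my-catalog/done/{experiment_id}/{batch_id}")

    if marker.exists():  # one GET request
        return  # already fully processed; skip

    # Absent marker: (re)process, overwriting any partial results from earlier runs.
    process_experiment(experiment_id, batch_id)

    # Only reached if processing succeeded: mark complete by creating the empty object.
    marker.touch()
```

Note that the marker is created only after everything succeeds, so a crash mid-run leaves no marker and the next run simply retries from scratch.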
Maybe this is a "DUH" sort of revelation, maybe you'll find it useful.
LinkedIn post