Make cleaning adaptive to workload #13846
Replies: 3 comments 2 replies
Actually, the more costly part of cleaning is the file listing. If we could infer the files to be cleaned directly from the plan, that would be great. For example, only compaction and clustering yield legacy files, so we could check those plans to see which files have been replaced.
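The idea above can be sketched as follows. This is a minimal illustration, not Hudi's actual API: `ReplacePlan` is a hypothetical stand-in for a compaction/clustering plan entry that records which old file slices the operation replaced.

```java
import java.util.ArrayList;
import java.util.List;

public class InferCleanFilesExample {
    // Hypothetical stand-in for a compaction/clustering plan entry:
    // each entry records which old files were replaced by the operation.
    record ReplacePlan(String fileGroupId, List<String> replacedFiles) {}

    // Collect cleanable files straight from the plans, avoiding a full
    // file listing of the table (the costly part of cleaning).
    static List<String> inferFilesToClean(List<ReplacePlan> plans) {
        List<String> toClean = new ArrayList<>();
        for (ReplacePlan plan : plans) {
            toClean.addAll(plan.replacedFiles());
        }
        return toClean;
    }

    public static void main(String[] args) {
        List<ReplacePlan> plans = List.of(
            new ReplacePlan("fg-1", List.of("fg-1_v1.parquet")),
            new ReplacePlan("fg-2", List.of("fg-2_v1.parquet", "fg-2_v2.parquet")));
        // All three replaced files are cleanable without listing the table.
        System.out.println(inferFilesToClean(plans));
    }
}
```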
A similar change we made was to infer upsert/delete parallelism based on the input data partitions. It's low-hanging fruit to make clean parallelism inferred as well.
@xushiyan Can we focus this discussion with a simple problem-solution framing? What's currently broken? AFAIK, the cleaning parallelism is controlled by the number of files to delete as well as the cleaner parallelism write config.
Throughout the lifetime of a table, the number of files written by commits can vary a lot, e.g., a bulk insert followed by upserts/inserts, or traffic spikes. The cleaning process, whether inline or async, should adapt to the workload. For example, the parallelism can be dynamically inferred; currently, for execution, it is capped at the configured value.
This number can instead be inferred from the planned cleaning tasks. The cleaner utility could also anticipate the cleaning workload from the plan and warn when the configured memory is too low.
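A minimal sketch of the inference described above. `inferCleanParallelism` is a hypothetical helper, not an existing Hudi method: it caps the configured parallelism by the number of planned file deletions, so small clean plans don't spin up more tasks than there are files to delete.

```java
public class CleanParallelismExample {
    // Hypothetical helper: adapt clean parallelism to the plan size
    // instead of always using the configured value.
    static int inferCleanParallelism(int configuredParallelism, int plannedFileDeletes) {
        if (plannedFileDeletes <= 0) {
            return 1; // nothing to clean; a single no-op task is enough
        }
        return Math.min(configuredParallelism, plannedFileDeletes);
    }

    public static void main(String[] args) {
        // Configured parallelism 200, but the plan only deletes 12 files.
        System.out.println(inferCleanParallelism(200, 12));   // 12
        // Large plan: fall back to the full configured parallelism.
        System.out.println(inferCleanParallelism(200, 5000)); // 200
    }
}
```

The same plan statistics (file counts and sizes) could feed the memory warning mentioned above.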