Make cleaning adaptive to workload #13846
Replies: 3 comments 2 replies
Actually, the more costly part of cleaning is the file listing. If we could infer the files to be cleaned directly from the plan, that would be great. For example, only compaction and clustering yield legacy files, so we could check those plans to see which files have been replaced.
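The idea above can be sketched as follows. This is a minimal illustration, not Hudi's actual API: `ReplacePlan` is a hypothetical stand-in for a compaction/clustering plan entry that records which old file slices the operation replaced.

```java
import java.util.ArrayList;
import java.util.List;

public class InferCleanFilesExample {
    // Hypothetical stand-in for a compaction/clustering plan entry:
    // each entry records which old files were replaced by the operation.
    record ReplacePlan(String fileGroupId, List<String> replacedFiles) {}

    // Collect cleanable files straight from the plans, avoiding a full
    // file listing of the table (the costly part of cleaning).
    static List<String> inferFilesToClean(List<ReplacePlan> plans) {
        List<String> toClean = new ArrayList<>();
        for (ReplacePlan plan : plans) {
            toClean.addAll(plan.replacedFiles());
        }
        return toClean;
    }

    public static void main(String[] args) {
        List<ReplacePlan> plans = List.of(
            new ReplacePlan("fg-1", List.of("fg-1_v1.parquet")),
            new ReplacePlan("fg-2", List.of("fg-2_v1.parquet", "fg-2_v2.parquet")));
        // All three replaced files are cleanable without listing the table.
        System.out.println(inferFilesToClean(plans));
    }
}
```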
A similar change we made was to infer upsert/delete parallelism based on the input data partitions. It's low-hanging fruit to make clean parallelism inferred as well.
@xushiyan Can we focus this discussion with a simple problem-solution framing? What's currently broken? AFAIK, the cleaning parallelism is controlled by the number of files to delete as well as the cleaner parallelism write config.
Throughout the lifetime of a table, the number of files written by commits can vary a lot, e.g., a bulk insert followed by upserts/inserts, or traffic spikes. The cleaning process, whether inline or async, should adapt to the workload. For example, the parallelism can be dynamically inferred; currently, for execution, it is capped at the configured value.
This number can instead be inferred from the planned cleaning tasks. The cleaner utility could also anticipate the cleaning workload from the plan and warn when the configured memory is too low.
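A minimal sketch of the inference described above. `inferCleanParallelism` is a hypothetical helper, not an existing Hudi method: it caps the configured parallelism by the number of planned file deletions, so small clean plans don't spin up more tasks than there are files to delete.

```java
public class CleanParallelismExample {
    // Hypothetical helper: adapt clean parallelism to the plan size
    // instead of always using the configured value.
    static int inferCleanParallelism(int configuredParallelism, int plannedFileDeletes) {
        if (plannedFileDeletes <= 0) {
            return 1; // nothing to clean; a single no-op task is enough
        }
        return Math.min(configuredParallelism, plannedFileDeletes);
    }

    public static void main(String[] args) {
        // Configured parallelism 200, but the plan only deletes 12 files.
        System.out.println(inferCleanParallelism(200, 12));   // 12
        // Large plan: fall back to the full configured parallelism.
        System.out.println(inferCleanParallelism(200, 5000)); // 200
    }
}
```

The same plan statistics (file counts and sizes) could feed the memory warning mentioned above.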