@tavplubix
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Some replication queue entries might hang for temporary_directories_lifetime (1 day by default) with an error such as "Directory tmp_merge_<part_name> already exists" or "Part ... (state Deleting) already exists, but it will be deleted soon". It's fixed. Fixes #29616.

Detailed description / Documentation draft:
Probably fixes #31843.
It happens because of race condition on part removal:

  1. We are trying to commit some merged part to zk, but fail for some reason (another race condition or connection loss)
  2. This part becomes Outdated (with is_temp == false, directory name is just part name), cleanup thread will try to remove it.
  3. Another thread tries to execute the log entry again and creates a temporary part (with is_temp == true and directory name tmp_merge_<part_name>), but fails to add it to the active set due to Part ... (state Deleting) already exists
  4. It tries to remove the temporary part (in dtor), but fails to rename it to tmp_delete_<part_name>, because that directory already exists: the cleanup thread is removing the Outdated part with the same name.
  5. It ignores the exception, exits the dtor and leaves the tmp_merge_<part_name> directory on disk. Nobody (except clearOldTemporaryDirectories) will retry the removal, because the part is temporary and the MergeTreeDataPart object does not exist anymore.
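
Steps 4 and 5 can be reproduced deterministically without threads. The following is a minimal sketch, not the actual ClickHouse code: `TempPart`, `reproduceLeftover` and the part name `all_1_2_1` are hypothetical names introduced for illustration, and the cleanup thread is simulated by a pre-existing, non-empty tmp_delete_<part_name> directory.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical stand-in for a temporary data part (not the real MergeTreeDataPart).
struct TempPart
{
    fs::path root;
    std::string part_name;

    TempPart(fs::path root_, std::string part_name_)
        : root(std::move(root_)), part_name(std::move(part_name_))
    {
        assert(!part_name.empty());
        fs::create_directories(root / ("tmp_merge_" + part_name));
    }

    ~TempPart()
    {
        // Step 4: rename tmp_merge_<part_name> to tmp_delete_<part_name> before removing it.
        std::error_code ec;
        fs::rename(root / ("tmp_merge_" + part_name), root / ("tmp_delete_" + part_name), ec);
        // Step 5: the error is swallowed, so the tmp_merge_ directory stays on disk.
        if (ec)
            return;
        fs::remove_all(root / ("tmp_delete_" + part_name), ec);
    }
};

// Returns true if tmp_merge_<part_name> is left behind after the destructor ran.
bool reproduceLeftover(const fs::path & root)
{
    const std::string part_name = "all_1_2_1"; // hypothetical part name
    fs::remove_all(root);

    // Simulate the cleanup thread: it has already renamed the Outdated part to
    // tmp_delete_<part_name> and has not finished deleting its files yet.
    const fs::path being_deleted = root / ("tmp_delete_" + part_name);
    fs::create_directories(being_deleted);
    std::ofstream(being_deleted / "data.bin") << "x";

    {
        TempPart part(root, part_name); // step 3: temporary part created, then dropped
    }

    return fs::exists(root / ("tmp_merge_" + part_name));
}
```

Calling `reproduceLeftover` on a scratch directory should report the leftover, because renaming onto a non-empty directory fails and the destructor has nobody to report the failure to.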

Probably we should refactor the parts removal mechanism, because it has become quite complex and fragile with all the flags (is_temp, keep_shared_data, force_keep_shared_data), special states (DeleteOnDestroy), the IDisk interface and projections.

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Dec 3, 2021
@tavplubix tavplubix added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label Dec 3, 2021
@alesapin alesapin self-assigned this Dec 3, 2021
@alesapin
Member

alesapin commented Dec 3, 2021

Wow, my favorite method changed.

@tavplubix
Member Author

Integration tests (asan, actions) - Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Stress tests - memory limit

}
if (parent_part)
{
std::optional<bool> keep_shared_data = keepSharedDataInDecoupledStorage();
Member

Obscure code for me, but Ok.

if (parent_part)
{
std::optional<bool> keep_shared_data = keepSharedDataInDecoupledStorage();
if (!keep_shared_data.has_value())
Member

What does it mean....?

Member Author

I failed to understand the logic around force_keep_shared_data


part = get_part();
// The fetched part is valuable and should not be cleaned like a temp part.
part->is_temp = false;
Member

Probably we shouldn't remove this.

@tavplubix
Member Author

Fast test - 01053_window_view_proc_hop_to_now and 01054_window_view_proc_tumble_to are flaky, cc: @kssenii, @Vxider

@tavplubix
Member Author

@Mergifyio update

@mergify
Contributor

mergify bot commented Dec 10, 2021

update

✅ Branch has been successfully updated

@Vxider
Contributor

Vxider commented Dec 10, 2021

Fast test - 01053_window_view_proc_hop_to_now and 01054_window_view_proc_tumble_to are flaky

I think it might be caused by the system freezing, so that SELECT sleep(3) is not long enough to get the results. I could try to rewrite these flaky tests as .sh tests that retry in a loop to fetch the results instead of sleeping. For now, we can disable these tests if possible.
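
The "retry in a loop instead of sleep" idea is not specific to shell. As a minimal sketch, assuming a caller-supplied check, a polling helper with a deadline could look like this (`pollUntil` and its parameters are hypothetical names, not part of the ClickHouse test framework):

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls check() repeatedly until it returns true or the timeout expires.
// Returns true as soon as check() succeeds, false if the deadline passes first.
bool pollUntil(
    const std::function<bool()> & check,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval)
{
    assert(interval >= std::chrono::milliseconds::zero());
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true)
    {
        if (check())
            return true;
        if (std::chrono::steady_clock::now() >= deadline)
            return false;
        std::this_thread::sleep_for(interval);
    }
}
```

Unlike a fixed sleep, the loop returns as soon as the result is available and only spends the full timeout when the system is slow.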

robot-clickhouse pushed a commit that referenced this pull request Dec 10, 2021
robot-clickhouse pushed a commit that referenced this pull request Dec 10, 2021
robot-clickhouse pushed a commit that referenced this pull request Dec 10, 2021
robot-clickhouse pushed a commit that referenced this pull request Dec 10, 2021
alexey-milovidov added a commit that referenced this pull request Dec 11, 2021
Backport #32201 to 21.12: Try fix 'Directory tmp_merge_<part_name>' already exists
tavplubix added a commit that referenced this pull request Dec 16, 2021
Backport #32201 to 21.10: Try fix 'Directory tmp_merge_<part_name>' already exists
tavplubix added a commit that referenced this pull request Dec 16, 2021
Backport #32201 to 21.9: Try fix 'Directory tmp_merge_<part_name>' already exists
tavplubix added a commit that referenced this pull request Dec 16, 2021
Backport #32201 to 21.11: Try fix 'Directory tmp_merge_<part_name>' already exists
tavplubix added a commit that referenced this pull request Dec 17, 2021
Backport #32201 to 21.8: Try fix 'Directory tmp_merge_<part_name>' already exists
@Felixoid Felixoid added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jul 14, 2022
Development

Successfully merging this pull request may close these issues.

  • The problem of data part merging while using zero copy replication
  • Temporary directory for merged part already exist
