
@alexey-milovidov
Member

alexey-milovidov commented May 29, 2020

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Revert #11029 because it leads to deadlock:

2020.05.29 19:27:21.595937 [ 122371 ] {} <Debug> SystemLog (system.metric_log): Existing table system.metric_log for system log has obsolete or different structure. Renaming it to metric_log_38
2020.05.29 19:29:21.596798 [ 122431 ] {} <Trace> SystemLog (system.query_log): Terminating
2020.05.29 19:29:21.596893 [ 122467 ] {} <Trace> SystemLog (system.part_log): Terminating
2020.05.29 19:29:21.596954 [ 122415 ] {} <Trace> SystemLog (system.trace_log): Terminating
2020.05.29 19:29:22.595544 [ 122402 ] {} <Trace> MergeTreeSequentialSource: Reading 20 marks from part 202005_3040_3662_491, total 152376 rows starting from the beginning of the part
2020.05.29 19:29:22.595559 [ 122407 ] {} <Trace> MergeTreeSequentialSource: Reading 134 marks from part 202005_588809_593890_4631, total 135213 rows starting from the beginning of the part
2020.05.29 19:29:22.595571 [ 122400 ] {} <Trace> MergeTreeSequentialSource: Reading 2 marks from part 202005_134371_134371_0, total 7 rows starting from the beginning of the part
2020.05.29 19:29:22.595589 [ 122403 ] {} <Trace> MergeTreeSequentialSource: Reading 7 marks from part 202005_3512_4011_100, total 5704 rows starting from the beginning of the part
2020.05.29 19:29:22.595597 [ 122408 ] {} <Trace> MergeTreeSequentialSource: Reading 2 marks from part 202005_594452_594452_0, total 39 rows starting from the beginning of the part
2020.05.29 19:29:22.595629 [ 122395 ] {} <Trace> MergeTreeSequentialSource: Reading 6 marks from part 202005_131094_134368_1766, total 24590 rows starting from the beginning of the part
2020.05.29 19:29:22.597069 [ 122400 ] {} <Trace> MergeTreeSequentialSource: Reading 2 marks from part 202005_134372_134372_0, total 1 rows starting from the beginning of the part
2020.05.29 19:29:22.609284 [ 122402 ] {} <Trace> MergeTreeSequentialSource: Reading 5 marks from part 202005_3663_3775_23, total 27680 rows starting from the beginning of the part
2020.05.29 19:29:22.617319 [ 122407 ] {} <Trace> MergeTreeSequentialSource: Reading 14 marks from part 202005_593891_594451_112, total 13284 rows starting from the beginning of the part
2020.05.29 19:29:22.643153 [ 122400 ] {} <Debug> system.metric_log (MergerMutator): Merge sorted 23 rows, containing 215 columns (215 merged, 0 gathered) in 121.050595204 sec., 0.19000319627705545 rows/sec., 324.91 B/sec.
2020.05.29 19:29:22.658152 [ 122400 ] {} <Trace> system.metric_log: Renaming temporary part tmp_merge_202005_134369_134372_1 to 202005_134369_134372_1.
2020.05.29 19:29:22.658425 [ 122400 ] {} <Trace> system.metric_log (MergerMutator): Merged 4 parts: from 202005_134369_134369_0 to 202005_134372_134372_0
2020.05.29 19:29:22.658460 [ 122400 ] {} <Debug> MemoryTracker: Peak memory usage: 8.00 MiB.
2020.05.29 19:29:22.691273 [ 122408 ] {} <Trace> MergeTreeSequentialSource: Reading 2 marks from part 202005_594453_594453_0, total 21 rows starting from the beginning of the part
2020.05.29 19:29:22.700304 [ 122408 ] {} <Trace> MergeTreeSequentialSource: Reading 2 marks from part 202005_594454_594454_0, total 18 rows starting from the beginning of the part
2020.05.29 19:29:22.694426 [ 122371 ] {} <Error> Application: Caught exception while loading metadata: Code: 473, e.displayText() = DB::Exception: WRITE locking attempt on "system.metric_log" has timed out! (120000ms) Possible deadlock avoided. Client should retry., Stack trace (when copying this message, always include the lines below):

0. /build/build_docker/../contrib/poco/Foundation/src/Exception.cpp:27: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x10883c80 in /opt/milovidov/clickhouse.5
1. /build/build_docker/../src/Common/Exception.cpp:32: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x925a21d in /opt/milovidov/clickhouse.5
2. /build/build_docker/../contrib/libcxx/include/string:2134: DB::IStorage::tryLockTimed(std::__1::shared_ptr<DB::RWLockImpl> const&, DB::RWLockImpl::Type, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::SettingTimespan<(DB::SettingTimespanIO)1> const&) const (.cold) @ 0xdc00eb4 in /opt/milovidov/clickhouse.5
3. /build/build_docker/../contrib/libcxx/include/type_traits:3696: DB::IStorage::lockExclusively(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::SettingTimespan<(DB::SettingTimespanIO)1> const&) @ 0xdbfdeb8 in /opt/milovidov/clickhouse.5
4. /build/build_docker/../contrib/libcxx/include/memory:4081: DB::DatabaseOnDisk::renameTable(DB::Context const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::IDatabase&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) @ 0xd5d8cb5 in /opt/milovidov/clickhouse.5
5. /build/build_docker/../contrib/libcxx/include/memory:4206: DB::InterpreterRenameQuery::execute() @ 0xd7bde82 in /opt/milovidov/clickhouse.5
6. /build/build_docker/../src/Interpreters/SystemLog.h:466: DB::SystemLog<DB::MetricLogElement>::prepareTable() @ 0x92d1505 in /opt/milovidov/clickhouse.5
7. /build/build_docker/../src/Interpreters/SystemLog.cpp:105: DB::SystemLogs::SystemLogs(DB::Context&, Poco::Util::AbstractConfiguration const&) @ 0x92b15e9 in /opt/milovidov/clickhouse.5
8. /build/build_docker/../contrib/libcxx/include/__mutex_base:140: DB::Context::initializeSystemLogs() @ 0xd548931 in /opt/milovidov/clickhouse.5
9. /build/build_docker/../programs/server/Server.cpp:596: DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) @ 0x92e47d8 in /opt/milovidov/clickhouse.5
10. /build/build_docker/../contrib/poco/Util/src/Application.cpp:334: Poco::Util::Application::run() @ 0x107b2cd7 in /opt/milovidov/clickhouse.5
11. /build/build_docker/../programs/server/Server.cpp:185: DB::Server::run() @ 0x92b2989 in /opt/milovidov/clickhouse.5
12. /build/build_docker/../programs/server/Server.cpp:1076: mainEntryClickHouseServer(int, char**) @ 0x92a8883 in /opt/milovidov/clickhouse.5
13. /build/build_docker/../contrib/libcxx/include/vector:461: main @ 0x9255919 in /opt/milovidov/clickhouse.5
14. /build/glibc_2.27-3ubuntu1yandex1/glibc-2.27/csu/../csu/libc-start.c:344: __libc_start_main @ 0x21bf7 in /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.27.so
15. _start @ 0x925502e in /opt/milovidov/clickhouse.5
 (version 20.5.1.1 (official build))

@blinkov added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) on May 29, 2020
@alexey-milovidov
Member Author

First commit 4e9a326 will go to master.

@alexey-milovidov
Member Author

Superseded by #11307

@azat
Member

azat commented Jun 6, 2020

@alexey-milovidov do you have details on why the deadlock occurred?

@alexey-milovidov
Member Author

alexey-milovidov commented Jun 6, 2020

We have loaded system database and started up tables. Some tables started to perform background merge. Then we initialize SystemLogs and it tries to do RENAME TABLE.

This introduces a sequence of R, W, R lock requests:

thread 1: R   R
thread 2:   W

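This causes a deadlock because our RWLocks are fair: thread 1's second R queues behind the pending W, while W waits for thread 1 to release its first R.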
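
The R W R sequence can be illustrated with a small model. This is only a sketch, not ClickHouse's RWLock: `FairRWLockModel` and its `request` method are hypothetical names, the model is single-threaded, and it never releases anything. It assumes that "fair" means strict FIFO granting, so a new reader may not overtake a writer that is already waiting.

```python
from collections import deque


class FairRWLockModel:
    """Toy single-threaded model of a fair (FIFO) reader-writer lock."""

    def __init__(self):
        self.queue = deque()  # entries: [thread_id, mode, granted]

    def request(self, thread_id, mode):
        """Queue a request; return True if granted at once, False if it must wait."""
        if mode == "W":
            # A writer needs the lock to itself.
            granted = not self.queue
        else:
            # Fairness: a reader is granted only if every earlier request
            # is an already granted read. It may not overtake a waiting writer.
            granted = all(g and m == "R" for _, m, g in self.queue)
        self.queue.append([thread_id, mode, granted])
        return granted


lock = FairRWLockModel()
first_read = lock.request(1, "R")   # thread 1: R
write = lock.request(2, "W")        # thread 2: W (e.g. RENAME TABLE)
second_read = lock.request(1, "R")  # thread 1: R again, queued behind W

print(first_read, write, second_read)  # True False False
```

Thread 1 is now blocked on its second R, so it can never release the first one that W is waiting for. With a lock that lets readers overtake waiting writers, the second R would be granted and the cycle would not form.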

@azat
Member

azat commented Jun 6, 2020

Got it, but what prevents the same deadlock when the flusher thread creates the table?

@alexey-milovidov
Member Author

I think that there is no difference, it can lead to the same deadlock (that will be timed out after 120 seconds by default).
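
The timeout behaviour mentioned here (give up after a fixed wait instead of blocking forever) can be sketched as follows. This is an illustration only, not the actual `tryLockTimed` code: `LockTimeout` and `lock_exclusively` are hypothetical names, a plain `threading.Lock` stands in for the table lock, and the timeout is shortened from 120000 ms so the example finishes quickly.

```python
import threading


class LockTimeout(Exception):
    """Raised when a lock cannot be acquired within the timeout."""


def lock_exclusively(lock, name, timeout_ms):
    # Lock.acquire returns False if the timeout expires before the lock is free.
    if not lock.acquire(timeout=timeout_ms / 1000):
        raise LockTimeout(
            f'WRITE locking attempt on "{name}" has timed out! '
            f"({timeout_ms}ms) Possible deadlock avoided."
        )


table_lock = threading.Lock()
table_lock.acquire()  # stands in for a holder that never releases the lock

try:
    lock_exclusively(table_lock, "system.metric_log", timeout_ms=50)
    outcome = "acquired"
except LockTimeout as e:
    outcome = "timed out"
    message = str(e)

print(outcome)  # timed out
```

The timeout turns a permanent hang into an error after a bounded wait, but the caller still has to handle or retry the failed operation.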

@alexey-milovidov
Member Author

alexey-milovidov commented Jun 6, 2020

It will be fixed automatically when we change the engine of the system database to Atomic (#7512), but only for new servers.

@azat
Member

azat commented Jun 6, 2020

> I think that there is no difference, it can lead to the same deadlock (that will be timed out after 120 seconds by default).

Indeed, so how about restoring this config directive?

@alexey-milovidov
Member Author

It's harmful, because it may prevent the server from starting up for 120 seconds.

@azat
Member

azat commented Jun 6, 2020

> It's harmful, because it may prevent the server from starting up for 120 seconds.

And not only that, but in the end the server will not start at all.

> It will be fixed automatically when we change the engine of the system database to Atomic (#7512)

Any ETA?

> We have loaded system database and started up tables. Some tables started to perform background merge. Then we initialize SystemLogs and it tries to do RENAME TABLE.

FWIW, I finally came up with a reproducible test for this. (I was stuck for a while because SYSTEM STOP MERGES without database.table stops merges only for existing tables, which makes it tricky.)

@azat
Member

azat commented Jun 9, 2020

Maybe it is a good idea to make SYSTEM FLUSH LOGS query create tables even if the queue is empty? (this will be enough for most cases I guess)

@alexey-milovidov
Member Author

Yes, it will be Ok.
