Fix getauxval() in glibc-compatibility and fix some leaks (after LSan started to work) #33957
What are the exact cases when it does not work? |
This is actually a logical leak like unbounded container growth. |
Right.
This does not mean that those syscalls will not work; rather, it means that an invalid memory access will occur if vsyscall was not initialized. The problem is that after I've tried to reproduce the problem with clock_gettime, but failed. I guess the reason is that it is called before the first setenv, and so the pointer had already changed (and so it does not require access to I saw this issue in #32928, where rocksdb as a cache layer for MergeTree metadata had been introduced, and rocksdb requires |
|
Yes, last time I checked, ClickHouse did not work without /proc on Linux. |
8dbfbcb to
b8149ad
Compare
|
Now LSan looks happy |
|
I would also appreciate any help with #31833. We will be able to build |
getauxval() from glibc-compatibility did not always work correctly:
- It does not work after setenv(), and this breaks vsyscalls, like
  sched_getcpu() [1] (and BaseDaemon.cpp always sets TZ if timezone is
  defined, which is true for CI [2]).
  [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1163404
  [2]: ClickHouse#32928 (comment)
- Another thing that is definitely broken is LSan (Leak Sanitizer): it
  relies on a working getauxval(), but it does not work if __environ is not
  initialized yet (there is even a commit about this).
  And because of this, at least one leak had been introduced [3]:
  [3]: ClickHouse#33840
Fix this by using /proc/self/auxv.
And let's see how many issues LSan will find...
I've verified this patch manually by printing AT_BASE and comparing it
with the output of LD_SHOW_AUXV.
Signed-off-by: Azat Khuzhin <[email protected]>
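To illustrate the idea of taking the auxiliary vector from /proc/self/auxv instead of deriving it from the environment pointer, here is a minimal sketch. It is not the code from this patch: the helper name auxv_from_procfs is hypothetical, and it assumes Linux, where the file holds pairs of native machine words (type, value) terminated by an AT_NULL entry.

```cpp
#include <sys/auxv.h>   // AT_NULL, AT_PAGESZ, AT_BASE, getauxval() (Linux, glibc)
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>
#include <optional>

// Hypothetical helper (not the function added by the patch): look up one
// auxv entry by scanning /proc/self/auxv. Returns std::nullopt if procfs is
// unavailable, unreadable, or the entry is not present.
std::optional<unsigned long> auxv_from_procfs(unsigned long type)
{
    int fd = open("/proc/self/auxv", O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return std::nullopt;

    std::optional<unsigned long> result;
    for (;;)
    {
        unsigned long entry[2];  // {type, value}, native word size
        char * dst = reinterpret_cast<char *>(entry);
        std::size_t got = 0;
        bool failed = false;

        // Read exactly one pair, retrying on EINTR and short reads.
        while (got < sizeof(entry))
        {
            ssize_t n = read(fd, dst + got, sizeof(entry) - got);
            if (n < 0 && errno == EINTR)
                continue;
            if (n <= 0)
            {
                failed = true;
                break;
            }
            got += static_cast<std::size_t>(n);
        }

        // Stop on error, end of file, or the AT_NULL terminator.
        if (failed || entry[0] == AT_NULL)
            break;

        if (entry[0] == type)
        {
            result = entry[1];
            break;
        }
    }

    close(fd);
    return result;
}
```

A value returned by auxv_from_procfs(AT_BASE) can be compared by hand with the AT_BASE line printed when the same binary is started with LD_SHOW_AUXV=1, which mirrors the manual check described in the commit message.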
CI found the following after LSan had been fixed [1]:
01889_sqlite_read_write: [ FAIL ] 8.32 sec. - return code: 1
=================================================================
==20649==ERROR: LeakSanitizer: detected memory leaks
Indirect leak of 1968 byte(s) in 1 object(s) allocated from:
0 0xc5c1ffd in operator new(unsigned long) (/usr/bin/clickhouse+0xc5c1ffd)
1 0x25e32d0d in std::__1::__unique_if<DB::StorageInMemoryMetadata>::__unique_single std::__1::make_unique<DB::StorageInMemoryMetadata, DB::StorageInMemoryMetadata const&>(DB::StorageInMemoryMetadata c>
2 0x25e32d0d in DB::IStorage::setInMemoryMetadata(DB::StorageInMemoryMetadata const&) obj-x86_64-linux-gnu/../src/Storages/IStorage.h:194:22
3 0x29bdee98 in DB::StorageSQLite::StorageSQLite(DB::StorageID const&, std::__1::shared_ptr<sqlite3>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std:>
4 0x25ee61d6 in std::__1::shared_ptr<DB::StorageSQLite> shared_ptr_helper<DB::StorageSQLite>::create<DB::StorageID, std::__1::shared_ptr<sqlite3> const&, std::__1::basic_string<char, std::__1::char_tr>
5 0x25ee61d6 in DB::TableFunctionSQLite::executeImpl(std::__1::shared_ptr<DB::IAST> const&, std::__1::shared_ptr<DB::Context const>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1:>
SUMMARY: AddressSanitizer: 171256 byte(s) leaked in 130 allocation(s).
[1]: https://github.com/ClickHouse/ClickHouse/runs/4929706698?check_suite_focus=true
Signed-off-by: Azat Khuzhin <[email protected]>
Interesting, more leaks were found, although there were no leaks in the previous run. I will take a look. |
LSan found [1]:
Direct leak of 5170176 byte(s) in 5049 object(s) allocated from:
0 0xc598edd in malloc (/usr/bin/clickhouse+0xc598edd)
1 0x39679739 in (anonymous namespace)::itanium_demangle::initializeOutputStream(char*, unsigned long*, (anonymous namespace)::itanium_demangle::OutputStream&, unsigned long) obj-x86_64-linux-gnu/../contrib/libcxxabi/src/demangle/Utility.h:178:31
2 0x39679739 in __cxa_demangle obj-x86_64-linux-gnu/../contrib/libcxxabi/src/cxa_demangle.cpp:351:13
3 0x28f6f3ed in DB::executeQueryImpl(char const*, char const*, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) obj-x86_64-linux-gnu/../src/Interpreters/executeQuery.cpp:662:44
[1]:
https://s3.amazonaws.com/clickhouse-test-reports/33957/08f4f45fd9da923ae3e3fdd8a527c297d35247eb/stress_test__address__actions_.html
Signed-off-by: Azat Khuzhin <[email protected]>
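The trace above points at a buffer allocated inside __cxa_demangle. When no output buffer is passed in, abi::__cxa_demangle returns memory obtained with malloc, and the caller has to release it with free. The following is a hedged sketch of one common way to make that ownership explicit; the function name demangle_owned is hypothetical and this is not necessarily how the leak was fixed in executeQuery.cpp.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (Itanium C++ ABI)
#include <cstdlib>    // std::free
#include <memory>
#include <string>

// Hypothetical helper: demangle a symbol and copy the result into a
// std::string, so the malloc'ed buffer is always released.
std::string demangle_owned(const char * mangled)
{
    int status = 0;

    // The unique_ptr calls free() on every path, including exceptions
    // thrown while constructing the returned std::string.
    std::unique_ptr<char, decltype(&std::free)> buffer(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status),
        &std::free);

    // On failure the function returns nullptr and a non-zero status;
    // fall back to the original (mangled) name in that case.
    if (status != 0 || !buffer)
        return std::string(mangled);

    return std::string(buffer.get());
}
```

For example, demangle_owned("_ZN2DB8IStorageE") yields "DB::IStorage", and a string that is not a valid mangled name is returned unchanged.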
CI reports [1]:
Indirect leak of 648 byte(s) in 9 object(s) allocated from:
...
2 0x12b96503 in DB::AggregateFunctionSimpleState::getReturnType() const obj-x86_64-linux-gnu/../src/AggregateFunctions/AggregateFunctionSimpleState.h:47:15
...
[1]: https://s3.amazonaws.com/clickhouse-test-reports/33957/08f4f45fd9da923ae3e3fdd8a527c297d35247eb/stress_test__address__actions_.html
Afterwards, we can find the query by using the query_log artifact:
$ wget https://s3.amazonaws.com/clickhouse-test-reports/33957/08f4f45fd9da923ae3e3fdd8a527c297d35247eb/stress_test__address__actions_/query_log_dump.tar
$ tar -xf query_log_dump.tar
$ clickhouse-local --path var/lib/clickhouse/
SELECT query
FROM system.query_log
ARRAY JOIN used_aggregate_function_combinators AS func
WHERE has(used_aggregate_functions, 'groupBitOr') AND has(used_aggregate_function_combinators, 'SimpleState') AND (type != 'QueryStart')
Query id: 5b7722b3-f77e-4e7e-bd0b-586d6d32a899
┌─query────────────────────────────────────────────────────────────────────────────┐
│ with groupBitOrSimpleState(number) as c select toTypeName(c), c from numbers(1); │
└──────────────────────────────────────────────────────────────────────────────────┘
Fixes: 01570_aggregator_combinator_simple_state.sql
Fixes: ClickHouse#16853
Signed-off-by: Azat Khuzhin <[email protected]>
Changelog category (leave one):
-Improvement
+Bug Fix (user-visible misbehaviour in official stable or prestable release)
@alexey-milovidov There were some leaks before, not that significant, but still maybe worth backporting. |
I've fixed a few leaks, but apparently one leak is still left. It is hard to trigger, and it is due to some exception safety issue in the Aggregator code... Will take a look later. |
|
It does not work at all: |
|
@azat maybe better solution is to implement your own |
|
I've already destroyed the machine. It works if I use |
Indeed.
Ok, I guess it can be related to some LSM (yama or similar) |
|
getauxval() from glibc-compatibility did not always work correctly:
- It does not work after setenv(), and this breaks vsyscalls, like
  sched_getcpu() [1] (and BaseDaemon.cpp always sets TZ if timezone is
  defined, which is true for CI [2]).
  Also note that fixing setenv() will not fix LSan, since the culprit is
  getauxval().
  [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1163404
  [2]: ClickHouse#32928 (comment)
- Another thing that is definitely broken is LSan (Leak Sanitizer): it
  relies on a working getauxval(), but it does not work if __environ is not
  initialized yet (there is even a commit about this).
  And because of this, at least one leak had been introduced [3]:
  [3]: ClickHouse#33840
Fix this by using /proc/self/auxv, with a fallback to the environ solution
to make it compatible with environments that do not allow reading from
auxv (or have no procfs).
v2: add fallback to environ solution
v3: fix return value for __auxv_init_procfs()
Refs: ClickHouse#33957
Signed-off-by: Azat Khuzhin <[email protected]>
getauxval() from glibc-compatibility did not always work correctly:
- It does not work after setenv(), and this breaks vsyscalls, like
  sched_getcpu() [1] (and BaseDaemon.cpp always sets TZ if timezone is
  defined, which is true for CI [2]).
  Also note that fixing setenv() will not fix LSan, since the culprit is
  getauxval().
  [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1163404
  [2]: ClickHouse#32928 (comment)
- Another thing that is definitely broken is LSan (Leak Sanitizer): it
  relies on a working getauxval(), but it does not work if __environ is not
  initialized yet (there is even a commit about this).
  And because of this, at least one leak had been introduced [3]:
  [3]: ClickHouse#33840
Fix this by using /proc/self/auxv, with a fallback to the environ solution
to make it compatible with environments that do not allow reading from
auxv (or have no procfs).
v2: add fallback to environ solution
v3: fix return value for __auxv_init_procfs()
(cherry picked from commit f187c34)
v4: more verbose message on errors, CI found [1]:
    AUXV already has value (529267711)
    [1]: https://s3.amazonaws.com/clickhouse-test-reports/39103/2325f7e8442d1672ce5fb43b11039b6a8937e298/stress_test__memory__actions_.html
v5: break at AT_NULL
v6: ignore AT_IGNORE
v7: suppress TSan and remove superior check to avoid abort() in case of race
v8: proper suppressions (not inner function but itself)
Refs: ClickHouse#33957
Signed-off-by: Azat Khuzhin <[email protected]>
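The v5 and v6 notes above concern how the raw (type, value) pairs are scanned: stop at the AT_NULL terminator and skip AT_IGNORE entries. A minimal sketch of such a scan over an in-memory buffer is shown below. The names parse_auxv, AuxTable and kAuxCount, as well as the table size, are assumptions made for illustration and are not taken from the patch.

```cpp
#include <sys/auxv.h>   // AT_NULL, AT_IGNORE, AT_PAGESZ, AT_UID, AT_GID (Linux)
#include <array>
#include <cstddef>

// Assumed table size for this sketch; the real code may use another bound.
constexpr std::size_t kAuxCount = 64;
using AuxTable = std::array<unsigned long, kAuxCount>;

// Hypothetical helper: build a lookup table indexed by entry type from a flat
// array of words laid out as {type0, value0, type1, value1, ...}.
AuxTable parse_auxv(const unsigned long * words, std::size_t word_count)
{
    AuxTable table{};  // zero-initialized: missing entries read as 0

    for (std::size_t i = 0; i + 1 < word_count; i += 2)
    {
        const unsigned long type = words[i];
        const unsigned long value = words[i + 1];

        if (type == AT_NULL)     // terminator: anything after it is not auxv
            break;
        if (type == AT_IGNORE)   // entry explicitly marked as "ignore me"
            continue;
        if (type < table.size()) // drop types that do not fit in the table
            table[type] = value;
    }

    return table;
}
```

With the input {AT_PAGESZ, 4096, AT_IGNORE, 123, AT_UID, 1000, AT_NULL, 0, AT_GID, 7}, the table ends up with AT_PAGESZ and AT_UID set, while the AT_IGNORE value and the AT_GID pair after the terminator are not stored.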
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix getauxval() in glibc-compatibility; this should fix vsyscalls after
setenv() (i.e. when the timezone is set in the config) and LSan (and also fix some leaks that were found by LSan).
getauxval() from glibc-compatibility did not always work correctly:
- It does not work after setenv(), and this breaks vsyscalls,
  like sched_getcpu() [1] (and BaseDaemon.cpp always sets TZ if timezone
  is defined, which is true for CI [2]).
- Another thing that is definitely broken is LSan (Leak Sanitizer): it
  relies on a working getauxval(), but it does not work if __environ is not
  initialized yet (there is even a commit about this).
  And because of this, at least one issue had not been caught before: "keeper: fix memory leak in case of compression is used (default)" #33840
Fix this by using /proc/self/auxv.
And let's see how many issues LSan will find...
I've verified this patch manually by printing AT_BASE and comparing it
with the output of LD_SHOW_AUXV.
Cc: @alesapin
Cc: @filimonov (#27492)
Cc: @alexey-milovidov (#28132)
Cc: @vitlibar (#15111)