-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<format> Properly handle multibyte encodings in format strings #1815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| template <class charT, class integral> | ||
| void test_intergal_specs() { | ||
| void test_integral_specs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intergal specs: 💃👓💃. (No change requested, just thought this was a funny typo.)
miscco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I have reviewed it, how can I make all this encoding horror unseen?
| auto _UResult = _RANGES _Find_unchecked( | ||
| _Get_unwrapped(_STD move(_First)), _Get_unwrapped(_STD move(_Last)), _Val, _Pass_fn(_Proj)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not get why we are pulling those algorithms out. We do not really want the bounds checks and we already have the unchecked iterators ready.
Why not simply call _RANGES _Find_unchecked and _RANGES _Copy_unchecked ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For calling directly with range arguments when we don't have iterators handy, it's much simpler to call the range algorithm. (Recall that Range algorithms with range arguments don't perform bounds checks - they trust that begin() and end() return a valid range.)
| [[fallthrough]]; | ||
| case 1: | ||
| return 1; // all characters have only one code unit | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is it with these newlines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a switch with cases that fall through, I like to use newlines after only the cases that do not fall through as a visual reinforcement of the control flow structure.
|
|
||
| // TRANSITION, add support for unicode/wide formats | ||
| _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]); | ||
| const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I belive this is broken
- We should use
_RANGES _Unchecked_find - If we have wide characters this is only one element, so we should not search
- If we have plain characters you just changed
_Fillto only contain _CharType() so you will always findend(_Specs._Fill)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might want to just store the length in character types of the fill character array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm too lazy to call
beginandendon a range when the range algorithm can do it for me. - Wide characters are UTF-16 on our implementation, so two may be needed to represent a fill character (the test case that uses 🏈, for example).
meow woof[4] = {quack};is equivalent tomeow woof[4] = {quack, {}, {}, {}};, not tomeow woof[4] = {quack, quack, quack, quack};.
we might want to just store the length in character types of the fill character array
I'm largely indifferent; it's just another char in _Basic_format_specs. I'll happily make the change if you think it's warranted; please say so more definitively than "we might."
| _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]); | ||
| const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')}; | ||
| for (; _Fill_left > 0; --_Fill_left) { | ||
| _Out = _RANGES copy(_Fill_char, _STD move(_Out)).out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto _RANGES _Unchecked_copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto "I see no benefit commensurate with the additional code". Do you?
|
|
||
| // TRANSITION, add support for unicode/wide formats | ||
| _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]); | ||
| const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might want to just store the length in character types of the fill character array
| } | ||
| } | ||
|
|
||
| case 4: // Assume UTF-8 (as does _Mbrtowc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are non-UTF-8 4-byte encodings (e.g. GB18030/code page 54936) possible here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function (which is effectively a partially-inlined call to _Mbrtowc, which implements the equivalent for std::codecvt) and _Mbrtowc certainly don't think so. I'd like to do better, but I'm not prepared to overhaul how our locale machinery deals with character encodings in this PR.
Background: _Mbrtowc is a wrapper around a call to MultiByteToWideChar which relies on the anachronistic assumption that any encoded character can be converted to a single UCS-2 code unit. This is clearly wrong and has been for many years, but it's the mechanism we have right now for determining the extent of multibyte characters. I think we can do better, especially now that ICU is available so we may not need to write and maintain enormous quantities of per-supported-encoding code.
stl/inc/format
Outdated
| const auto _Len = static_cast<size_t>(_Last - _First); | ||
|
|
||
| if (_Ch < 0b1110'0000) { | ||
| if (_Ch >= 0b1100'0000 && _Len >= 2) { | ||
| return 2; | ||
| } | ||
|
|
||
| // non-lead byte or partial code unit | ||
| return -1; | ||
| } | ||
|
|
||
| if (_Ch < 0b1111'0000) { | ||
| if (_Len >= 3) { | ||
| return 3; | ||
| } | ||
|
|
||
| // partial code unit | ||
| return -1; | ||
| } | ||
|
|
||
| if (_Len >= 4) { | ||
| return 4; | ||
| } | ||
|
|
||
| // partial code unit | ||
| return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't properly nor consistently handle all invalid UTF-8. We should either treat the partial/invalid code units as a "�" (details omitted) upon all invalid UTF-8 and continue parsing (this is the preferred default error handling method for decoding regular UTF-8 text), or throw format_error upon all invalid UTF-8.
Test cases:
| Next bytes | Valid UTF-8? | Non-throwing | Throwing | Current implementation |
|---|---|---|---|---|
"\xc2\xa2" |
Yes | 2 "¢" |
2 | 2 |
"\xc0\x80" |
No | 1 "��" |
-1 | 2 |
"\xc0\x7b" |
No | 1 "�{" |
-1 | 2 (bad!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made no attempt to validate the encoding of the format string - we don't examine continuation bytes at all. I've only done the minimal amount of work needed to avoid running off the end of the format string. This seems reasonable for code that parses format strings, which are generally trusted.
…::operator*() error LNK2019: unresolved external symbol".
|
I've pushed two commits which should resolve the test failures:
|
|
x86 is passing, but there are truncation warnings-as-errors when building the STL itself for x64. |
... and simply saturate to `INT_MAX`.
| ptrdiff_t _Width = _Specs._Precision; | ||
| const _CharT* _Last = _Measure_string_prefix(_Value, _Width); | ||
|
|
||
| return _Write_aligned(_STD move(_Out), _Width, _Specs, _Align::_Left, [=](_OutputIt _Out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdness: we don't do width estimation for fill characters, the spec simply assumes they have width 1. Consequently, format("{:3.2}", "日本地図") yields "日 ", whereas format("{:本<3.2}", "日本地図") yields "日本". This seems like a standard defect.
This was due to my lazy attempt to avoid overflow checking by using |
stl/inc/format
Outdated
|
|
||
| const auto _Len = static_cast<size_t>(_Last - _First); | ||
|
|
||
| if (_Ch < 0b1110'0000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible future performance optimization would be branch prediction annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this, but decided to avoid them. The vague "arbitrarily more/less likely" wording in the Standard means you don't know if a compiler will reorder branches a bit, or move unlikely branches off into the wilderness with cold exception-throwing code. Something like the latter can be catastrophic to performance when you really are just making a guess about the distribution of inputs. I'm happy with ordering the branches by what we feel is decreasing frequency and leaving anything further to folks running PGO with (hopefully) representative inputs.
barcharcraz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good after those minor nodiscard changes.
Adds support for multibyte fill characters and width estimation for strings.
Pushes
_RANGES copyand_RANGES fillup from<algorithm>into<xutility>. (Should be a pure cut'n'paste.)Fixes #1576.
Draft for now because I need some more test coverage for width estimation with and without alignment.