Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@CaseyCarter
Copy link
Contributor

Adds support for multibyte fill characters and width estimation for strings.

Pushes _RANGES copy and _RANGES fill up from <algorithm> into <xutility>. (Should be a pure cut'n'paste.)

Fixes #1576.

Draft for now because I need some more test coverage for width estimation with and without alignment.

@CaseyCarter CaseyCarter added cxx20 C++20 feature format C++20/23 format labels Apr 9, 2021
@CaseyCarter CaseyCarter requested a review from a team as a code owner April 9, 2021 03:34

template <class charT, class integral>
void test_intergal_specs() {
void test_integral_specs() {
Copy link
Contributor Author

@CaseyCarter CaseyCarter Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intergal specs: 💃👓💃. (No change requested, just thought this was a funny typo.)

@CaseyCarter CaseyCarter marked this pull request as draft April 9, 2021 03:57
@CaseyCarter CaseyCarter marked this pull request as ready for review April 9, 2021 06:16
Copy link
Contributor

@miscco miscco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I have reviewed it, how can I make all this encoding horror unseen?

Comment on lines -411 to -412
auto _UResult = _RANGES _Find_unchecked(
_Get_unwrapped(_STD move(_First)), _Get_unwrapped(_STD move(_Last)), _Val, _Pass_fn(_Proj));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not get why we are pulling those algorithms out. We do not really want the bounds checks and we already have the unchecked iterators ready.

Why not simply call _RANGES _Find_unchecked and _RANGES _Copy_unchecked ?

Copy link
Contributor Author

@CaseyCarter CaseyCarter Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For calling directly with range arguments when we don't have iterators handy, it's much simpler to call the range algorithm. (Recall that Range algorithms with range arguments don't perform bounds checks - they trust that begin() and end() return a valid range.)

[[fallthrough]];
case 1:
return 1; // all characters have only one code unit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is it with these newlines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a switch with cases that fall through, I like to use newlines after only the cases that do not fall through as a visual reinforcement of the control flow structure.


// TRANSITION, add support for unicode/wide formats
_Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]);
const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I belive this is broken

  1. We should use _RANGES _Unchecked_find
  2. If we have wide characters this is only one element, so we should not search
  3. If we have plain characters you just changed _Fill to only contain _CharType( ) so you will always find end(_Specs._Fill)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to just store the length in character types of the fill character array

Copy link
Contributor Author

@CaseyCarter CaseyCarter Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm too lazy to call begin and end on a range when the range algorithm can do it for me.
  2. Wide characters are UTF-16 on our implementation, so two may be needed to represent a fill character (the test case that uses 🏈, for example).
  3. meow woof[4] = {quack}; is equivalent to meow woof[4] = {quack, {}, {}, {}};, not to meow woof[4] = {quack, quack, quack, quack};.

we might want to just store the length in character types of the fill character array

I'm largely indifferent; it's just another char in _Basic_format_specs. I'll happily make the change if you think it's warranted; please say so more definitively than "we might."

_Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]);
const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')};
for (; _Fill_left > 0; --_Fill_left) {
_Out = _RANGES copy(_Fill_char, _STD move(_Out)).out;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto _RANGES _Unchecked_copy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto "I see no benefit commensurate with the additional code". Do you?

@barcharcraz barcharcraz changed the title Properly handle multibyte encodings in format strings <format> Properly handle multibyte encodings in format strings Apr 9, 2021

// TRANSITION, add support for unicode/wide formats
_Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]);
const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to just store the length in character types of the fill character array

}
}

case 4: // Assume UTF-8 (as does _Mbrtowc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are non-UTF-8 4-byte encodings (e.g. GB18030/code page 54936) possible here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function (which is effectively a partially-inlined call to _Mbrtowc, which implements the equivalent for std::codecvt) and _Mbrtowc certainly don't think so. I'd like to do better, but I'm not prepared to overhaul how our locale machinery deals with character encodings in this PR.

Background: _Mbrtowc is a wrapper around a call to MultiByteToWideChar which relies on the anachronistic assumption that any encoded character can be converted to a single UCS-2 code unit. This is clearly wrong and has been for many years, but it's the mechanism we have right now for determining the extent of multibyte characters. I think we can do better, especially now that ICU is available so we may not need to write and maintain enormous quantities of per-supported-encoding code.

stl/inc/format Outdated
Comment on lines 424 to 449
const auto _Len = static_cast<size_t>(_Last - _First);

if (_Ch < 0b1110'0000) {
if (_Ch >= 0b1100'0000 && _Len >= 2) {
return 2;
}

// non-lead byte or partial code unit
return -1;
}

if (_Ch < 0b1111'0000) {
if (_Len >= 3) {
return 3;
}

// partial code unit
return -1;
}

if (_Len >= 4) {
return 4;
}

// partial code unit
return -1;
Copy link
Contributor

@statementreply statementreply Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't properly nor consistently handle all invalid UTF-8. We should either treat the partial/invalid code units as a "�" (details omitted) upon all invalid UTF-8 and continue parsing (this is the preferred default error handling method for decoding regular UTF-8 text), or throw format_error upon all invalid UTF-8.

Test cases:

Next bytes Valid UTF-8? Non-throwing Throwing Current implementation
"\xc2\xa2" Yes 2 "¢" 2 2
"\xc0\x80" No 1 "��" -1 2
"\xc0\x7b" No 1 "�{" -1 2 (bad!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made no attempt to validate the encoding of the format string - we don't examine continuation bytes at all. I've only done the minimal amount of work needed to avoid running off the end of the format string. This seems reasonable for code that parses format strings, which are generally trusted.

…::operator*() error LNK2019: unresolved external symbol".
@StephanTLavavej
Copy link
Member

I've pushed two commits which should resolve the test failures:

  • Remove constexpr from P0355R7_calendars_and_time_zones_formatting.
  • Workaround VSO-1308657 "Standard Library Header Units: std::projected::operator*() error LNK2019: unresolved external symbol".
    • The workaround is to add a definition that just calls abort().

@StephanTLavavej
Copy link
Member

x86 is passing, but there are truncation warnings-as-errors when building the STL itself for x64.

... and simply saturate to `INT_MAX`.
ptrdiff_t _Width = _Specs._Precision;
const _CharT* _Last = _Measure_string_prefix(_Value, _Width);

return _Write_aligned(_STD move(_Out), _Width, _Specs, _Align::_Left, [=](_OutputIt _Out) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdness: we don't do width estimation for fill characters, the spec simply assumes they have width 1. Consequently, format("{:3.2}", "日本地図") yields "日 ", whereas format("{:本<3.2}", "日本地図") yields "日本". This seems like a standard defect.

@CaseyCarter
Copy link
Contributor Author

x86 is passing, but there are truncation warnings-as-errors when building the STL itself for x64.

This was due to my lazy attempt to avoid overflow checking by using ptrdiff_t for width estimations. Those responsible have been sacked.

stl/inc/format Outdated

const auto _Len = static_cast<size_t>(_Last - _First);

if (_Ch < 0b1110'0000) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible future performance optimization would be branch prediction annotations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but decided to avoid them. The vague "arbitrarily more/less likely" wording in the Standard means you don't know if a compiler will reorder branches a bit, or move unlikely branches off into the wilderness with cold exception-throwing code. Something like the latter can be catastrophic to performance when you really are just making a guess about the distribution of inputs. I'm happy with ordering the branches by what we feel is decreasing frequency and leaving anything further to folks running PGO with (hopefully) representative inputs.

Copy link
Contributor

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good after those minor nodiscard changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cxx20 C++20 feature format C++20/23 format

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants