isdigit() expects an int, not an unsigned char #120

DimitriPapadopoulos · 2021-08-18T10:24:52Z

Rework #57.

LB-- · 2021-08-18T15:12:43Z

I do not understand this change, unsigned char is always implicitly convertible to int, and the documentation for isdigit explicitly states that the value passed to the function must be representable as unsigned char or the behavior is undefined. Could you provide some more info on how the current code causes an issue that this PR fixes?

DimitriPapadopoulos · 2021-08-18T15:31:41Z

It's all about the value, not the representation. The signature of the function is clear: it expects an int. See the isdigit(3p) man page:

The c argument is an int, the value of which the application shall ensure is a character representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined.

Now, if the argument is not representable as an unsigned char or equal to the value of the macro EOF, it's probably because there's an error elsewhere in the program. Casting to unsigned char will silently propagate the error and result in undefined behaviour. If you fear the value may exceed that of an unsigned char, then I suggest:

assert(c <= UCHAR_MAX)

Individual characters are routinely represented as int in C programs, because all character handling functions operate on int.
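To make the assertion idea above concrete, here is a minimal sketch, not code from json-parser. The wrapper name checked_isdigit is hypothetical, and the check is slightly wider than the one-line assert suggested above: it also rejects negative values and explicitly allows EOF, which the man page permits.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical wrapper: c is an int holding either EOF or a value in
 * 0..UCHAR_MAX, which is what fgetc() and getchar() return. */
static int checked_isdigit(int c)
{
    /* In debug builds, fail loudly on an out-of-range value instead of
     * letting it reach isdigit(), where the behavior would be undefined. */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    return isdigit(c) != 0;
}
```

With NDEBUG defined the assert disappears, so this only documents and checks the precondition during development; it does not make an out-of-range argument safe in release builds.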

DimitriPapadopoulos · 2021-08-18T15:39:43Z

In any case, as you say, the json_char will eventually be cast to an int. So why cast to an unsigned char and then to an int, if the expected type is int?

Additionally, json_char is a char by default, so why cast from signed to unsigned and then back to signed?

LB-- · 2021-08-19T19:17:23Z

As I understand it, the point is to make sure that the values are in the range of an unsigned char (e.g. 0 to 255). If just casting directly from char or signed char to int or signed int, the value may be negative, and I believe that could result in undefined behavior according to the specification of isdigit.
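The concern described above can be sketched as follows. This is an illustration only, with hypothetical helper names (is_digit_char, count_digits); it assumes input stored in plain char, which may be a signed type, so bytes of a multi-byte UTF-8 sequence can appear as negative values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Convert to unsigned char first, so that a byte such as 0xC3 reaches
 * isdigit() as 195 rather than as a negative int on platforms where
 * plain char is signed. */
static int is_digit_char(char c)
{
    return isdigit((unsigned char)c) != 0;
}

/* Count the ASCII digits in a NUL-terminated string that may also
 * contain non-ASCII bytes. */
static size_t count_digits(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (is_digit_char(*s))
            ++n;
    }
    return n;
}
```

For example, count_digits("a1\xC3\xA9" "2") visits the two bytes of "é" without ever passing a negative value to isdigit().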


DimitriPapadopoulos · 2021-08-19T20:37:31Z

I see. OK then, but then I feel that cast should happen sooner.

This also raises the question of json_char and the structure of JSON data. What sort of JSON data does the library read? According to RFC8259:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

What is the point of json_char instead of a simple char? Isn't char enough to handle UTF-8?
We need to add real UTF-8 examples for testing, all of them contain simple ASCII as far as I can see.
Would unsigned char be better than char for UTF-8 and generally speaking non-ASCII encodings? I'll look for some info on the subject.

LB-- · 2021-08-19T21:34:48Z

I feel that cast should happen sooner.

I had that feeling too, we can certainly do that instead.

What is the point of json_char instead of a simple char? Isn't char enough to handle UTF-8?

It allows C++20 projects to use char8_t in theory. It's not currently tested or confirmed to work, but it should be simple to support. It also allows people to change or specify the signedness of the type if they need to.

We need to add real UTF-8 examples for testing, all of them contain simple ASCII as far as I can see.

I agree, I'm hoping to add many more test cases and get the tests running as part of GitHub Actions.

Would unsigned char be better than char for UTF-8 and generally speaking non-ASCII encodings? I'll look for some info on the subject.

I actually don't know. char, signed char, unsigned char, and char8_t are all distinct types, so depending on the user's environment, any choice we make could force users to perform annoying casts back and forth, hence the customization point.

DimitriPapadopoulos · 2021-08-20T09:51:31Z

It allows C++20 projects to use char8_t in theory. It's not currently tested or confirmed to work, but it should be simple to support. It also allows people to change or specify the signedness of the type if they need to.

I'll investigate how non-ASCII UTF-8 text is usually handled in C. I would recommend using that single type, at least internally. Indeed, if we let end-users define json_char to exotic types and use that type internally, unexpected issues may arise, no matter how much we test (even with all of char, signed char, unsigned char, and char8_t). Additionally, it's hard to print an unknown type in C, so we should use a known type internally, as much as possible.

I'm not fundamentally against using json_char in the API. We just need to cast the API character type to the internal character type, in each function. However, when casting, we need to make sure casting is possible without side effects. As you wrote, the main issue here is signedness. Nowadays, types of identical signedness should cast implicitly without problem, even on exotic embedded platforms.
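One way to picture "cast the API character type to the internal character type, in each function" is the sketch below. It is not taken from json-parser: json_byte and count_non_ascii are hypothetical names, and json_char is shown as plain char only because the thread says that is the default.

```c
#include <assert.h>
#include <stddef.h>

typedef char json_char;          /* API type; char by default per the thread */
typedef unsigned char json_byte; /* hypothetical internal type */

/* Count bytes outside the ASCII range. The conversion to the internal
 * unsigned type happens once, at the function boundary, so every later
 * comparison works on values in 0..255 regardless of the signedness of
 * json_char. */
static size_t count_non_ascii(const json_char *text, size_t length)
{
    const json_byte *bytes = (const json_byte *)text;
    size_t i, n = 0;

    assert(text != NULL);
    for (i = 0; i < length; ++i) {
        if (bytes[i] >= 0x80)
            ++n;
    }
    return n;
}
```

Reading a char buffer through an unsigned char pointer is permitted for character types in C, so the pointer cast itself has no side effects; only the interpretation of each byte changes.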

DimitriPapadopoulos · 2021-08-20T10:18:39Z

It looks like we could use unsigned char internally, or char8_t when available:

char8_t: A type for UTF-8 characters and strings

We just need to use plain char for error messages or cast back to char anything else passed to strcpy() or printf().

Given the above, the (unsigned char) cast in #49 and #57 is retrospectively a very good (intermediate?) solution. I will keep digging and perhaps propose something different, in line with the above, in a different PR.
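The point about casting back to plain char for printf() and similar functions could look like the sketch below. The function name describe_token and the message text are hypothetical; the only assumption carried over from the thread is that token bytes are held as unsigned char internally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format an error message for a token stored internally as unsigned
 * char. The standard I/O and string functions take plain char pointers,
 * so the cast back to char happens only at this boundary. */
static int describe_token(char *out, size_t out_size,
                          const unsigned char *token)
{
    assert(out != NULL && token != NULL);
    return snprintf(out, out_size, "unexpected token '%s'",
                    (const char *)token);
}
```

A caller would pass a char buffer and its size, for example describe_token(buf, sizeof buf, token), and then hand buf to whatever reports the error.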

LB-- · 2021-10-30T23:36:28Z

Thanks for conducting research on this topic, I agree with your assessment here.

This was referenced Aug 18, 2021

Portability issue on pebble #49

Closed

Fix Pebble compilation (#49) #57

Merged

isdigit() expects an int, not an unsigned char

0e38668

Rework json-parser#57.

DimitriPapadopoulos force-pushed the pebble branch from 4c20f6d to 0e38668 Compare August 19, 2021 20:18

DimitriPapadopoulos closed this Aug 20, 2021

DimitriPapadopoulos deleted the pebble branch August 20, 2021 10:18
