
isdigit() expects an int, not an unsigned char #120


Closed


DimitriPapadopoulos
Contributor

Rework #57.

This was referenced Aug 18, 2021
@LB--
Member

LB-- commented Aug 18, 2021

I do not understand this change. unsigned char is always implicitly convertible to int, and the documentation for isdigit explicitly states that the value passed to the function must be representable as unsigned char or the behavior is undefined. Could you provide some more info on how the current code causes an issue that this PR fixes?

@DimitriPapadopoulos
Contributor Author

It's all about the value, not the representation. The signature of the function is clear: it expects an int. See the isdigit(3p) man page:

The c argument is an int, the value of which the application shall ensure is a character representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined.

Now, if the argument is not representable as an unsigned char or equal to the value of the macro EOF, it's probably because there's an error elsewhere in the program. Casting to unsigned char will silently propagate the error and result in undefined behaviour. If you fear the value may exceed that of an unsigned char, then I suggest:

assert(c <= UCHAR_MAX)

Individual characters are routinely represented as int in C programs, because all character handling functions operate on int.
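To make the precondition quoted from the man page concrete, here is a minimal sketch of an assertion-based check. The wrapper name `checked_isdigit` is hypothetical and not part of the library; unlike the one-line assert above, it also accepts EOF and rejects negative values, since the man page allows the former and excludes the latter.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>  /* EOF */

/* Hypothetical wrapper (an assumption for illustration, not library code):
   verify the documented precondition of isdigit() before calling it. */
static int checked_isdigit(int c)
{
    /* isdigit() is only defined for EOF or for values representable
       as unsigned char, i.e. 0..UCHAR_MAX. */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    return isdigit(c);
}
```

In a release build compiled with NDEBUG the assertion disappears, so this only catches out-of-range values during development.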

@DimitriPapadopoulos
Contributor Author

DimitriPapadopoulos commented Aug 18, 2021

In any case, as you say, the json_char will eventually be cast to an int. So why cast to an unsigned char and then to an int, if the expected type is int?

Additionally, json_char is a char by default, so why cast from signed to unsigned and then back to signed?

@LB--
Member

LB-- commented Aug 19, 2021

As I understand it, the point is to make sure that the values are in the range of an unsigned char (e.g. 0 to 255). If just casting directly from char or signed char to int or signed int, the value may be negative, and I believe that could result in undefined behavior according to the specification of isdigit.
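The concern about negative values can be sketched as follows. The helper name `is_ascii_digit` is hypothetical; the sketch assumes a platform where plain char may be signed, so a byte such as 0xE9 would become a negative int if passed to isdigit() directly.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper (an assumption for illustration, not library code):
   classify a character that may be stored in a signed char. */
static bool is_ascii_digit(char c)
{
    /* Convert to unsigned char first, so the int that reaches isdigit()
       is always in the range 0..UCHAR_MAX and never negative. */
    return isdigit((unsigned char)c) != 0;
}
```

Without the cast, a non-ASCII byte held in a signed char would be promoted to a negative int other than EOF, which is outside the range the isdigit() specification allows.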

@DimitriPapadopoulos
Contributor Author

I see. OK then, but then I feel that cast should happen sooner.

This also raises the question of json_char and the structure of JSON data. What sort of JSON data does the library read? According to RFC8259:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

  • What is the point of json_char instead of a simple char? Isn't char enough to handle UTF-8?
  • We need to add real UTF-8 examples for testing, all of them contain simple ASCII as far as I can see.
  • Would unsigned char be better than char for UTF-8 and generally speaking non-ASCII encodings? I'll look for some info on the subject.

@LB--
Member

LB-- commented Aug 19, 2021

I feel that cast should happen sooner.

I had that feeling too, we can certainly do that instead.

  • What is the point of json_char instead of a simple char? Isn't char enough to handle UTF-8?

It allows C++20 projects to use char8_t in theory. It's not currently tested or confirmed to work, but it should be simple to support. It also allows people to change or specify the signedness of the type if they need to.

  • We need to add real UTF-8 examples for testing, all of them contain simple ASCII as far as I can see.

I agree, I'm hoping to add many more test cases and get the tests running as part of GitHub Actions.

  • Would unsigned char be better than char for UTF-8 and generally speaking non-ASCII encodings? I'll look for some info on the subject.

I actually don't know. char, signed char, unsigned char, and char8_t are all distinct types, so depending on the user's environment, any choice we make could force users to perform annoying casts back and forth, hence the customization point.

@DimitriPapadopoulos
Contributor Author

It allows C++20 projects to use char8_t in theory. It's not currently tested or confirmed to work, but it should be simple to support. It also allows people to change or specify the signedness of the type if they need to.

I'll investigate how non-ASCII UTF-8 text is usually handled in C. I would recommend using that single type, at least internally. Indeed, if we let end-users define json_char to exotic types and use that type internally, unexpected issues may arise, no matter how much we test (even with all of char, signed char, unsigned char, and char8_t). Additionally, it's hard to print an unknown type in C, so we should use a known type internally, as much as possible.

I'm not fundamentally against using json_char in the API. We just need to cast the API character type to the internal character type, in each function. However, when casting, we need to make sure casting is possible without side effects. As you wrote, the main issue here is signedness. Nowadays, types of identical signedness should cast implicitly without problem, even on exotic embedded platforms.
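The idea of converting the API character type to one internal type at the function boundary could look roughly like this. It is only a sketch: the `json_char` fallback definition assumes the default of plain char mentioned earlier in this thread, and `json_internal_char` and `json_to_internal` are hypothetical names.

```c
/* Assumption: json_char is a user-overridable macro defaulting to plain char. */
#ifndef json_char
#define json_char char
#endif

/* Hypothetical internal character type, one fixed type used everywhere inside. */
typedef unsigned char json_internal_char;

/* Hypothetical boundary helper: convert one API character to the internal type.
   Conversion to an unsigned type is well defined in C (the value is reduced
   modulo UCHAR_MAX + 1), so a negative char holding a UTF-8 byte such as 0xC3
   maps to the same byte value in the range 128..255. */
static json_internal_char json_to_internal(json_char c)
{
    return (json_internal_char)c;
}
```

Doing the conversion once, where data enters the library, means the signedness question is settled in a single place rather than at every call to a character classification function.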

@DimitriPapadopoulos
Contributor Author

DimitriPapadopoulos commented Aug 20, 2021

It looks like we could use unsigned char internally, or char8_t when available:

We just need to use plain char for error messages or cast back to char anything else passed to strcpy() or printf().
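As a rough illustration of casting back to plain char for the standard string functions, consider the sketch below. The function `format_error` and the message text are hypothetical and not taken from the library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not library code):
   build an error message in a plain char buffer from an internal
   unsigned char string. Returns what snprintf() returns. */
static int format_error(char *out, size_t out_size, const unsigned char *token)
{
    /* unsigned char * and char * are distinct pointer types, so an explicit
       cast is needed before passing the string to printf-family functions. */
    return snprintf(out, out_size, "unexpected token: %s", (const char *)token);
}
```

The cast only changes the pointer type, not the bytes, so UTF-8 content passes through unchanged.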

Given the above, the (unsigned char) cast in #49 and #57 is retrospectively a very good (intermediate?) solution. I will keep digging and perhaps propose something different, in line with the above, in a different PR.

@LB--
Member

LB-- commented Oct 30, 2021

Thanks for conducting research on this topic, I agree with your assessment here.
