bpo-29803: remove some redundant ops in unicodeobject.c #660

Merged
merged 4 commits into from
Feb 13, 2018

zhangyangyu
Member

@zhangyangyu zhangyangyu commented Mar 13, 2017

@zhangyangyu zhangyangyu added the type-feature A feature request or enhancement label Mar 13, 2017
        if (unicode_decode_call_errorhandler_writer(
                errors, &errorHandler,
                "unicodeescape", message,
                &starts, &end, &startinpos, &endinpos, &exc, &s,
                &writer)) {
            goto onError;
        }
        if (_PyUnicodeWriter_Prepare(&writer, writer.min_length, 127) < 0) {
            goto onError;
Member Author

The necessary widening has already been done in unicode_decode_call_errorhandler_writer, so I don't think this call is required.

Member Author

And how about this? How could it lead to a crash?

Member

This is even more important. Since WRITE_CHAR doesn't check the size of the output buffer, we need to allocate the space for writer.min_length = end - s + writer.pos characters past the last written character.
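
The point about unchecked writes can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_writer`, `toy_prepare`, `toy_write_char` and `toy_copy` are hypothetical names invented for illustration, not the CPython `_PyUnicodeWriter` API, and `toy_prepare` is assumed to take the number of extra characters needed past the current position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a writer: only the fields discussed here. */
typedef struct {
    char *buf;    /* output buffer */
    size_t size;  /* allocated capacity, in characters */
    size_t pos;   /* number of characters written so far */
} toy_writer;

/* Make sure `extra` more characters fit after pos.
   Returns 0 on success, -1 if the allocation fails. */
static int toy_prepare(toy_writer *w, size_t extra)
{
    if (extra <= w->size - w->pos)
        return 0;                       /* already enough room */
    size_t new_size = w->pos + extra;
    char *p = realloc(w->buf, new_size);
    if (p == NULL)
        return -1;
    w->buf = p;
    w->size = new_size;
    return 0;
}

/* Unchecked write: the caller must have prepared the buffer beforehand.
   The assert only documents the invariant; it vanishes under NDEBUG. */
static void toy_write_char(toy_writer *w, char ch)
{
    assert(w->pos < w->size);
    w->buf[w->pos++] = ch;
}

/* Copy n characters: one prepare call up front, then unchecked writes. */
static int toy_copy(toy_writer *w, const char *s, size_t n)
{
    if (toy_prepare(w, n) < 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        toy_write_char(w, s[i]);
    return 0;
}
```

In this model, skipping the `toy_prepare` call is only safe if some earlier step is guaranteed to have reserved room for all the remaining input, which is the question the thread is arguing about.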

Member Author

Of course. But isn't the widening done in unicode_decode_call_errorhandler_writer? When the error handler generates only one character, we don't need more space, since we already have enough. And when the error handler generates more, unicode_decode_call_errorhandler_writer allocates the space for you. You made that change to unicode_decode_call_errorhandler_writer to avoid a crash.

Member

I don't remember all the details, but when I wrote this code the call to _PyUnicodeWriter_Prepare() was needed.

Maybe something has changed since then. I will examine the code in detail some time later. But for now I am not confident that removing this call is safe.

Member

Just add assert(writer.min_length <= writer.size - writer.pos) and see how Python crashes when you run the tests.

Member Author

I cannot understand. :-( writer.min_length is the minimum space needed here, and it is the minimum space _PyUnicodeWriter will allocate for you. And writer.size is the size actually allocated. So shouldn't the right assertion here be just assert(writer.min_length <= writer.size)? If you subtract writer.pos, you get the remaining space, and then the assertion should be assert(end - s <= writer.size - writer.pos).

Member

You are right. The right assert is assert(end - s <= writer.size - writer.pos). It seems _PyUnicodeWriter_Prepare() is used incorrectly here and in unicode_decode_call_errorhandler_writer(). And min_length may be used inconsistently in different decoders.
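
The three assertions discussed in this thread can be compared on concrete numbers. This is a hedged sketch: `writer_state` and the three helper functions are hypothetical names used for illustration, and the only relationship taken from the diff is min_length = end - s + writer.pos.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the writer fields mentioned in the thread. */
typedef struct {
    size_t size;        /* characters actually allocated */
    size_t pos;         /* characters already written */
    size_t min_length;  /* total expected output length */
} writer_state;

/* assert(writer.min_length <= writer.size): total against total. */
static bool total_fits(const writer_state *w)
{
    return w->min_length <= w->size;
}

/* assert(end - s <= writer.size - writer.pos): remaining against remaining. */
static bool remaining_fits(const writer_state *w, size_t remaining_input)
{
    return remaining_input <= w->size - w->pos;
}

/* assert(writer.min_length <= writer.size - writer.pos): compares a total
   with a remainder, so it can fail even when the buffer is large enough. */
static bool mixed_check(const writer_state *w)
{
    return w->min_length <= w->size - w->pos;
}
```

With size = 10, pos = 6 and 4 input characters left, min_length is 4 + 6 = 10. The first two checks hold, while the mixed one fails (10 > 4) even though the remaining 4 characters fit exactly.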

Member Author

I just think it's unnecessary, but not an error.

Member

@serhiy-storchaka serhiy-storchaka left a comment

There is one legitimate fix of a copy-paste error, one questionable change, and a few incorrect removals.

@@ -3922,10 +3922,6 @@ PyUnicode_FSDecoder(PyObject* arg, void* addr)
     }

     if (PyUnicode_Check(path)) {
         if (PyUnicode_READY(path) == -1) {
Member

Why is this removed?

Member Author

It's not an error, but it's not required either. After the if ... else ... there is a code path that readies it.

Member

Agree.

@@ -6086,17 +6082,13 @@ _PyUnicode_DecodeUnicodeEscape(const char *s,

      error:
        endinpos = s-starts;
        writer.min_length = end - s + writer.pos;
Member

This code is needed. See also comments on Rietveld: https://bugs.python.org/review/16334/diff/17685/Objects/unicodeobject.c.

Member Author

Oh, it was already discussed. I know this may avoid unnecessary reallocation. But honestly, I doubt how useful it is.

@@ -6439,7 +6427,7 @@ PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
         if (ch < 0x100) {
             *p++ = (char) ch;
         }
-        /* U+0000-U+00ff range: Map 16-bit characters to '\uHHHH' */
+        /* U+0100-U+ffff range: Map 16-bit characters to '\uHHHH' */
Member

LGTM.
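
The ranges named in the corrected comment can be seen in a small standalone sketch. `raw_escape` is a hypothetical helper written for illustration, not the CPython function. The first two branches follow the hunk above; the `\UHHHHHHHH` branch for code points above U+ffff is not shown in the hunk and is included as an assumption about how the encoder handles the rest of the range.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes the raw-unicode-escape form of ch into out,
   which must hold at least 11 bytes. Returns the number of bytes written,
   not counting the terminating NUL. */
static size_t raw_escape(uint32_t ch, char *out)
{
    if (ch < 0x100) {
        /* U+0000-U+00ff: copied through as a single byte */
        out[0] = (char)ch;
        out[1] = '\0';
        return 1;
    }
    if (ch < 0x10000) {
        /* U+0100-U+ffff: mapped to '\uHHHH' */
        return (size_t)snprintf(out, 11, "\\u%04x", (unsigned)ch);
    }
    /* Above U+ffff: mapped to '\UHHHHHHHH' (assumption, not in the hunk) */
    return (size_t)snprintf(out, 11, "\\U%08x", (unsigned)ch);
}
```

Since the `ch < 0x100` case is handled before the comment is reached, the comment can only describe U+0100 and above, which is why the old U+0000-U+00ff wording was a copy-paste error.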

Member

@vstinner vstinner left a comment

You removed the code that updates writer.min_length and calls _PyUnicodeWriter_Prepare(). I don't think this change is correct. I wrote this code a long time ago and I don't recall the rationale, but it was carefully written for best performance and also for correctness. If the buffer is too small, you create a buffer overflow...

@zhangyangyu
Member Author

I admit updating writer.min_length has its effect. It's something I am gonna restore. But I still don't understand why buffer overflow could happen. I'll wait for Serhiy's comment as an answer.

@zhangyangyu zhangyangyu dismissed stale reviews from serhiy-storchaka and vstinner March 31, 2017 03:08

request another round

@zhangyangyu
Member Author

I dismissed your reviews to request another round. No offense.

@zhangyangyu
Member Author

@serhiy-storchaka, does this look correct now? Or am I still mixing things up?

@serhiy-storchaka serhiy-storchaka self-requested a review April 27, 2017 11:32
@serhiy-storchaka
Member

Created separate #5636 for incorrect use of _PyUnicodeWriter_Prepare(). After merging that PR and merging this PR with master the rest of it LGTM.

@bedevere-bot

@zhangyangyu: Please replace # with GH- in the commit message next time. Thanks!

@zhangyangyu zhangyangyu deleted the unicode-cleanup branch February 13, 2018 10:33
@zhangyangyu
Member Author

Thanks @serhiy-storchaka ! I was going to rebase it after getting home. :-)

Labels
skip news type-feature A feature request or enhancement