Conversation

@Manishearth
Member

The attached test was failing otherwise (credit @sunfishcode). We did not take into account the decomposition width of the next character when buffering it after emitting a combining grapheme joiner.

We should probably add fuzz targets for this stuff using cargo-fuzz.

@Manishearth Manishearth merged commit c243fee into master Nov 18, 2020
@Manishearth
Member Author

Merging since this could be a security issue, feel free to review ex post facto.

@Manishearth Manishearth deleted the streamsafe-reset branch November 18, 2020 03:44
if self.nonstarter_count + d.leading_nonstarters > MAX_NONSTARTERS {
self.buffer = Some(next_ch);
self.nonstarter_count = 0;
self.nonstarter_count += d.decomposition_len;
Contributor

hmm, the invariants around self.nonstarter_count aren't quite clear here. let's say we hit this condition where we want to emit a CGJ.

  1. we've emitted a bunch of characters that have only nonstarters, so L51 increments self.nonstarter_count, and we emit the characters at L57. self.nonstarter_count represents the number of consecutive nonstarters in previously emitted characters.
  2. we have a character next_ch with enough leading nonstarters to push us over MAX_NONSTARTERS. we buffer this character but also increment self.nonstarter_count by next_ch's full decomposition length. note that self.nonstarter_count > MAX_NONSTARTERS at this point.
  3. we emit a CGJ.
  4. next iteration, we notice we have a character buffered and return it immediately.
  5. we take the next character from the underlying iterator with self.nonstarter_count still exceeding MAX_NONSTARTERS.

note that we potentially never reset self.nonstarter_count back to below MAX_NONSTARTERS if our stream only has characters with nonstarters.
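to make the concern concrete, here is a minimal model of steps 1-5. it is a sketch, not the crate's code: `nonstarters` is a hypothetical toy classifier that treats only U+0301 as a nonstarter, the two `count_cgjs_*` functions are made-up names, and `MAX_NONSTARTERS = 30` is assumed from the Stream-Safe limit in UAX #15.

```rust
// Assumed limit, taken from the Stream-Safe Text Format in UAX #15.
const MAX_NONSTARTERS: usize = 30;

/// Hypothetical toy classifier: only U+0301 (combining acute accent) counts
/// as a nonstarter. The real crate uses Unicode data tables instead.
fn nonstarters(c: char) -> usize {
    if c == '\u{0301}' { 1 } else { 0 }
}

/// Models steps 1-5: the counter is incremented when a character is buffered
/// and is not reset when the CGJ is emitted.
fn count_cgjs_without_reset(input: &str) -> usize {
    let mut count = 0;
    let mut cgjs = 0;
    for c in input.chars() {
        let n = nonstarters(c);
        if n == 0 {
            count = 0; // a starter ends the run of nonstarters
            continue;
        }
        if count + n > MAX_NONSTARTERS {
            cgjs += 1;  // step 3: emit a CGJ
            count += n; // step 2: counter grows past the limit
            continue;   // step 4: buffered character is emitted, counter untouched
        }
        count += n;
    }
    cgjs
}

/// Same model, but the counter restarts from the buffered character.
fn count_cgjs_with_reset(input: &str) -> usize {
    let mut count = 0;
    let mut cgjs = 0;
    for c in input.chars() {
        let n = nonstarters(c);
        if n == 0 {
            count = 0;
            continue;
        }
        if count + n > MAX_NONSTARTERS {
            cgjs += 1;
            count = n; // only the character after the CGJ counts
            continue;
        }
        count += n;
    }
    cgjs
}
```

under these assumptions, a string of 100 copies of U+0301 yields 70 CGJs in the first model (one before every character after the 30th) and 3 in the second.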

here's an alternate structure of the code that should hopefully make the invariants clearer:

// Take a buffered character first and then fall back to the underlying iterator.
let next_ch = self.buffer.take().or_else(|| self.iter.next())?;
let d = classify_nonstarters(next_ch);
if self.nonstarter_count + d.leading_nonstarters > MAX_NONSTARTERS {
    // Put this character that'd put us over the limit back in the buffer.
    self.buffer = Some(next_ch);
    self.nonstarter_count = 0;
    return Some(COMBINING_GRAPHEME_JOINER);
}
// Update our counter of trailing nonstarters in the characters emitted so far.
if d.leading_nonstarters == d.decomposition_len {
    self.nonstarter_count += d.decomposition_len;
} else {
    self.nonstarter_count = d.trailing_nonstarters;
}
Some(next_ch)

the main difference here is that we reset the counter when we buffer a character to emit a CGJ, and then update it (L51) as normal once that buffered character is emitted on the next iteration. that way, the invariant of self.nonstarter_count as the number of trailing nonstarters in the stream of characters emitted so far can be maintained. the downside is that we're calling classify_nonstarters twice on buffered characters, but I think that should be okay.
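for anyone who wants to try this outside the crate, here is a self-contained sketch of the same structure wrapped in an iterator. the names `StreamSafeSketch` and `stream_safe_sketch` are made up for illustration, `classify_nonstarters` is replaced by a hypothetical toy that treats only U+0301 as a nonstarter, and `MAX_NONSTARTERS = 30` is assumed from UAX #15.

```rust
const MAX_NONSTARTERS: usize = 30; // assumed Stream-Safe limit from UAX #15
const COMBINING_GRAPHEME_JOINER: char = '\u{034F}';

struct Decomposition {
    leading_nonstarters: usize,
    trailing_nonstarters: usize,
    decomposition_len: usize,
}

/// Hypothetical toy stand-in for the crate's classifier: only U+0301 is a
/// nonstarter; every other character is treated as a single starter.
fn classify_nonstarters(c: char) -> Decomposition {
    let n = if c == '\u{0301}' { 1 } else { 0 };
    Decomposition {
        leading_nonstarters: n,
        trailing_nonstarters: n,
        decomposition_len: 1,
    }
}

/// Illustrative wrapper (not the crate's type) around any `char` iterator.
struct StreamSafeSketch<I> {
    iter: I,
    nonstarter_count: usize,
    buffer: Option<char>,
}

impl<I: Iterator<Item = char>> Iterator for StreamSafeSketch<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Take a buffered character first, then fall back to the input.
        let next_ch = self.buffer.take().or_else(|| self.iter.next())?;
        let d = classify_nonstarters(next_ch);
        if self.nonstarter_count + d.leading_nonstarters > MAX_NONSTARTERS {
            // Re-buffer the character and restart the count after the CGJ.
            self.buffer = Some(next_ch);
            self.nonstarter_count = 0;
            return Some(COMBINING_GRAPHEME_JOINER);
        }
        // Track trailing nonstarters in the characters emitted so far.
        if d.leading_nonstarters == d.decomposition_len {
            self.nonstarter_count += d.decomposition_len;
        } else {
            self.nonstarter_count = d.trailing_nonstarters;
        }
        Some(next_ch)
    }
}

/// Convenience helper for trying the sketch on a string.
fn stream_safe_sketch(s: &str) -> String {
    StreamSafeSketch { iter: s.chars(), nonstarter_count: 0, buffer: None }.collect()
}
```

with the toy classifier, 100 copies of U+0301 come out as 103 characters: a CGJ lands after every 30th nonstarter, and the counter never exceeds the limit.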

it's been a while since I've looked at this code so let me know if that makes sense. if it's right, the bug introduced here isn't that bad in that we'll just emit more CGJs than necessary.

Contributor

ah, I see this got fixed in #65!

setting self.nonstarter_count = d.decomposition_len is technically correct but a bit subtle. this relies on the fact that if a character's decomposition has leading nonstarters, its decomposition must be entirely nonstarters. I'll submit a PR to make this a bit clearer with my suggestions above.
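a small sketch of why the shortcut depends on that fact. the struct mirrors the field names used in the suggestion above, the function names are made up for illustration, and the "mixed" case in the usage note is purely hypothetical (per the comment above, no such decomposition is expected in Unicode data).

```rust
struct Decomposition {
    leading_nonstarters: usize,
    trailing_nonstarters: usize,
    decomposition_len: usize,
}

/// Counter after the buffered character, using the shortcut
/// `self.nonstarter_count = d.decomposition_len`.
fn shortcut_count(d: &Decomposition) -> usize {
    d.decomposition_len
}

/// Counter after the buffered character, using the explicit branch and
/// starting from a counter that was reset to zero.
fn explicit_count(d: &Decomposition) -> usize {
    if d.leading_nonstarters == d.decomposition_len {
        d.decomposition_len
    } else {
        d.trailing_nonstarters
    }
}

/// The assumed property: a decomposition with any leading nonstarters
/// consists entirely of nonstarters.
fn satisfies_assumption(d: &Decomposition) -> bool {
    d.leading_nonstarters == 0 || d.leading_nonstarters == d.decomposition_len
}
```

for a decomposition of two nonstarters both functions return 2. for a hypothetical decomposition of one nonstarter followed by one starter, the shortcut would return 2 while the explicit branch returns 0, which is why the shortcut is only correct as long as the assumption holds.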
