Correctly reset streamsafe iterator #62

Manishearth · 2020-11-18T03:43:31Z

The attached test was failing without this change (credit @sunfishcode). When buffering a character after emitting a combining grapheme joiner (CGJ), we did not take into account the decomposition width of that next character.

We should probably add fuzz targets for this stuff using cargo-fuzz.

Manishearth · 2020-11-18T03:44:00Z

Merging since this could be a security issue, feel free to review ex post facto.

sujayakar · 2020-11-30T01:51:19Z

src/stream_safe.rs

        if self.nonstarter_count + d.leading_nonstarters > MAX_NONSTARTERS {
            self.buffer = Some(next_ch);
-            self.nonstarter_count = 0;
+            self.nonstarter_count += d.decomposition_len;


hmm, the invariants around self.nonstarter_count aren't quite clear here. let's say we hit this condition where we want to emit a CGJ.

we've emitted a bunch of characters that have only nonstarters, so L51 increments self.nonstarter_count, and we emit the characters at L57. self.nonstarter_count represents the number of consecutive nonstarters in previously emitted characters.

we have a character next_ch with enough leading nonstarters to push us over MAX_NONSTARTERS. we buffer this character but also increment self.nonstarter_count by next_ch's full decomposition length. note that self.nonstarter_count > MAX_NONSTARTERS at this point.

we emit a CGJ.

next iteration, we notice we have a character buffered and return it immediately.

we take the next character from the underlying iterator with self.nonstarter_count still exceeding MAX_NONSTARTERS.

note that we potentially never reset self.nonstarter_count back to below MAX_NONSTARTERS if our stream only has characters with nonstarters.
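The steps above can be reproduced with a small toy model. This is only a sketch, not the crate's source: the limit of 30, the `Decomposition` struct, the simplified `classify_nonstarters` (which treats only U+0300..=U+036F minus the CGJ as nonstarters and pretends every character decomposes to itself), and the `StreamSafeAsMerged` and `count_cgj_as_merged` names are all assumptions made for illustration.

```rust
// Toy model of the flow walked through above; not the crate's actual source.
const MAX_NONSTARTERS: usize = 30; // assumed limit, following the Stream-Safe Text Format
const COMBINING_GRAPHEME_JOINER: char = '\u{034F}';

// Hypothetical stand-in for the classifier's result.
struct Decomposition {
    leading_nonstarters: usize,
    trailing_nonstarters: usize,
    decomposition_len: usize,
}

// Simplification: only U+0300..=U+036F (minus the CGJ) count as nonstarters,
// and every character is treated as decomposing to itself.
fn classify_nonstarters(c: char) -> Decomposition {
    let is_mark = ('\u{0300}'..='\u{036F}').contains(&c) && c != COMBINING_GRAPHEME_JOINER;
    let n = if is_mark { 1 } else { 0 };
    Decomposition { leading_nonstarters: n, trailing_nonstarters: n, decomposition_len: 1 }
}

struct StreamSafeAsMerged<I> {
    iter: I,
    nonstarter_count: usize,
    buffer: Option<char>,
}

impl<I: Iterator<Item = char>> Iterator for StreamSafeAsMerged<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // A buffered character is returned immediately; the counter is not touched.
        if let Some(ch) = self.buffer.take() {
            return Some(ch);
        }
        let next_ch = self.iter.next()?;
        let d = classify_nonstarters(next_ch);
        if self.nonstarter_count + d.leading_nonstarters > MAX_NONSTARTERS {
            self.buffer = Some(next_ch);
            // The line from this diff: the counter moves past the limit and stays there.
            self.nonstarter_count += d.decomposition_len;
            return Some(COMBINING_GRAPHEME_JOINER);
        }
        if d.leading_nonstarters == d.decomposition_len {
            self.nonstarter_count += d.decomposition_len;
        } else {
            self.nonstarter_count = d.trailing_nonstarters;
        }
        Some(next_ch)
    }
}

// Counts how many CGJs the toy iterator inserts into `input`.
fn count_cgj_as_merged(input: &str) -> usize {
    let it = StreamSafeAsMerged { iter: input.chars(), nonstarter_count: 0, buffer: None };
    it.filter(|&c| c == COMBINING_GRAPHEME_JOINER).count()
}
```

In this model, calling `count_cgj_as_merged` on a string of 100 copies of U+0301 yields 70: once the counter passes the limit, every remaining mark gets its own CGJ, whereas one CGJ per 30 marks (3 in total) would be enough.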

here's an alternate structure of the code that should hopefully make the invariants clearer:

    // Take a buffered character first and then fall back to the underlying iterator.
    let next_ch = self.buffer.take().or_else(|| self.iter.next())?;
    let d = classify_nonstarters(next_ch);
    if self.nonstarter_count + d.leading_nonstarters > MAX_NONSTARTERS {
        // Put this character that'd put us over the limit back in the buffer.
        self.buffer = Some(next_ch);
        self.nonstarter_count = 0;
        return Some(COMBINING_GRAPHEME_JOINER);
    }
    // Update our counter of trailing nonstarters in the characters emitted so far.
    if d.leading_nonstarters == d.decomposition_len {
        self.nonstarter_count += d.decomposition_len;
    } else {
        self.nonstarter_count = d.trailing_nonstarters;
    }
    Some(next_ch)

the main difference here is that we're updating our counter (L51) as normal when we buffer a character to emit a CGJ. that way, the invariant of self.nonstarter_count as number of trailing nonstarters in the stream of characters emitted so far can be maintained. the downside is that we're calling classify_nonstarters twice on buffered characters, but I think that should be okay.
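To see the suggested structure run end to end, here is a minimal self-contained sketch. It is not the crate's code: the limit of 30, the `Decomposition` struct, the toy `classify_nonstarters` (only U+0300..=U+036F minus the CGJ count as nonstarters, and every character is treated as its own decomposition), and the `StreamSafeSketch` and `stream_safe_sketch` names are assumptions for illustration.

```rust
const MAX_NONSTARTERS: usize = 30; // assumed limit, following the Stream-Safe Text Format
const COMBINING_GRAPHEME_JOINER: char = '\u{034F}';

// Hypothetical stand-in for the classifier's result.
struct Decomposition {
    leading_nonstarters: usize,
    trailing_nonstarters: usize,
    decomposition_len: usize,
}

// Toy classifier: a deliberate simplification, not real Unicode data.
fn classify_nonstarters(c: char) -> Decomposition {
    let is_mark = ('\u{0300}'..='\u{036F}').contains(&c) && c != COMBINING_GRAPHEME_JOINER;
    let n = if is_mark { 1 } else { 0 };
    Decomposition { leading_nonstarters: n, trailing_nonstarters: n, decomposition_len: 1 }
}

struct StreamSafeSketch<I> {
    iter: I,
    // Invariant: number of trailing nonstarters in the characters emitted so far.
    nonstarter_count: usize,
    buffer: Option<char>,
}

impl<I: Iterator<Item = char>> Iterator for StreamSafeSketch<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Buffered characters go through the same classification path as fresh ones.
        let next_ch = self.buffer.take().or_else(|| self.iter.next())?;
        let d = classify_nonstarters(next_ch);
        if self.nonstarter_count + d.leading_nonstarters > MAX_NONSTARTERS {
            self.buffer = Some(next_ch);
            // The CGJ we are about to emit is a starter, so the run restarts at zero.
            self.nonstarter_count = 0;
            return Some(COMBINING_GRAPHEME_JOINER);
        }
        if d.leading_nonstarters == d.decomposition_len {
            self.nonstarter_count += d.decomposition_len;
        } else {
            self.nonstarter_count = d.trailing_nonstarters;
        }
        Some(next_ch)
    }
}

// Convenience wrapper that runs the sketch over a whole string.
fn stream_safe_sketch(input: &str) -> String {
    StreamSafeSketch { iter: input.chars(), nonstarter_count: 0, buffer: None }.collect()
}
```

Under these assumptions, `stream_safe_sketch` applied to `a` followed by 35 copies of U+0301 gives `a`, 30 marks, one CGJ, then the remaining 5 marks. A string of 100 marks gets exactly 3 CGJs, one after each full run of 30.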

it's been a while since I've looked at this code so let me know if that makes sense. if it's right, the bug introduced here isn't that bad in that we'll just emit more CGJs than necessary.

ah, I see this got fixed in #65!

setting self.nonstarter_count = d.decomposition_len is technically correct but a bit subtle. this relies on the fact that if a character's decomposition has leading nonstarters, its decomposition must be entirely nonstarters. I'll submit a PR to make this a bit clearer with my suggestions above.

Manishearth added 2 commits November 17, 2020 19:39

Reset stream safe iterator to buffered character when outputting CGJ

0786dc0

Add test for streamsafe iterator

a558091

Manishearth requested a review from sujayakar November 18, 2020 03:43

Manishearth merged commit c243fee into master Nov 18, 2020

Manishearth deleted the streamsafe-reset branch November 18, 2020 03:44

sujayakar reviewed Nov 30, 2020


sujayakar mentioned this pull request Nov 30, 2020

Update nonstarter_count correctly + add test for all nonstarters string #68

Merged

3 participants
