fold: fix GNU test fold-zero-width.sh #9274
…scii_line Implement logic to increment column count in WidthMode::Characters, emitting output when width is reached. This ensures accurate line folding for multi-byte characters, enhancing Unicode support.
GNU testsuite comparison:
- Added conditional check in fold_file function to call emit_output when col_count >= width
- Ensures lines are properly wrapped based on byte or character width before final output flush
- Improves handling of incomplete lines that need early breaking to respect the specified width
CodSpeed Performance Report
Merging #9274 will improve performance by 50.04%.
In character width mode, emit output immediately after segments are added if column count exceeds width, preventing redundant flushes. Simplify the file folding logic by removing unnecessary conditional checks at the end, ensuring clean output writing. This fixes potential issues with extra line breaks or incorrect folding behavior.
…ability Refactor code in fold.rs to break lengthy if-condition statements across multiple lines in push_ascii_segment, process_utf8_line, and process_non_utf8_line functions. This improves code readability without changing functionality.
…ory usage Introduce a STREAMING_FLUSH_THRESHOLD constant and helper functions (maybe_flush_unbroken_output, push_byte, push_bytes) to periodically flush the output buffer when it exceeds 8KB and no spaces are being tracked, preventing excessive memory consumption when processing large files. This refactor replaces direct buffer pushes with checks for threshold-based flushing.
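The threshold-based flushing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in fold.rs: `SketchCtx` is a hypothetical, simplified stand-in for the real `FoldContext`, and the exact comparison against the threshold is an assumption based on the wording "exceeds 8KB".

```rust
use std::io::{self, Write};

/// Flush threshold named in the commit message (8 KiB).
const STREAMING_FLUSH_THRESHOLD: usize = 8 * 1024;

/// Hypothetical, simplified stand-in for the real fold context.
struct SketchCtx<W: Write> {
    writer: W,
    output: Vec<u8>,
    /// Index of the last whitespace byte in `output`, if one is tracked.
    last_space: Option<usize>,
}

/// Write out the buffered bytes once the buffer is large and no potential
/// break point is being tracked, so memory use stays bounded.
fn maybe_flush_unbroken_output<W: Write>(ctx: &mut SketchCtx<W>) -> io::Result<()> {
    if ctx.last_space.is_none() && ctx.output.len() > STREAMING_FLUSH_THRESHOLD {
        ctx.writer.write_all(&ctx.output)?;
        ctx.output.clear();
    }
    Ok(())
}

/// Append a slice, then check the threshold once for the whole slice.
fn push_bytes<W: Write>(ctx: &mut SketchCtx<W>, bytes: &[u8]) -> io::Result<()> {
    ctx.output.extend_from_slice(bytes);
    maybe_flush_unbroken_output(ctx)
}
```

While a whitespace position is tracked, the buffer is left alone, because a later break may still need to split the buffered bytes at that position.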
Could you please add tests?
and please fix this regression:
…d tests Remove conditional checks that incorrectly emitted output when column count reached width in character mode, ensuring proper folding of wide characters and handling of edge cases. Add comprehensive tests for wide characters, invalid UTF-8, zero-width spaces, and buffer boundaries to verify correct behavior. This prevents issues with multi-byte character folding where output was prematurely flushed, improving accuracy for Unicode input.
- Remove trailing empty lines in fold.rs
- Compact multiline variable assignments in test_fold.rs for readability
…racters Add unicode-width crate to handle zero-width Unicode characters in fold utility. Introduced new test 'test_zero_width_data_line_counts' to verify correct wrapping in --characters mode for zero-width bytes and spaces, ensuring fold behaves consistently with character counts rather than visual width.
- Add bytecount dependency to Cargo.toml and Cargo.lock
- Refactor newline_count function in test_fold.rs to use bytecount::count instead of manual iteration for better performance
Modify the fold implementation to process input in buffered chunks rather than line-by-line reading, ensuring correct handling of multi-byte characters split across buffer boundaries. Add process_pending_chunk function and new streaming logic to fold_file for better performance on large files. Update tests accordingly.
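Handling a multi-byte character that is split across two buffered chunks might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's `process_pending_chunk`: `drain_pending` is a hypothetical name, the output is collected into a `Vec<String>` for demonstration, and invalid bytes are converted lossily here, whereas the PR routes them through `process_non_utf8_line`.

```rust
use std::str;

/// Consume every complete piece of `pending`, pushing it to `out`.
/// An incomplete trailing UTF-8 sequence is left in `pending` so that the
/// next chunk read from the input can complete it.
fn drain_pending(pending: &mut Vec<u8>, out: &mut Vec<String>) {
    while !pending.is_empty() {
        match str::from_utf8(&pending[..]) {
            Ok(valid) => {
                out.push(valid.to_owned());
                pending.clear();
            }
            Err(err) => {
                let valid_up_to = err.valid_up_to();
                if valid_up_to > 0 {
                    // Emit the valid prefix first, then look at the rest again.
                    let prefix = String::from_utf8_lossy(&pending[..valid_up_to]).into_owned();
                    out.push(prefix);
                    pending.drain(..valid_up_to);
                    continue;
                }
                match err.error_len() {
                    // Definitely invalid bytes: emit them (lossily, in this sketch).
                    Some(len) => {
                        out.push(String::from_utf8_lossy(&pending[..len]).into_owned());
                        pending.drain(..len);
                    }
                    // Incomplete sequence at the end: wait for more input.
                    None => break,
                }
            }
        }
    }
}
```

The key point is the `error_len()` check: `None` means the bytes could still become valid once more input arrives, so they must stay buffered rather than be treated as invalid.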
Replace loop with early empty check by a while loop conditional on !pending.is_empty() for clarity. Restructure invalid UTF-8 error handling to first check if valid_up_to == 0, then process the valid prefix, improving code readability and flow without changing behavior.
Consolidate the assignment of the `valid` variable from multiple lines to a single line for improved code readability and adherence to style guidelines favoring concise declarations.
done
I just saw that #9328 was succeeding; with both of these together, all of the fold tests will pass.
… mode Only coalesce zero-width combining characters into base characters when folding by display columns (WidthMode::Columns). In character-counting mode, treat every scalar value as advancing the counter to match chars().count() semantics, preventing incorrect line breaking for characters with zero-width marks. This ensures consistent behavior across modes as verified by existing tests.
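The difference between the two modes can be illustrated with a small sketch. It assumes a hypothetical `sketch_width` function that treats only a few code points as zero-width; the PR itself uses the unicode-width crate for this, and `advance` and `measure` are illustrative names rather than functions from fold.rs.

```rust
#[derive(Clone, Copy, PartialEq)]
enum WidthMode {
    Columns,
    Characters,
}

/// Hypothetical stand-in for a real display-width lookup: combining
/// diacritical marks and ZERO WIDTH SPACE count as 0, everything else as 1.
fn sketch_width(ch: char) -> usize {
    match ch {
        '\u{0300}'..='\u{036F}' | '\u{200B}' => 0,
        _ => 1,
    }
}

/// How far one scalar value advances the counter in each mode.
fn advance(mode: WidthMode, ch: char) -> usize {
    match mode {
        // Display columns: zero-width marks attach to the preceding character.
        WidthMode::Columns => sketch_width(ch),
        // Character counting: every scalar value counts, like chars().count().
        WidthMode::Characters => 1,
    }
}

/// Total counter value for a whole string in the given mode.
fn measure(mode: WidthMode, s: &str) -> usize {
    s.chars().map(|ch| advance(mode, ch)).sum()
}
```

For the input `"e\u{0301}"` (an `e` followed by a combining acute accent), this sketch measures 1 in column mode and 2 in character mode.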
@sylvestre I added tests, and the GNU coreutils tests now pass.
    Ok(())
}

fn maybe_flush_unbroken_output<W: Write>(ctx: &mut FoldContext<'_, W>) -> UResult<()> {
please add some comments to explain what these functions are doing
add
const NL: u8 = b'\n';
const CR: u8 = b'\r';
const TAB: u8 = b'\t';
const STREAMING_FLUSH_THRESHOLD: usize = 8 * 1024;
please document this magic number
add
…vent unbounded buffering Add explanatory comments to constants and functions in fold.rs, detailing the 8 KiB threshold for flushing in streaming mode to avoid unbounded buffer growth, and clarifying line folding behavior with the -s option. Improves code readability without altering functionality.
let line_bytes = line.as_bytes();
let mut iter = line.char_indices().peekable();

while let Some((byte_idx, ch)) = iter.next() {
maybe move that into a function
fix
    Ok(())
}

fn process_pending_chunk<W: Write>(
please document this function
add
Err(err) => {
    if err.error_len().is_some() {
        process_non_utf8_line(pending, ctx)?;
        pending.clear();
this won't be executed if the previous line fails, no?
fix
…nk handling
- Extract UTF-8 character processing into a new `process_utf8_chars` function for better code organization
- Add documentation to `process_pending_chunk` explaining its behavior with buffered bytes and invalid UTF-8
- Modify `process_pending_chunk` to properly handle the result of `process_non_utf8_line`, ensuring errors are propagated after clearing the buffer
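The "clear first, then propagate" pattern raised in the review can be sketched as below. `process_bytes` is a hypothetical fallible step standing in for `process_non_utf8_line`, and the `String` error type is an assumption chosen to keep the example self-contained.

```rust
/// Hypothetical fallible step standing in for `process_non_utf8_line`:
/// it rejects input containing a NUL byte, purely for demonstration.
fn process_bytes(bytes: &[u8], sink: &mut Vec<u8>) -> Result<(), String> {
    if bytes.contains(&0) {
        return Err("NUL byte in input".to_string());
    }
    sink.extend_from_slice(bytes);
    Ok(())
}

/// Process the pending bytes and always clear the buffer afterwards.
fn process_then_clear(pending: &mut Vec<u8>, sink: &mut Vec<u8>) -> Result<(), String> {
    // Capture the result instead of applying `?` immediately, so that the
    // clear below runs even when processing fails.
    let result = process_bytes(&pending[..], sink);
    pending.clear();
    result
}
```

With `process_bytes(...)?; pending.clear();` an error would return early and leave stale bytes in the buffer, which is the concern the reviewer pointed out.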
… fold logic
- Introduce STREAMING_FLUSH_THRESHOLD constant (8 KiB) to prevent unbounded buffer growth during streaming, ensuring memory efficiency when input lacks fold points.
- Improve comments in emit_output, maybe_flush_unbroken_output, push_byte, and push_bytes functions for better clarity on folding behavior, whitespace handling, and buffer management.
- Adjust last_space index rebasing logic in emit_output to correctly track whitespace positions after partial consumption, maintaining accurate breaks with -s flag.
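Rebasing a tracked whitespace index after part of the buffer has been written out could look like the sketch below. This shows only the general idea and is not the actual logic in `emit_output`; `consume_prefix` is a hypothetical helper name.

```rust
/// Remove `consumed` bytes from the front of `output` and shift the tracked
/// whitespace index so that it still points at the same byte. If the tracked
/// whitespace was itself consumed, tracking is dropped.
fn consume_prefix(output: &mut Vec<u8>, last_space: &mut Option<usize>, consumed: usize) {
    output.drain(..consumed);
    *last_space = match *last_space {
        Some(idx) if idx >= consumed => Some(idx - consumed),
        _ => None,
    };
}
```

Without the subtraction, the stored index would refer to a position in the old buffer layout, and a later break with -s could split the line at the wrong byte.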
src/uu/fold/src/fold.rs
Outdated
fn push_byte<W: Write>(ctx: &mut FoldContext<'_, W>, byte: u8) -> UResult<()> {
    // Append a single byte to the buffer and flush if it grows too large.
    ctx.output.push(byte);
    maybe_flush_unbroken_output(ctx)
are you not flushing too often?
fix
src/uu/fold/src/fold.rs
Outdated
let mut output = Vec::new();
let mut col_count = 0;
let mut last_space = None;
let mut pending = Vec::new();
`pending` buffer could benefit from pre-allocation
fix
src/uu/fold/src/fold.rs
Outdated
    return Ok(());
}

if !ctx.output.is_empty() {
ctx.output.is_empty() is already guaranteed to be false by the condition at line 338, no?
It will definitely be true. I'll make the correction.
Updated push_byte to only append bytes to the buffer without triggering flush checks, removing the call to maybe_flush_unbroken_output and adjusting the comment to reflect the new behavior. This change simplifies the function's logic, potentially improving performance by deferring flushes to other parts of the code.
Remove unnecessary empty check in maybe_flush_unbroken_output to simplify logic and reduce overhead. Preallocate capacity for pending vector in fold_file to improve performance by minimizing reallocations.
related
#9127