Thanks to visit codestin.com
Credit goes to lib.rs

#text #unicode-text #unicode #word #grapheme

unmaintained unic-segment

UNIC — Unicode Text Segmentation Algorithms

3 releases (breaking)

0.9.0 Mar 3, 2019
0.8.0 Jan 2, 2019
0.7.0 Feb 7, 2018

#589 in Internationalization (i18n)

Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App Codestin Search App

661,712 downloads per month
Used in 68 crates (7 directly)

MIT/Apache

110KB
1.5K SLoC

UNIC — Unicode Text Segmentation Algorithms

A component of unic: Unicode and Internationalization Crates for Rust.

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting boundaries of text element boundaries, such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (last one not implemented yet).

Examples

assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);

assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

assert_eq!(
    GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, ""), (3, ""), (6, "ö̲"), (11, "\r\n")]
);

fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);

UNIC — Unicode Text Segmentation Algorithms

Crates.io Documentation

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting boundaries of text element boundaries, such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences.

Notes

Initial code for this component is based on unicode-segmentation.

Dependencies