Crate unic_segment
UNIC — Unicode Text Segmentation Algorithms
A component of unic: Unicode and Internationalization Crates for Rust.
This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting the boundaries of text elements such as user-perceived characters (a.k.a. grapheme clusters), words, and sentences (sentence segmentation is not implemented yet).
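To make "user-perceived character" concrete, here is a minimal std-only sketch. It is not the crate's algorithm: it only attaches code points from the Combining Diacritical Marks block (U+0300..=U+036F) to the preceding base character and ignores every other UAX #29 rule. The function name `naive_clusters` is hypothetical.

```rust
/// Hypothetical helper: splits `s` into naive clusters, where a cluster is a
/// base character followed by any combining diacritical marks.
/// This is a deliberate simplification and does NOT implement UAX #29.
fn naive_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    for (idx, ch) in s.char_indices() {
        let is_mark = ('\u{300}'..='\u{36F}').contains(&ch);
        // A non-mark character begins a new cluster (except at offset 0).
        if !is_mark && idx > start {
            clusters.push(&s[start..idx]);
            start = idx;
        }
    }
    // Flush the final cluster, if any.
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}

fn main() {
    let s = "a\u{310}e\u{301}o\u{308}\u{332}";
    let clusters = naive_clusters(s);
    // Seven `char`s, but three user-perceived characters.
    assert_eq!(s.chars().count(), 7);
    assert_eq!(clusters, ["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]);
}
```

Real text needs the full rule set (CRLF, regional indicators, Hangul syllables, and so on), which is what the iterators in this crate provide.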
Examples
use unic_segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};

assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);
assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);
assert_eq!(
    GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
);
// Filter that keeps only segments containing at least one alphanumeric character.
fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}
assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);
assert_eq!(
    // Note the two spaces before "fox": each one is reported as its own segment.
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);
assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);
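The offsets yielded by the indices iterators are byte offsets into the original `&str`, not `char` counts. The std-only sketch below copies the expected pairs from the last example and checks that each one slices back to its segment. The helper name `check_offsets` is hypothetical, and nothing here calls into `unic_segment`.

```rust
/// Hypothetical helper: returns true when every `(offset, segment)` pair
/// points at `segment` inside `text` and the pairs tile `text` without gaps.
fn check_offsets(text: &str, pairs: &[(usize, &str)]) -> bool {
    let mut expected = 0;
    for &(offset, segment) in pairs {
        // Each segment must start where the previous one ended, and slicing
        // `text` at that byte range must give back the segment itself.
        if offset != expected || text.get(offset..offset + segment.len()) != Some(segment) {
            return false;
        }
        expected = offset + segment.len();
    }
    expected == text.len()
}

fn main() {
    let text = "Brr, it's 29.3°F!";
    let pairs = [
        (0, "Brr"), (3, ","), (4, " "), (5, "it's"), (9, " "),
        (10, "29.3"), (14, "°"), (16, "F"), (17, "!"),
    ];
    assert!(check_offsets(text, &pairs));
    // "°" occupies two bytes in UTF-8, so "F" starts at byte 16, not 15.
    assert_eq!("°".len(), 2);
}
```

Because the offsets are byte positions, they can be used directly for slicing, but they should not be presented to users as character positions.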
Structs
GraphemeCursor: Cursor-based segmenter for grapheme clusters.
GraphemeIndices: External iterator for grapheme clusters and byte offsets.
Graphemes: External iterator for a string’s grapheme clusters.
WordBoundIndices: External iterator for word boundaries and byte offsets.
WordBounds: External iterator for a string’s word boundaries.
Words: An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number.
Enums
GraphemeIncomplete: An error return indicating that not enough content was available in the provided chunk to satisfy the query, and that more content must be provided.
Constants
PKG_DESCRIPTION: UNIC component description.
PKG_NAME: UNIC component name.
PKG_VERSION: UNIC component version.
UNICODE_VERSION: The Unicode version of data.