Releases: clipperhouse/uax29
v2
A new, simplified major version, with breaking changes.
- New methods FromString, FromReader and FromBytes, which are renames of NewSegmenter, NewScanner, etc. The thing you’re passing is more interesting than the iterator type.
- Removed Filter, Transform and All. These were convenience methods, not offering any particular value that a Go programmer could not do themselves. Focus on this library’s purpose.
- Kept Joiners, where I do believe we can do things better than the caller.
Requires Go 1.20 or later (though I could negotiate that).
Full Changelog: v1.16.0...v2.0.0
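As a rough illustration of why Filter and All could be dropped, the sketch below does both in a plain loop. It uses only the standard library. The tokenIterator interface, its Next/Value shape, sliceIterator, collect and hasLetter are hypothetical names made up for this example and are not the library's types.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenIterator is a hypothetical iterator shape assumed for this sketch.
type tokenIterator interface {
	Next() bool
	Value() string
}

// sliceIterator is a toy iterator over pre-split tokens, standing in for
// whatever a real segmenter would return.
type sliceIterator struct {
	tokens []string
	pos    int
}

func (it *sliceIterator) Next() bool {
	if it.pos >= len(it.tokens) {
		return false
	}
	it.pos++
	return true
}

func (it *sliceIterator) Value() string { return it.tokens[it.pos-1] }

// collect gathers every token for which keep returns true.
// A nil keep collects everything, like an All-style helper.
func collect(it tokenIterator, keep func(string) bool) []string {
	var out []string
	for it.Next() {
		if tok := it.Value(); keep == nil || keep(tok) {
			out = append(out, tok)
		}
	}
	return out
}

// hasLetter reports whether the token contains at least one letter.
func hasLetter(s string) bool {
	return strings.IndexFunc(s, unicode.IsLetter) >= 0
}

func main() {
	it := &sliceIterator{tokens: []string{"Hello", ",", " ", "world"}}
	fmt.Println(collect(it, hasLetter))
}
```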
v1.15.0
What's Changed
- Add range iterators for Go 1.23 by @clipperhouse in #28
- Joiners by @clipperhouse in #27
Full Changelog: v1.14.3...v1.15.0
v1.14.0
v1.13.0
What's Changed
- Update Unicode & Go versions by @clipperhouse in #24
- New optimizations by @clipperhouse in #25
Full Changelog: v1.12.5...v1.13.0
Transforms
Add ability to transform tokens, such as lowercasing, based on golang.org/x/text/transform.
The signature of SegmentAll() has changed, removing the variadic parameter at the end, a breaking change if you used that param. I’m not willing to go to v2 over that, hopefully not a problem. :)
Filters
Adds the ability to filter tokens. Use the Filter() method on Segmenter and Scanner.
Any func([]byte) bool can serve as a filter, with arbitrary logic. An example included filter is Wordlike, which removes whitespace and punctuation, returning only ‘words’ in the common sense.
See also the Contains() and Entirely() methods, which allow creation of filters based on Unicode categories.
New Segmenter
Introduces a new Segmenter type to tokenize []byte. Analogous to the existing Scanner, but does not require an io.Reader. Segmenter is maybe 10% faster, assuming you are operating on an existing []byte.
Also adds a SegmentAll([]byte) convenience method, if you’d like to do this as a one-liner and are not too concerned about allocations.
Handle more invalid UTF-8
Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.
There are two basic tests in each package, called TestInvalidUTF8 and TestRandomBytes. Those tests pass, returning the invalid bytes verbatim, without a guarantee as to how they will be segmented.
v1.6.5
Many performance improvements, over 2× faster than the v1.6.0 release.
v1.6.0
Replace range tables with tries, using the triegen package from x/text. Performance improvement looks to be 30-40%.
This is a breaking change for consumers that depended on those range tables, which were exported. For normal usage of calling the scanner, the API is unchanged.
This should be a new major version but I can’t be bothered with the v2 renaming, in terms of discoverability, etc. Hopefully OK for most users.