A new, simplified major version, with breaking changes.

New methods FromString, FromReader and FromBytes, which are renames of NewSegmenter, NewScanner, etc. The thing you’re passing is more interesting than the iterator type.
Removed Filter, Transform and All. These were convenience methods that offered nothing a Go programmer could not write themselves. Removing them keeps the focus on this library’s purpose.
Kept Joiners, where I do believe we can do things better than the caller.
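As a rough illustration of why All is easy to replace in caller code, here is a minimal sketch that collects tokens from a Next/Value style iterator. The `tokenIterator` interface and the whitespace-splitting `fieldIterator` are hypothetical stand-ins written for this example; they are not this library's types, and the method names are an assumption about the iterator shape.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenIterator is a hypothetical interface for this sketch.
// It assumes an iterator exposing Next and Value; it is not the library's type.
type tokenIterator interface {
	Next() bool
	Value() string
}

// fieldIterator is a stand-in that splits on whitespace.
// It is NOT a UAX #29 implementation; it only gives the sketch something to iterate.
type fieldIterator struct {
	fields []string
	pos    int
}

func newFieldIterator(s string) *fieldIterator {
	return &fieldIterator{fields: strings.Fields(s), pos: -1}
}

// Next advances to the next token and reports whether one exists.
func (it *fieldIterator) Next() bool {
	if it.pos+1 >= len(it.fields) {
		return false
	}
	it.pos++
	return true
}

// Value returns the current token. Call it only after Next has returned true.
func (it *fieldIterator) Value() string {
	return it.fields[it.pos]
}

// collect plays the role of a removed convenience method:
// it drains the iterator into a slice, at the cost of one allocation per growth.
func collect(it tokenIterator) []string {
	var out []string
	for it.Next() {
		out = append(out, it.Value())
	}
	return out
}

func main() {
	for _, tok := range collect(newFieldIterator("Hello, world! Nice dog.")) {
		fmt.Println(tok)
	}
}
```

The same loop shape extends to filtering or transforming: add an `if` or a function call inside the `for it.Next()` body.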

Requires Go 1.20 or later (though I could negotiate that).

Full Changelog: v1.16.0...v2.0.0

@clipperhouse

What's Changed

Add range iterators for Go 1.23 by @clipperhouse in #28
Joiners by @clipperhouse in #27

Full Changelog: v1.14.3...v1.15.0

@clipperhouse

What's Changed

New “phrases” package by @clipperhouse in #26

Full Changelog: v1.13.0...v1.14.0

@clipperhouse

What's Changed

Update Unicode & Go versions by @clipperhouse in #24
New optimizations by @clipperhouse in #25

Full Changelog: v1.12.5...v1.13.0

Add ability to transform tokens, such as lowercasing, based on golang.org/x/text/transform.

The signature of SegmentAll() has changed, removing the variadic parameter at the end, a breaking change if you used that param. I’m not willing to go to v2 over that, hopefully not a problem. :)

Adds the ability to filter tokens. Use the Filter() method on Segmenter and Scanner.

Any func([]byte) bool can serve as a filter, with arbitrary logic. One included filter is Wordlike, which removes whitespace and punctuation, returning only ‘words’ in the common sense.

See also the Contains() and Entirely() methods, which allow creation of filters based on Unicode categories.
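To make the func([]byte) bool shape concrete, here is a minimal sketch of a custom predicate and a helper that applies it. `isWordlike` and `keep` are hypothetical names for this example: the predicate only approximates the idea behind Wordlike (keep tokens containing a letter or digit) and is not the library's implementation, and the tokens are written by hand rather than produced by a segmenter.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isWordlike is a hypothetical predicate, not the library's Wordlike filter.
// It reports whether a token contains at least one letter or digit.
func isWordlike(token []byte) bool {
	for len(token) > 0 {
		r, size := utf8.DecodeRune(token)
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return true
		}
		token = token[size:]
	}
	return false
}

// keep returns the tokens for which pred reports true.
// Any func([]byte) bool can be passed as pred.
func keep(tokens [][]byte, pred func([]byte) bool) [][]byte {
	var out [][]byte
	for _, t := range tokens {
		if pred(t) {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Hand-written tokens for illustration; a real tokenizer would produce these.
	tokens := [][]byte{
		[]byte("Hello"), []byte(","), []byte(" "), []byte("world"), []byte("!"),
	}
	for _, t := range keep(tokens, isWordlike) {
		fmt.Printf("%q\n", t)
	}
}
```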

Introduces a new Segmenter type to tokenize []byte. Analogous to the existing Scanner, but does not require an io.Reader. Segmenter is maybe 10% faster, assuming you are operating on an existing []byte.

Also adds a SegmentAll([]byte) convenience method, if you’d like to do this as a one-liner and are not too concerned about allocations.

Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

There are two basic tests in each package, called TestInvalidUTF8 and TestRandomBytes. Those tests pass, returning the invalid bytes verbatim, without a guarantee as to how they will be segmented.
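Callers who want defined results can check or repair input before tokenizing. The following is a minimal sketch using only the standard library; `sanitize` is a hypothetical helper name, and replacing invalid bytes with U+FFFD is one possible policy, not something this library does for you.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// sanitize returns input unchanged when it is valid UTF-8.
// Otherwise it returns a copy in which each run of invalid bytes
// is replaced by a single U+FFFD replacement character.
func sanitize(input []byte) []byte {
	if utf8.Valid(input) {
		return input
	}
	return bytes.ToValidUTF8(input, []byte("\uFFFD"))
}

func main() {
	good := []byte("héllo wörld")
	bad := []byte("abc\xff\xfedef")

	fmt.Printf("%q\n", sanitize(good))
	fmt.Printf("%q\n", sanitize(bad))
}
```

Rejecting invalid input outright with `utf8.Valid` is an equally reasonable choice when silent repair is not acceptable.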

Many performance improvements, over 2× faster than the v1.6.0 release.

Replace range tables with tries, using the triegen package from x/text. Performance improvement looks to be 30-40%.

This is a breaking change for consumers that depended on those range tables, which were exported. For normal usage of calling the scanner, the API is unchanged.

This should be a new major version but I can’t be bothered with the v2 renaming, in terms of discoverability, etc. Hopefully OK for most users.

Releases: clipperhouse/uax29

v2

v1.15.0

v1.14.0

v1.13.0

Transforms

Filters

New Segmenter

Handle more invalid UTF-8

v1.6.5

v1.6.0
