@sjrd sjrd commented Dec 30, 2025

The Unicode database used by jl.Character.getType and the various jl.Character.isXYZ methods is now entirely generated by an sbt task.

The relevant methods are isolated in a dedicated object UnicodeData, to allow DCE without slowing down access to them.

More importantly, we use a new encoding that is more efficient in both code size and run time.

For run-time, we store each range as a pair (firstCP, data) in a single array. The data contain the type and a bit-field of properties that cannot directly be derived from the type. A single, efficient method dataHasAnyFlag can test for a set of types and a set of properties. For example, isAlphabetic returns true when either the type is any of the five major XYZ_LETTER types or LETTER_NUMBER, or when the property OtherAlphabetic is on.
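
As a rough illustration of the run-time lookup described above, here is a minimal sketch that assumes the pairs are flattened into one `Array[Int]` (even slots hold `firstCP`, odd slots hold the data) and that a binary search finds the range containing a code point. The object name, the `dataFor` helper, the slot layout and the data values are all hypothetical; the actual `UnicodeData` code may be organized differently.

```scala
object RangeLookupSketch {
  // Hypothetical data: (firstCP, data) pairs flattened into one array.
  // Even indices hold firstCP (sorted ascending), odd indices hold the data.
  // The data values are placeholders, not real Unicode data.
  val pairs: Array[Int] = Array(
    0x0000, 15,
    0x0041, 1,
    0x005B, 24,
    0x0061, 2,
    0x007B, 24
  )

  /** Returns the data of the last range whose firstCP is <= cp. */
  def dataFor(cp: Int): Int = {
    var lo = 0                      // index of a pair, not of an array slot
    var hi = pairs.length / 2 - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) >>> 1 // round up so that the loop terminates
      if (pairs(2 * mid) <= cp) lo = mid
      else hi = mid - 1
    }
    pairs(2 * lo + 1)
  }
}
```

Keeping both halves of each pair in the same array means a lookup touches a single object, which is one plausible reason for the single-array design.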

For code size, we compress each pair as a single Int. It contains the data (minus "computed" fields of the bit-field that are computed at load time) as well as a diff from the previous range's firstCP to the current one. Some diffs are too large to fit in the available bits, and are special-cased.

Since the logic relies heavily on bit hackery, we also generate exhaustive test data with straightforward logic.

sjrd added 3 commits December 30, 2025 12:30
This allows constant data arrays to be stored in `val`s without
negative impact.
That speeds up access to it. Since it is a constant array value,
it is dce'ed if unused.
@sjrd sjrd force-pushed the unicode-data branch 8 times, most recently from a368c8f to 90c2125 December 31, 2025 13:13
We now generate the Unicode database used by `jl.Character.getType`
and the various `jl.Character.isXYZ` methods in an sbt task.

The relevant methods are isolated in a dedicated object
`UnicodeData`, to allow DCE without slowing down access to them.

More importantly, we use a new encoding that is more efficient in
both code size and run time.

For run-time, we store each range as a clever bit field:

  firstCP (21) ++ props (5) ++ alternatingTypes flag (1) ++ type (5)

which, remarkably, adds up to 32 bits.
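
To make the field widths concrete, here is a minimal sketch of packing and unpacking such an entry. It assumes the fields are laid out from most significant to least significant bit in the order listed, which the text does not state explicitly; the object and method names are hypothetical.

```scala
object BitFieldSketch {
  // Assumed layout, most significant bits first:
  //   firstCP (21) | props (5) | alternatingTypes (1) | type (5)
  final val AltShift = 5
  final val PropsShift = 6
  final val FirstCPShift = 11

  def pack(firstCP: Int, props: Int, alternating: Boolean, tpe: Int): Int =
    (firstCP << FirstCPShift) | (props << PropsShift) |
    ((if (alternating) 1 else 0) << AltShift) | tpe

  // firstCP may set the sign bit, so the unsigned shift >>> is required.
  def firstCP(entry: Int): Int = entry >>> FirstCPShift
  def props(entry: Int): Int = (entry >>> PropsShift) & 0x1f
  def alternating(entry: Int): Boolean = ((entry >>> AltShift) & 1) != 0
  def tpe(entry: Int): Int = entry & 0x1f
}
```

For example, `BitFieldSketch.firstCP(BitFieldSketch.pack(0x10FFFF, 0, false, 0))` gives back `0x10FFFF`, even though the packed value is negative as a signed `Int`.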

There are 5 boolean properties that cannot be derived from the type
field, but that are used in conjunction with the type field to
implement some of the `isXYZ` methods. A single, efficient method
`dataHasAnyFlag` can test for a set of types and a set of properties.
For example, `isAlphabetic` returns `true` when either the type is
any of the five major `XYZ_LETTER` types or `LETTER_NUMBER`, or when
the property `OtherAlphabetic` is on.
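
One way such a combined test could work is sketched below: the type is turned into a one-hot bit and checked against a mask of accepted types, and the property bits are checked against a mask of accepted properties. The helper `hasAnyFlag`, its signature and the bit position chosen for `OtherAlphabetic` are assumptions for illustration, not the actual `dataHasAnyFlag`. The type constants use the same numeric values as the corresponding `java.lang.Character` constants.

```scala
object FlagTestSketch {
  // Same values as the java.lang.Character general category constants.
  final val UPPERCASE_LETTER = 1
  final val LOWERCASE_LETTER = 2
  final val TITLECASE_LETTER = 3
  final val MODIFIER_LETTER = 4
  final val OTHER_LETTER = 5
  final val LETTER_NUMBER = 10

  // Hypothetical bit position for the OtherAlphabetic property.
  final val OtherAlphabetic = 1 << 0

  // Set of types accepted by isAlphabetic, one bit per type value.
  final val AlphabeticTypes =
    (1 << UPPERCASE_LETTER) | (1 << LOWERCASE_LETTER) |
    (1 << TITLECASE_LETTER) | (1 << MODIFIER_LETTER) |
    (1 << OTHER_LETTER) | (1 << LETTER_NUMBER)

  /** True if tpe is in typeMask or if any property in propMask is set. */
  def hasAnyFlag(tpe: Int, props: Int, typeMask: Int, propMask: Int): Boolean =
    (((1 << tpe) & typeMask) | (props & propMask)) != 0

  def isAlphabetic(tpe: Int, props: Int): Boolean =
    hasAnyFlag(tpe, props, AlphabeticTypes, OtherAlphabetic)
}
```

Since the type field is 5 bits wide, `1 << tpe` always fits in an `Int`, so a set of types can be represented as a single mask.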

For code size, we further compress the `firstCP` field as a *diff*
from the previous range's `firstCP`. We expand the diffs in place in
the constructor of `UnicodeData`.
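
The in-place expansion can be sketched as a running sum over the array. This sketch assumes that the diff occupies the same top 21 bits that `firstCP` occupies after expansion, and that the first diff is relative to 0; the names and the handling of any special cases are hypothetical.

```scala
object DiffExpansionSketch {
  final val FirstCPShift = 11 // assumed: firstCP occupies the top 21 bits
  final val LowMask = (1 << FirstCPShift) - 1

  /** Replaces each stored diff by the absolute firstCP, in place. */
  def expandInPlace(entries: Array[Int]): Unit = {
    var prevCP = 0
    var i = 0
    while (i < entries.length) {
      val e = entries(i)
      prevCP += e >>> FirstCPShift // diff from the previous firstCP
      entries(i) = (prevCP << FirstCPShift) | (e & LowMask)
      i += 1
    }
  }
}
```

Small diffs produce literals with many zero high bits, which tend to encode more compactly than absolute code points; that is presumably where the code size gain comes from.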

Two `isXYZ` methods, namely `isIdeographic` and `isMirrored`, rely
only on a boolean property, and not on the type. Since those
properties are only weakly correlated with the ranges used for the
main data, they get their own independent array of ranges.
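
A range array for a purely boolean property can be very compact. The sketch below assumes one possible representation: a sorted array of code points at which the property flips, starting from `false` at code point 0. This representation, the names and the tiny data excerpt are assumptions for illustration; the text does not say how the independent arrays are actually encoded.

```scala
object BooleanRangesSketch {
  // Hypothetical excerpt: the property is true on [0x28, 0x2A),
  // [0x3C, 0x3D) and [0x3E, 0x3F), i.e. for '(', ')', '<' and '>'.
  val mirroredBoundaries: Array[Int] =
    Array(0x28, 0x2A, 0x3C, 0x3D, 0x3E, 0x3F)

  /** True if an odd number of boundaries are <= cp. */
  def contains(boundaries: Array[Int], cp: Int): Boolean = {
    val r = java.util.Arrays.binarySearch(boundaries, cp)
    // Found at index r: r + 1 boundaries are <= cp.
    // Not found: r == -(insertionPoint) - 1, and insertionPoint of them are.
    val count = if (r >= 0) r + 1 else -(r + 1)
    (count & 1) != 0
  }
}
```

With this representation no per-range data is needed at all, because the parity of the index already encodes the value of the property.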

Since the logic relies heavily on bit hackery, we also generate
exhaustive test data with straightforward logic.