WiP New encoding for the Unicode database. #5289

sjrd · 2025-12-30T15:42:01Z

The Unicode database used by jl.Character.getType and the various jl.Character.isXYZ methods is now entirely generated by an sbt task.

The relevant methods are isolated in a dedicated object UnicodeData, to allow DCE without slowing down access to them.

More importantly, we use a new encoding that is both more code size-efficient and run-time-efficient.

For run-time, we store each range as a pair (firstCP, data) in a single array. The data contain the type and a bit-field of properties that cannot directly be derived from the type. A single, efficient method dataHasAnyFlag can test for a set of types and a set of properties. For example, isAlphabetic returns true when either the type is any of the five major XYZ_LETTER types or LETTER_NUMBER, or when the property OtherAlphabetic is on.
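As an illustration of the combined type-and-property test, here is a minimal sketch. The bit layout of `data`, the signature of `dataHasAnyFlag`, and the helper names `makeData` and `isAlphabeticData` are assumptions made for this example, not the code in the PR. Only the category constants are known values, copied from `java.lang.Character`.

```scala
object DataFlagsSketch {
  // General category constants, same values as in java.lang.Character.
  final val UPPERCASE_LETTER = 1
  final val LOWERCASE_LETTER = 2
  final val TITLECASE_LETTER = 3
  final val MODIFIER_LETTER = 4
  final val OTHER_LETTER = 5
  final val NON_SPACING_MARK = 6
  final val DECIMAL_DIGIT_NUMBER = 9
  final val LETTER_NUMBER = 10

  // Assumed layout of `data`: type in bits 0-4, property flags from bit 5 up.
  final val TypeBits = 5
  final val TypeMask = (1 << TypeBits) - 1

  // Hypothetical position of the OtherAlphabetic flag within the property bits.
  final val OtherAlphabetic = 1 << 0

  def makeData(tpe: Int, props: Int): Int = (props << TypeBits) | tpe

  /** Tests whether the type of `data` belongs to `typeSet` (bit i set means
   *  type i is accepted) or whether any property in `propSet` is on.
   */
  def dataHasAnyFlag(data: Int, typeSet: Int, propSet: Int): Boolean =
    (((1 << (data & TypeMask)) & typeSet) | ((data >>> TypeBits) & propSet)) != 0

  // The five major XYZ_LETTER types plus LETTER_NUMBER, as a set of type bits.
  final val AlphabeticTypes =
    (1 << UPPERCASE_LETTER) | (1 << LOWERCASE_LETTER) | (1 << TITLECASE_LETTER) |
      (1 << MODIFIER_LETTER) | (1 << OTHER_LETTER) | (1 << LETTER_NUMBER)

  def isAlphabeticData(data: Int): Boolean =
    dataHasAnyFlag(data, AlphabeticTypes, OtherAlphabetic)
}
```

Turning the type into a one-hot bit lets a whole set of types be tested with a single `&`, which is one plausible way to keep the check branch-free.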

For code size, we compress each pair as a single Int. It contains the data (minus "computed" fields of the bit-field that are computed at load time) as well as a diff from the previous range's firstCP to the current one. Some diffs are too large to fit in the available bits, and are special-cased.
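The diff-based compression can be sketched as a running sum over the compressed entries. The 10-bit split between data and diff, and the names `compress` and `expand`, are assumptions for illustration. With that split every diff fits, so this sketch does not model the special-casing of oversized diffs or the fields computed at load time.

```scala
object DiffDecodingSketch {
  // Assumed split of a compressed Int: the low DataBits bits hold the data,
  // the remaining high bits hold the diff from the previous firstCP.
  final val DataBits = 10
  final val DataMask = (1 << DataBits) - 1

  /** Build-time side: turns sorted firstCPs and their data into one Int each. */
  def compress(firstCPs: Array[Int], data: Array[Int]): Array[Int] = {
    val out = new Array[Int](firstCPs.length)
    var prev = 0
    var i = 0
    while (i < firstCPs.length) {
      out(i) = ((firstCPs(i) - prev) << DataBits) | data(i)
      prev = firstCPs(i)
      i += 1
    }
    out
  }

  /** Load-time side: expands to interleaved (firstCP, data) pairs in one array. */
  def expand(compressed: Array[Int]): Array[Int] = {
    val out = new Array[Int](compressed.length * 2)
    var cp = 0
    var i = 0
    while (i < compressed.length) {
      val c = compressed(i)
      cp += c >>> DataBits // accumulate the diffs to recover firstCP
      out(2 * i) = cp
      out(2 * i + 1) = c & DataMask
      i += 1
    }
    out
  }
}
```

Small diffs produce small integers, which take fewer characters in the emitted code than the absolute code points would.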

Since the logic is bit hackery-heavy, we also generate exhaustive test data with straightforward logic.
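The idea of checking optimized lookup code against deliberately simple reference data can be sketched as follows. The range table is made up for the example (it is not real Unicode data), and `fastType`, `referenceTable` and `firstMismatch` are hypothetical names. In the PR the reference data is generated by the sbt task, whereas here it is derived from the same toy table purely to show the shape of the comparison.

```scala
object ExhaustiveCheckSketch {
  // Toy range table: firstCPs must start at 0 and be strictly increasing.
  val firstCPs: Array[Int] = Array(0x0, 0x41, 0x5B, 0x61, 0x7B)
  val types: Array[Int] = Array(0, 1, 0, 2, 0)

  /** Optimized path: binary search for the last range with firstCP <= cp. */
  def fastType(cp: Int): Int = {
    var lo = 0
    var hi = firstCPs.length - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) >>> 1
      if (firstCPs(mid) <= cp) lo = mid
      else hi = mid - 1
    }
    types(lo)
  }

  /** Reference path: one entry per code point, filled by a plain linear walk. */
  def referenceTable(limit: Int): Array[Int] = {
    val table = new Array[Int](limit)
    var range = 0
    var cp = 0
    while (cp < limit) {
      if (range + 1 < firstCPs.length && firstCPs(range + 1) == cp)
        range += 1
      table(cp) = types(range)
      cp += 1
    }
    table
  }

  /** Returns the first code point where both paths disagree, or -1 if none. */
  def firstMismatch(limit: Int): Int = {
    val reference = referenceTable(limit)
    var result = -1
    var cp = 0
    while (result < 0 && cp < limit) {
      if (fastType(cp) != reference(cp)) result = cp
      cp += 1
    }
    result
  }
}
```

Calling `ExhaustiveCheckSketch.firstMismatch(0x110000)` covers every code point, which is cheap enough to run as part of a test suite.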

We now generate the Unicode database used by `jl.Character.getType` and the various `jl.Character.isXYZ` methods in an sbt task. The relevant methods are isolated in a dedicated object `UnicodeData`, to allow DCE without slowing down access to them.

More importantly, we use a new encoding that is more efficient in both code size and run-time.

For run-time, we store each range as a clever bit field: `firstCP` (21) ++ props (5) ++ `alternatingTypes` flag (1) ++ type (5) which, remarkably, adds up to 32 bits. There are 5 boolean properties that cannot be derived from the type field, but that are used in conjunction with the type field to implement some of the `isXYZ` methods. A single, efficient method `dataHasAnyFlag` can test for a set of types and a set of properties. For example, `isAlphabetic` returns `true` when either the type is any of the five major `XYZ_LETTER` types or `LETTER_NUMBER`, or when the property `OtherAlphabetic` is on.

For code size, we further compress the `firstCP` field as a *diff* from the previous range's `firstCP`. We expand the diffs in-place in the constructor of `UnicodeData`.

Two `isXYZ` methods, namely `isIdeographic` and `isMirrored`, rely only on a boolean property, and not on the type. Since those properties are not strongly correlated with the ranges used for the main data, each of them gets its own independent array of ranges.

Since the logic is bit hackery-heavy, we also generate exhaustive test data with straightforward logic.
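To make the 32-bit layout concrete, here is a minimal packing and unpacking sketch. It assumes that `++` lists the fields from most-significant to least-significant bits, so that `firstCP` occupies the top 21 bits and the type the low 5. The object and method names are hypothetical, and the meaning of the `alternatingTypes` flag is not interpreted here.

```scala
object PackedRangeSketch {
  // Assumed field widths and shifts, from least- to most-significant bits.
  final val TypeBits = 5
  final val PropsBits = 5
  final val AltShift = 5      // 1-bit alternatingTypes flag, just above the type
  final val PropsShift = 6    // 5 property bits
  final val FirstCPShift = 11 // 21 bits of firstCP, up to bit 31

  def pack(firstCP: Int, props: Int, alternating: Boolean, tpe: Int): Int =
    (firstCP << FirstCPShift) | (props << PropsShift) |
      ((if (alternating) 1 else 0) << AltShift) | tpe

  // Unsigned shift is required: code points >= 0x100000 set the sign bit.
  def firstCP(entry: Int): Int = entry >>> FirstCPShift

  def props(entry: Int): Int = (entry >>> PropsShift) & ((1 << PropsBits) - 1)

  def alternatingTypes(entry: Int): Boolean = ((entry >>> AltShift) & 1) != 0

  def tpe(entry: Int): Int = entry & ((1 << TypeBits) - 1)
}
```

Under this assumed layout, entries for code points at or above `0x100000` are negative as signed `Int`s, so any search over the array has to compare `entry >>> 11` with the code point rather than compare the raw entries.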

sjrd added 3 commits December 30, 2025 12:30

Opt: More simplifyOnlyInterestedInMask.

eb2e666

Treat ArrayValue as trivially side-effect-free.

7a2088b

This allows constant data arrays to be stored in `val`s without negative impact.

Opt: Make Character.nonASCIIZeroDigitCodePoints a non-lazy val.

9fee3c4

That speeds up access to it. Since it is a constant array value, it is DCE'ed if unused.

sjrd force-pushed the unicode-data branch 8 times, most recently from a368c8f to 90c2125 · December 31, 2025 13:13

sjrd force-pushed the unicode-data branch from 90c2125 to 0c0d6ac · December 31, 2025 17:26
