write FractionalUCA_blanked.txt #1155

Merged: 3 commits, May 20, 2025

markusicu (Member) commented May 20, 2025

... like running blankweights.sed over FractionalUCA.txt.

The generated FractionalUCA.txt contains all of the data for building the CLDR root collation data. A diff of this file across Unicode versions is nearly unusable, because inserting any character into the sort order changes the primary weights of every following character. In addition, the computation of the fractional byte sequence weights is subject to manual and automatic tweaks.

Therefore, for many years I have used the blankweights.sed script to generate a modified file with “blanked weights” that preserves the sort order and the number of non-zero weights. I find diffing this file across versions valuable. See the UCA tools docs for some more details.

For example:

FractionalUCA.txt
04C1; [62 2C, 05, 9B][, 8C, 05]	# Cyrl Lu	[2837.0020.0008][0000.0026.0002]	* CYRILLIC CAPITAL LETTER ZHE WITH BREVE

FractionalUCA_blanked.txt
04C1; [pp pp, ss, tt][, ss, tt]	# Cyrl Lu	[pppp.ssss.0008][0000.ssss.0002]	* CYRILLIC CAPITAL LETTER ZHE WITH BREVE
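
The transformation in the example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the blankweights.sed script and not the Java code in this PR: the function name `blank_weights` and the two regular expressions are assumptions chosen to reproduce the single sample line shown here, and the real file contains other line types and weight notations that this sketch does not handle.

```python
import re

# Fractional collation element such as "[62 2C, 05, 9B]" or "[, 8C, 05]".
# Assumption: three comma-separated fields of space-separated hex bytes.
FRACTIONAL_CE = re.compile(r"\[([0-9A-F ]*),([0-9A-F ]*),([0-9A-F ]*)\]")

# allkeys.txt-style element in the inline comment, such as "[2837.0020.0008]".
ALLKEYS_CE = re.compile(r"\[([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]")


def _blank_bytes(field, placeholder):
    """Replace every two-digit hex byte with the placeholder, keeping spacing."""
    return re.sub(r"[0-9A-F]{2}", placeholder, field)


def _blank_fractional(match):
    primary, secondary, tertiary = match.groups()
    return "[{},{},{}]".format(
        _blank_bytes(primary, "pp"),
        _blank_bytes(secondary, "ss"),
        _blank_bytes(tertiary, "tt"),
    )


def _blank_allkeys(match):
    primary, secondary, tertiary = match.groups()
    # Zero weights stay visible so the number of non-zero weights is preserved.
    primary = primary if primary == "0000" else "pppp"
    secondary = secondary if secondary == "0000" else "ssss"
    # The tertiary weight is kept as-is, matching the sample line above.
    return "[{}.{}.{}]".format(primary, secondary, tertiary)


def blank_weights(line):
    """Blank the weights on one mapping line; leave other lines untouched."""
    semicolon = line.find(";")
    if semicolon < 0 or line.startswith("#"):
        return line
    # Keep the code point(s) before the semicolon exactly as written.
    head, rest = line[: semicolon + 1], line[semicolon + 1 :]
    rest = FRACTIONAL_CE.sub(_blank_fractional, rest)
    rest = ALLKEYS_CE.sub(_blank_allkeys, rest)
    return head + rest
```

Because each line keeps its position and its count of non-zero weights, a plain diff of two blanked files shows only insertions, removals, and reorderings, not the renumbering of every following primary weight.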

With these code changes, the Java code writes FractionalUCA_blanked.txt directly, just as it already writes a FractionalUCA_SHORT.txt file (which omits the inline comments with allkeys.txt weights and character types & names). This should make the UCA-CLDR update workflow easier.

I am adding this file to the CLDR root collation data files.

I suggest reviewing one commit at a time. I intend to rebase & merge the three commits.

echeran (Contributor) previously approved these changes May 20, 2025

rslgtm. all comments are optional

echeran (Contributor) previously approved these changes May 20, 2025

LGTM

@markusicu markusicu merged commit 59ca1ca into unicode-org:main May 20, 2025
16 checks passed
@markusicu markusicu deleted the fracuca-blanked branch May 20, 2025 22:01