Enhanced transcribed language scripts #904

sspanak · 2025-10-16T12:18:15Z

No description provided.

Copilot

Pull Request Overview

This PR enhances support for transcribed languages and updates frequency handling across scripts and documentation.

- Add YAML-based layout parsing and grouping for transcribed entries in normalize-transcribed.py
- Add --prefer-higher flag and improved transcribed handling in inject-dictionary-frequencies.js
- Increase max frequency from 999 to 9999 and update padding to 4 digits

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/normalize-transcribed.py | Adds YAML layout parsing, groups by layout index pattern, and switches from 'chinese' to 'native' keys. |
| scripts/inject-dictionary-frequencies.js | Adds --prefer-higher flag, adjusts parsing for transcribed inputs, and changes keying for frequency lookups. |
| app/constants.gradle | Bumps MAX_WORD_FREQUENCY to 9999 to match new constraints. |
| app/build-dictionaries.gradle | Pads frequency prefixes to 4 digits for sorting consistency with new max. |
| CONTRIBUTING.md | Updates documented valid frequency range to 0–9999. |
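
The Gradle padding code itself is not shown in this conversation. As a rough illustration of why the prefix width has to grow with the new maximum, here is a minimal Python sketch. It assumes left zero-padding; `frequency_prefix` is a hypothetical helper, and the `MAX_WORD_FREQUENCY` name only mirrors the constant mentioned for app/constants.gradle.

```python
MAX_WORD_FREQUENCY = 9999  # new maximum from this PR; name mirrors app/constants.gradle


def frequency_prefix(frequency: int, width: int = 4) -> str:
    """Return a fixed-width, zero-padded prefix (hypothetical helper)."""
    if not 0 <= frequency <= MAX_WORD_FREQUENCY:
        raise ValueError(f"frequency must be within 0..{MAX_WORD_FREQUENCY}")
    return str(frequency).zfill(width)


frequencies = [5, 1200, 87, 9999]

# Plain strings sort character by character, so "1200" lands before "5".
unpadded = sorted(str(f) for f in frequencies)

# Fixed-width prefixes sort in the same order as the numbers they encode.
padded = sorted(frequency_prefix(f) for f in frequencies)

print(unpadded)
print(padded)
```

With a width of 3, a value such as 1200 would be wider than its neighbours and break the ordering again, which is presumably why the padding changes together with the maximum.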

Copilot

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Copilot · 2025-10-19T10:35:25Z

scripts/normalize-transcribed.py

+            matched = False
+            for symbol in sorted_symbols:
+                if latin.startswith(symbol, i):
+                    index_seq.append(str(layout_dict[symbol]))


Concatenating numeric group indices without a delimiter can create ambiguous keys once the indices reach two digits (e.g., [1,10] and [11,0] both become '110'). Use an unambiguous key instead, e.g., store the indices as integers and use a tuple.
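
A minimal sketch of the difference, assuming the index sequence is only used as a grouping key. `concat_key` and `tuple_key` are hypothetical names for illustration, not functions from normalize-transcribed.py.

```python
from collections import defaultdict


def concat_key(indices):
    """Join indices without a delimiter (the ambiguous variant)."""
    return "".join(str(i) for i in indices)


def tuple_key(indices):
    """Keep the indices as integers inside a hashable tuple."""
    return tuple(indices)


first = [1, 10]
second = [11, 0]

# Both sequences collapse to the same string key.
print(concat_key(first), concat_key(second))

# Tuple keys keep the two sequences in separate groups.
groups = defaultdict(list)
groups[tuple_key(first)].append("first")
groups[tuple_key(second)].append("second")
print(dict(groups))
```

Tuples of integers also compare element by element, so sorting the keys follows numeric order rather than character order.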

Copilot · 2025-10-19T10:35:25Z

scripts/normalize-transcribed.py

+    with open(yaml_path, encoding='utf-8') as f:
+        data = yaml.safe_load(f)
+
+    if "layout" not in data or not isinstance(data["layout"], list):


If the YAML file is empty or doesn't parse to a mapping, data will be None, and 'layout' in data will raise a TypeError. Guard for dict first: if not isinstance(data, dict) or 'layout' not in data or not isinstance(data['layout'], list):

Suggested change

-    if "layout" not in data or not isinstance(data["layout"], list):
+    if not isinstance(data, dict) or "layout" not in data or not isinstance(data["layout"], list):
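
The guard can be tried in isolation on already-parsed data. The sketch below is an assumption-laden illustration: `extract_layout` is a hypothetical wrapper, and raising `ValueError` is a guess, since the diff does not show what the script does when the check fails.

```python
def extract_layout(data):
    """Return the 'layout' list from parsed YAML data (hypothetical helper)."""
    # Check the container type first so that None, numbers and strings
    # are rejected before the membership test runs.
    if not isinstance(data, dict) or "layout" not in data or not isinstance(data["layout"], list):
        raise ValueError("expected a mapping with a 'layout' list")
    return data["layout"]


# yaml.safe_load returns None for an empty document, so None is the key case.
for candidate in (None, "plain text", {}, {"layout": "abc"}, {"layout": ["abc", "def"]}):
    try:
        print(repr(candidate), "->", extract_layout(candidate))
    except ValueError as error:
        print(repr(candidate), "-> rejected:", error)
```

Without the `isinstance(data, dict)` check, the `None` case fails with a `TypeError` from the `in` operator instead of the intended validation error.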

Copilot · 2025-10-19T10:35:26Z

scripts/inject-dictionary-frequencies.js

 		const parts = line.split(DELIMITER);
-		const word = parts[0].toLocaleLowerCase(locale);
-		let frequency = parts.length > 1 ? Number.parseInt(parts[1]) : 0;
+		const wordId = transcribed && parts.length >= 2 ? `${parts[0]}${parts[1]}` : parts[0].toLocaleLowerCase(locale);


The composite key concatenates the first two columns without a separator, which can collide (e.g., 'a'+'bc' vs 'ab'+'c'). Use a separator unlikely to appear in fields, such as DELIMITER.
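
The script in question is JavaScript, but the collision does not depend on the language. The following Python sketch illustrates the idea only; the tab value of `DELIMITER` is an assumption (the real constant is defined in inject-dictionary-frequencies.js), and both key functions are hypothetical.

```python
DELIMITER = "\t"  # assumption: the real value lives in the JavaScript script


def joined_key(parts):
    """Concatenate the first two columns with no separator (collision-prone)."""
    return f"{parts[0]}{parts[1]}"


def delimited_key(parts):
    """Join the first two columns with the same delimiter used to split the line."""
    return DELIMITER.join(parts[:2])


first = "a\tbc\t10".split(DELIMITER)
second = "ab\tc\t20".split(DELIMITER)

# Different column pairs, identical concatenated key.
print(joined_key(first) == joined_key(second))

# The delimited keys stay distinct.
print(delimited_key(first) == delimited_key(second))
```

Because the fields were produced by splitting on `DELIMITER`, none of them can contain it, so the delimited key maps back to exactly one pair of columns.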

Copilot · 2025-10-19T10:35:26Z

scripts/inject-dictionary-frequencies.js

 		if (transcribed) {
 			const parts = line.split(DELIMITER);
-			word = parts[0];
+			wordId = parts.length > 1 ? `${parts[0]}${parts[1]}` : parts[0];


Same collision risk as above when forming the composite key for dictionary lines. Use a separator like DELIMITER to avoid ambiguity.

sspanak self-assigned this Oct 16, 2025

sspanak added the labels languages (Dictionary or language related issues) and technical (Refactoring without user-facing or functional changes) Oct 16, 2025

sspanak commented Oct 16, 2025

scripts/normalize-transcribed.py (outdated)

sspanak force-pushed the enhanced-transcribed-language-scripts branch from 4b7f2be to 82e2122 Compare October 19, 2025 10:17

sspanak marked this pull request as ready for review October 19, 2025 10:17

sspanak requested a review from Copilot October 19, 2025 10:17

Copilot AI reviewed Oct 19, 2025

sspanak added 2 commits October 19, 2025 13:32

increased the maximum word frequency from 999 to 9999

0dde19c

improve inject-frequencies

3997fcd

sspanak force-pushed the enhanced-transcribed-language-scripts branch from 82e2122 to 79e3dd7 Compare October 19, 2025 10:32

sspanak requested a review from Copilot October 19, 2025 10:33

Copilot AI reviewed Oct 19, 2025

fixed normalize-transcribed sorting textonyms incorrectly

e3148e5

sspanak force-pushed the enhanced-transcribed-language-scripts branch from 79e3dd7 to e3148e5 Compare October 19, 2025 10:44

sspanak merged commit fd604aa into master Oct 19, 2025
5 checks passed

sspanak deleted the enhanced-transcribed-language-scripts branch October 19, 2025 10:44
