Enhanced transcribed language scripts #904
4b7f2be to 82e2122
Pull Request Overview
This PR enhances support for transcribed languages and updates frequency handling across scripts and documentation.
- Add YAML-based layout parsing and grouping for transcribed entries in normalize-transcribed.py
- Add --prefer-higher flag and improved transcribed handling in inject-dictionary-frequencies.js
- Increase max frequency from 999 to 9999 and update padding to 4 digits
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| scripts/normalize-transcribed.py | Adds YAML layout parsing, groups by layout index pattern, and switches from 'chinese' to 'native' keys. |
| scripts/inject-dictionary-frequencies.js | Adds --prefer-higher flag, adjusts parsing for transcribed inputs, and changes keying for frequency lookups. |
| app/constants.gradle | Bumps MAX_WORD_FREQUENCY to 9999 to match new constraints. |
| app/build-dictionaries.gradle | Pads frequency prefixes to 4 digits for sorting consistency with new max. |
| CONTRIBUTING.md | Updates documented valid frequency range to 0–9999. |
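The link between the new maximum and the padding width can be shown with a short sketch. This is an illustration in Python, not the Gradle build logic itself; the function name `frequency_prefix` and the sample values are hypothetical. It assumes the prefixes are compared as strings, which is why a 3-digit pad stops sorting correctly once frequencies can exceed 999.

```python
def frequency_prefix(frequency, width):
    """Zero-pad a frequency so that string order matches numeric order."""
    return f"{frequency:0{width}d}"


frequencies = [5, 1000, 999, 42]

# With 3-digit padding, "1000" sorts before "999" because '1' < '9'.
three_digit = sorted(frequency_prefix(f, 3) for f in frequencies)

# With 4-digit padding, every value up to 9999 has the same width,
# so lexicographic order and numeric order agree.
four_digit = sorted(frequency_prefix(f, 4) for f in frequencies)

print(three_digit)
print(four_digit)
```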
82e2122 to 79e3dd7
matched = False
for symbol in sorted_symbols:
    if latin.startswith(symbol, i):
        index_seq.append(str(layout_dict[symbol]))
Copilot AI, Oct 19, 2025
Concatenating numeric group indices without a delimiter creates ambiguous keys once an index reaches two digits (e.g., [1, 10] and [11, 0] both become '110'). Use an unambiguous key instead, e.g., keep the indices as integers and use a tuple.
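The collision is easy to reproduce. The sketch below is a minimal illustration, not the script's actual code: the contents of `layout_dict` and the helpers `index_sequence`, `joined_key`, and `tuple_key` are hypothetical, and the matching loop is reconstructed from the quoted snippet under the assumption that longer symbols are tried first.

```python
def index_sequence(latin, layout_dict):
    """Map a Latin string to the layout group indices of its symbols."""
    # Assumption: longer symbols are tried first so multi-letter symbols win.
    sorted_symbols = sorted(layout_dict, key=len, reverse=True)
    index_seq = []
    i = 0
    while i < len(latin):
        for symbol in sorted_symbols:
            if latin.startswith(symbol, i):
                index_seq.append(layout_dict[symbol])  # keep the integer
                i += len(symbol)
                break
        else:
            raise ValueError(f"no layout symbol matches at position {i}")
    return index_seq


def joined_key(index_seq):
    """Ambiguous key: digits are concatenated with no delimiter."""
    return "".join(str(n) for n in index_seq)


def tuple_key(index_seq):
    """Unambiguous key: a tuple keeps the boundaries between indices."""
    return tuple(index_seq)


# Hypothetical layout with more than ten groups.
layout_dict = {"a": 0, "b": 1, "c": 10, "d": 11}

seq_bc = index_sequence("bc", layout_dict)
seq_da = index_sequence("da", layout_dict)

print(joined_key(seq_bc) == joined_key(seq_da))  # the string keys collide
print(tuple_key(seq_bc) == tuple_key(seq_da))    # the tuple keys do not
```

A tuple is hashable, so it can be used directly as a dictionary key when grouping entries.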
with open(yaml_path, encoding='utf-8') as f:
    data = yaml.safe_load(f)

if "layout" not in data or not isinstance(data["layout"], list):
Copilot AI, Oct 19, 2025
If the YAML file is empty, yaml.safe_load returns None and the "layout" not in data check raises a TypeError; a document that parses to something other than a mapping is not handled cleanly either. Check for a dict first: if not isinstance(data, dict) or "layout" not in data or not isinstance(data["layout"], list):
Suggested change:
- if "layout" not in data or not isinstance(data["layout"], list):
+ if not isinstance(data, dict) or "layout" not in data or not isinstance(data["layout"], list):
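The guard can also be factored into a small validation helper. The sketch below is one possible shape, not the script's actual code: the function name `extract_layout` and the choice of `ValueError` are assumptions. It takes the already-parsed value, such as the result of `yaml.safe_load`, so it can be exercised without a YAML file.

```python
def extract_layout(data):
    """Return the 'layout' list from parsed YAML data, or raise ValueError."""
    # An empty YAML document parses to None; scalars and lists are
    # also possible top-level values, so check for a mapping first.
    if not isinstance(data, dict):
        raise ValueError(f"expected a YAML mapping, got {type(data).__name__}")
    layout = data.get("layout")
    if not isinstance(layout, list):
        raise ValueError("'layout' key is missing or is not a list")
    return layout


def rejects(value):
    """Report whether extract_layout refuses the given parsed value."""
    try:
        extract_layout(value)
    except ValueError:
        return True
    return False


print(extract_layout({"layout": [["a", "b"], ["c"]]}))
print(rejects(None), rejects([]), rejects({"layout": "abc"}))
```

Checking the type before the key lookup means every malformed input produces the same kind of error instead of an unrelated TypeError.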
const parts = line.split(DELIMITER);
const word = parts[0].toLocaleLowerCase(locale);
let frequency = parts.length > 1 ? Number.parseInt(parts[1]) : 0;
const wordId = transcribed && parts.length >= 2 ? `${parts[0]}${parts[1]}` : parts[0].toLocaleLowerCase(locale);
Copilot AI, Oct 19, 2025
The composite key concatenates the first two columns without a separator, so distinct pairs can collide (e.g., 'a'+'bc' and 'ab'+'c' both yield 'abc'). Join them with a separator that cannot appear in a field, such as DELIMITER, which the line was split on.
if (transcribed) {
    const parts = line.split(DELIMITER);
    word = parts[0];
    wordId = parts.length > 1 ? `${parts[0]}${parts[1]}` : parts[0];
Copilot AI, Oct 19, 2025
Same collision risk as above when forming the composite key for dictionary lines. Use a separator like DELIMITER to avoid ambiguity.
79e3dd7 to e3148e5