[Intl] Languages and ISO 639-2 three-letter language codes #33136
Comments
For the terminology part, I believe we're fine ...
These are locales; I figure ICU includes them when it provides a "specialized" language name. I'm not sure they should be included from a "language selector" point of view. They are helpful for getting the "most specific language name" from a locale (thus language + script/country). I'll have a closer look ASAP to help investigate the data. I.e. I'm not sure that getting fewer languages when switching to alpha 3 is a problem per se. That's simply what the spec defines today...
Should we not at least add the 409 languages with only a 3-letter code?
I am on my lunch break now and can write some comments. Of the 4 mysterious languages with only a two-letter code, I can at least give you the full story on the first one, since I am Norwegian. Norway has 2 official written languages, "Norwegian Bokmål" and "Norwegian Nynorsk". The spoken dialects in Norway differ from each other more than the two written languages do. It is also not uncommon for two people who speak the same dialect to write differently: one prefers Bokmål, the other Nynorsk. The two are almost the same, and at one point in history we tried to unify them into one written language, but for political reasons it failed. That is why, most of the time, even Norwegians do not bother to make a distinction between the two; they simply refer to the language as "Norwegian" (when they speak English). This explains why the ISO tables have 3 entries for Norwegian:
To be clear: Norway has only 2 written languages, not 3. I have no idea why the ISO 639-2 code "nor" was not there.
sh => Serbo-Croatian has 2 ISO 639-2 codes: "scr" and "scc". All 3 are marked as deprecated. Source: https://en.wikipedia.org/wiki/Serbo-Croatian
tw => Twi has the ISO 639-2 code "twi". A comment in https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes says it is "Covered by macrolanguage [ak/aka]".
tl => Tagalog has the ISO 639-2 code "tgl". It is also officially called "Filipino", ISO 639-2 code "fil".
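For reference, the four two-letter codes discussed above can be collected in one small lookup. This is only a sketch: the dictionary name and helper are hypothetical, and the values are the codes as reported in these comments, not data read from ICU or the ISO registry.

```python
# Three-letter codes reported in this thread for the four two-letter codes
# that had no alpha3 mapping. Hypothetical structure, for illustration only.
REPORTED_THREE_LETTER_CODES = {
    "no": ["nor"],         # Norwegian
    "sh": ["scr", "scc"],  # Serbo-Croatian; all three codes reported as deprecated
    "tw": ["twi"],         # Twi; reported as covered by macrolanguage ak/aka
    "tl": ["tgl"],         # Tagalog; "fil" (Filipino) also mentioned
}


def reported_three_letter_codes(two_letter_code: str) -> list[str]:
    """Return the three-letter codes reported for a code, or an empty list."""
    return REPORTED_THREE_LETTER_CODES.get(two_letter_code, [])


print(reported_three_letter_codes("sh"))
```

The "sh" entry shows why a plain one-to-one mapping is not enough here: one two-letter code can correspond to more than one three-letter code.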
I see, the "language codes" were never split between ISO 639-1 and ISO 639-2; the list is a combination of both (holding the preferred ISO 639-2 codes). This means our ISO 639-2 (alpha3) list is based only on the mapping, not on the full list. I agree we should include each 3-letter code. However, does that mean we only include 2-letter codes for "language codes" (thus alpha2)? I think we should keep the current combination, to provide a full list. But we could consider passing e.g. a flag .
This is a mapping/ordering issue, see https://github.com/unicode-org/icu/blob/513b0c20b0bc99cfe18d39bb9606a404d661afcc/icu4c/source/data/misc/metadata.txt#L1038-L1045 and https://github.com/unicode-org/icu/blob/513b0c20b0bc99cfe18d39bb9606a404d661afcc/icu4c/source/data/misc/metadata.txt#L1018-L1021 (not sure what the preferred mapping actually is). From this I tend to believe we should exclude legacy codes 🤔 See also https://github.com/unicode-org/icu/blob/513b0c20b0bc99cfe18d39bb9606a404d661afcc/icu4c/source/data/misc/metadata.txt#L1454-L1461 for This is puzzling :)
Anyway, I must do my work now. I will not have time to be involved in this for a while.
I'll give it a try soon :)
alpha2 = all 2 letter codes from https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt
alpha3 = all 3 letter codes from https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt + all 3 letter codes https://github.com/unicode-org/icu/blob/master/icu4c/source/data/misc/metadata.txt where the replacement is a valid 2-letter code and
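The rule proposed above can be sketched roughly as follows. This is an illustration, not the actual Symfony generator: `build_code_lists` and the sample data are hypothetical, the real inputs would be parsed from the ICU `en.txt` and `metadata.txt` files, and the final condition that is cut off after "and" is not modeled.

```python
def build_code_lists(named_codes, replacements):
    """Split codes into alpha2 and alpha3 lists.

    named_codes:  codes that have a language name (as in ICU's lang/en.txt).
    replacements: legacy code -> replacement code (as in ICU's metadata.txt).
    """
    alpha2 = {code for code in named_codes if len(code) == 2}
    alpha3 = {code for code in named_codes if len(code) == 3}
    # Add 3-letter codes whose replacement is a valid 2-letter code.
    alpha3 |= {
        legacy
        for legacy, replacement in replacements.items()
        if len(legacy) == 3 and replacement in alpha2
    }
    return sorted(alpha2), sorted(alpha3)


# Made-up sample input; "en_GB" stands for a locale entry that must be skipped.
named_codes = ["en", "nb", "ace", "fil", "en_GB"]
replacements = {"eng": "en", "nob": "nb", "xyz": "qq"}

alpha2, alpha3 = build_code_lists(named_codes, replacements)
print(alpha2)  # ['en', 'nb']
print(alpha3)  # ['ace', 'eng', 'fil', 'nob']
```

With this rule the alpha3 list contains both the languages that only have a 3-letter code and the 3-letter equivalents of the 2-letter codes, while locale-style entries are left out of both lists.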
Just some more clarification on Norwegian. In Norway more than 85% of the people write Bokmål (Source: https://en.wikipedia.org/wiki/Norwegian_language). I guess that is why someone decided to deprecate the ISO codes "no" and "nor" in favour of "nb" and "nob".
This PR was merged into the 4.4 branch.

Discussion
----------

[Intl] Full alpha3 language support

| Q | A
| ------------- | ---
| Branch? | 4.4
| Bug fix? | yes
| New feature? | no
| BC breaks? | no <!-- see https://symfony.com/bc -->
| Deprecations? | no
| Tests pass? | yes <!-- please add some, will be required by reviewers -->
| Fixed tickets | #33136
| License | MIT
| Doc PR | symfony/symfony-docs#... <!-- required for new features -->

I'll validate some more cases with tests.

Commits
-------

29aee2d [Intl] Full alpha3 language support
In my previous PR, "Support ISO 3166-1 Alpha-3 country codes", I added support not only for alpha-3 country codes, but also extended the Languages class to support ISO 639-2 three-letter language codes. My focus was on the country codes, and with the country codes it was easy to get it right because every country has one alpha2 code and one alpha3 code.
For languages, things are a bit more complicated. The extension to the Languages class was made to just mirror the Countries class, and not enough thought went into it. Here are some of the problems that I can now see (after thinking about it) with Languages:
Wrong terminology
For languages we have ISO 639-1, which covers two-letter codes, and ISO 639-2, which covers three-letter codes. Nowhere in the ISO specifications do they talk about "alpha2" or "alpha3". Those terms are borrowed from the ISO 3166-1 spec that covers country/region codes. So anyone familiar with ISO 639-1 and ISO 639-2 will wonder why our code has method names and variable names with "alpha2" and "alpha3" in them.
Missing languages
This is a more serious issue. Not all languages have an ISO 639-1 two-letter code. Here are some examples of the exceptions:
Implications for Languages methods
Languages::getAlpha3Code: This is the only one that is not new. It throws MissingResourceException for all but the 180 languages in the last category. That does not seem right to me. Maybe it should accept any code longer than 2 letters as input and return it unchanged if it is a valid language code?
Languages::getAlpha2Code: Same considerations as above. It currently throws MissingResourceException for all but the 180 languages that have a 2-letter code.
Languages::getAlpha3Codes(): Currently returns only 180 codes.
Languages::alpha3CodeExists: Currently returns true only for the above-mentioned 180 codes.
Languages::getAlpha3Name: Throws MissingResourceException if the language is not among the 180.
Languages::getAlpha3Names: Returns a list of only 180 languages. By contrast, Languages::getNames returns a list of 615 languages.
What to do
I am seeking input in this issue from others on what to do about this. Please comment below.
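To make one of the options concrete, here is a minimal sketch of the behaviour suggested above for getAlpha3Code: map 2-letter codes as before, and return longer codes unchanged when they are valid. It is written as a standalone illustration and is not the Symfony implementation; the tiny tables, the `get_alpha3_code` function and the `MissingResourceError` class are hypothetical stand-ins.

```python
# Hypothetical, deliberately tiny stand-ins for the real ICU-derived data.
ALPHA2_TO_ALPHA3 = {"en": "eng", "fr": "fra", "nb": "nob"}
ALPHA3_ONLY = {"ace", "fil"}  # languages that only have a three-letter code


class MissingResourceError(LookupError):
    """Stand-in for the MissingResourceException mentioned in this issue."""


def get_alpha3_code(code: str) -> str:
    """Return the three-letter code for a two- or three-letter language code."""
    if len(code) == 2:
        # Two-letter input: map it, or fail if there is no mapping.
        if code not in ALPHA2_TO_ALPHA3:
            raise MissingResourceError(code)
        return ALPHA2_TO_ALPHA3[code]
    # Longer input: return it unchanged if it is a known language code.
    if code in ALPHA3_ONLY or code in ALPHA2_TO_ALPHA3.values():
        return code
    raise MissingResourceError(code)


print(get_alpha3_code("en"))   # eng
print(get_alpha3_code("fil"))  # fil
```

The trade-off is that the method would no longer be a strict "two-letter in, three-letter out" conversion, so callers that rely on the exception to detect non-alpha2 input would see a change in behaviour.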