[Intl] Languages and ISO 639-2 three-letter language codes #33136

Closed
TerjeBr opened this issue Aug 13, 2019 · 10 comments

@TerjeBr

TerjeBr commented Aug 13, 2019

In my previous PR, "Support ISO 3166-1 Alpha-3 country codes", I added support not only for alpha-3 country codes but also extended the Languages class to support ISO 639-2 three-letter language codes. My focus was on the country codes, and those were easy to get right because all countries have one alpha2 code and one alpha3 code.

For languages, things are a bit more complicated. The extension to the Languages class was made just to mirror the Countries class, and not enough thought went into it. Here are some of the problems I can now see (after thinking about it) with Languages:

Wrong terminology

For languages we have ISO 639-1, which covers two-letter codes, and ISO 639-2, which covers three-letter codes. Nowhere in the ISO specifications do they talk about "alpha2" or "alpha3". Those terms are borrowed from the ISO 3166-1 spec, which covers country/region codes. So anyone familiar with ISO 639-1 and ISO 639-2 will wonder why our code has method names and variable names with "alpha2" and "alpha3" in them.

Missing languages

This is a more serious issue. Not all languages have an ISO 639-1 two-letter code. Here are some examples of the exceptions:

  • 409 languages (out of 615) have only a three-letter code and no corresponding two-letter code. For example:
 ace => Achinese
 ach => Acoli
 arz => Egyptian Arabic
  • 22 languages must be described with more than three letters:
 en_AU => Australian English
 de_AT => Austrian German
 zh_Hans => Simplified Chinese
  • 4 languages have only a two-letter code and no three-letter code:
 no => Norwegian
 sh => Serbo-Croatian
 tl => Tagalog
 tw => Twi
  • Only 180 languages have a two-letter to three-letter mapping.
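
As a rough illustration of the categories above, the following Python sketch groups a handful of the example codes by their shape. It is not Symfony code: `SAMPLE_CODES` and `code_shape` are hypothetical names, and the sample covers only a few of the 615 entries. Shape alone also cannot tell whether a two-letter code has a three-letter counterpart; that needs the actual mapping data.

```python
from collections import Counter

# A few codes taken from the examples in this issue (not the full data set).
SAMPLE_CODES = [
    "ace", "ach", "arz",          # three-letter only
    "en_AU", "de_AT", "zh_Hans",  # longer, locale-like identifiers
    "no", "sh", "tl", "tw",       # two-letter codes reported without a mapping
    "nb", "nn",                   # two-letter codes that do have a mapping
]


def code_shape(code: str) -> str:
    """Classify a language identifier by its shape only (hypothetical helper)."""
    if "_" in code:
        return "locale-like"
    if len(code) == 2:
        return "two-letter"
    if len(code) == 3:
        return "three-letter"
    return "other"


# Count how many sample codes fall into each shape.
shape_counts = Counter(code_shape(code) for code in SAMPLE_CODES)
```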

Implications for Languages methods

  • Languages::getAlpha3Code: This is the only one that is not new. It throws MissingResourceException for all but the 180 languages in the last category. That does not seem right to me. Maybe it should accept any code longer than two letters as input and return it unchanged if it is a valid language code?
  • Languages::getAlpha2Code: Same considerations as above. Currently throws MissingResourceException for all but the 180 languages that have a 2-letter code.
  • Languages::getAlpha3Codes(): Currently only returns 180 codes.
  • Languages::alpha3CodeExists: Currently only returns true for the above-mentioned 180 codes.
  • Languages::getAlpha3Name: Throws MissingResourceException if the language is not among the 180.
  • Languages::getAlpha3Names: Only returns a list of 180 languages. By contrast, Languages::getNames returns a list of 615 languages.
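
The lenient behaviour suggested for getAlpha3Code could look roughly like the Python sketch below. This is only an illustration under stated assumptions, not the Symfony implementation: the two dictionaries hold a tiny hand-picked subset of the data, `to_alpha3` is a hypothetical name, and `MissingResourceError` merely stands in for MissingResourceException.

```python
# Tiny illustrative subset; the real tables are much larger.
ALPHA2_TO_ALPHA3 = {"nb": "nob", "nn": "nno"}
THREE_LETTER_ONLY = {"ace", "ach", "arz"}


class MissingResourceError(KeyError):
    """Stand-in for Symfony's MissingResourceException (assumption)."""


def to_alpha3(code: str) -> str:
    """Return the three-letter code for `code`, passing valid
    three-letter codes through unchanged."""
    # Two-letter code with a known mapping: convert it.
    if code in ALPHA2_TO_ALPHA3:
        return ALPHA2_TO_ALPHA3[code]
    # Already a valid three-letter code: return it unchanged.
    if code in THREE_LETTER_ONLY or code in ALPHA2_TO_ALPHA3.values():
        return code
    # Anything else is unknown.
    raise MissingResourceError(code)
```

With this shape, `to_alpha3("nb")` gives `"nob"`, while `to_alpha3("ace")` returns `"ace"` instead of raising.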

What to do

I am seeking input in this issue from others on what to do about this. Please comment below.

@ro0NL
Contributor

ro0NL commented Aug 13, 2019

For the terminology part, I believe we're fine ...

ISO 639-1:2002, Codes for the representation of names of languages — Part 1: Alpha-2 code, is the first part of the ISO 639 series of international standards for language codes. Part 1 covers the registration of two-letter codes (https://en.wikipedia.org/wiki/ISO_639-1)

ISO 639-2:1998, Codes for the representation of names of languages — Part 2: Alpha-3 code, is the second part of the ISO 639 standard, which lists codes for the representation of the names of languages. The three-letter codes given for each language in this part of the standard are referred to as "Alpha-3" codes. (https://en.wikipedia.org/wiki/ISO_639-2)

> 22 languages must be described with more than three letters

These are locales; I figure ICU includes those when it provides a "specialized" language name. I'm not sure those should be included from a "language selector" point of view. They are helpful for getting the "most specific language name" from a locale (thus language + script/country).

I'll have a closer look asap to help investigate the data. I.e. I'm not sure that getting fewer languages when switching to alpha 3 is a problem per se. That's simply what the spec defines today...

@TerjeBr
Author

TerjeBr commented Aug 13, 2019

Should we not at least add the 409 languages with only a 3-letter code?

@TerjeBr
Author

TerjeBr commented Aug 13, 2019

I am in my lunch break now, and can write some comments.

Of the 4 mysterious languages with only a two-letter code, I can at least give you the full story on the first one, since I am Norwegian.

Norway has 2 official written languages, "Norwegian Bokmål" and "Norwegian Nynorsk". The spoken dialects in Norway differ from each other more than the two written languages do. It is also not uncommon for two people who speak the same dialect to prefer different written forms, one Bokmål and the other Nynorsk. The two are almost the same, and at one point in history we tried to unify them into one written language, but for some political reasons it failed.

That is why, most of the time, even Norwegians do not bother to make a distinction between the two; they simply refer to the language as "Norwegian" (when they speak English).

This explains why the ISO tables have 3 entries for Norwegian:

| ISO 639-1 | ISO 639-2 | Name              | Comment
| --------- | --------- | ----------------- | ---
| no        | nor       | Norwegian         | Used when you do not want to distinguish between Bokmål and Nynorsk
| nb        | nob       | Norwegian Bokmål  |
| nn        | nno       | Norwegian Nynorsk |

To be clear: Norway has only 2 written languages, not 3.

I have no idea why the ISO 639-2 code "nor" was not there.

@TerjeBr
Author

TerjeBr commented Aug 13, 2019

sh => Serbo-Croatian has 2 ISO 639-2 codes: "scr" and "scc".

All 3 are marked as deprecated. Source: https://en.wikipedia.org/wiki/Serbo-Croatian

@TerjeBr
Author

TerjeBr commented Aug 13, 2019

tw => Twi has ISO 639-2 code "twi"

A comment in https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes says it is "Covered by macrolanguage [ak/aka]"

@TerjeBr
Author

TerjeBr commented Aug 13, 2019

tl => Tagalog has ISO 639-2 code "tgl"
Source: https://en.wikipedia.org/wiki/Tagalog_language

It is also officially called "Filipino", ISO 639-2 code "fil".

@ro0NL
Contributor

ro0NL commented Aug 13, 2019

I see, the "language codes" were never split between ISO 639-1 and ISO 639-2; the list is a combination of both (holding the preferred ISO 639-2 codes).

This means our ISO 639-2 (alpha3) list is only based on the mapping, not the full list. I agree we should include each 3-letter code.

However, does it mean we only include 2-letter codes for "language codes" (thus alpha2)? I think we should keep the current combination, to provide a full list. But we could consider passing a flag, e.g. ONLY_ALPHA2.

> I have no idea why the ISO 639-2 code "nor" was not there.

This is a mapping/ordering issue, see https://github.com/unicode-org/icu/blob/513b0c20b0bc99cfe18d39bb9606a404d661afcc/icu4c/source/data/misc/metadata.txt#L1038-L1045 and https://github.com/unicode-org/icu/blob/513b0c20b0bc99cfe18d39bb9606a404d661afcc/icu4c/source/data/misc/metadata.txt#L1018-L1021 (not sure what the preferred mapping actually is).

From this I tend to believe we should exclude legacy codes 🤔

See also https://github.com/unicode-org/icu/blob/513b0c20b0bc99cfe18d39bb9606a404d661afcc/icu4c/source/data/misc/metadata.txt#L1454-L1461 for twi; it seems to have been updated in the meantime.

This is puzzling :)

@TerjeBr
Author

TerjeBr commented Aug 13, 2019

Anyway, I must do my work now. Will not have time to be involved in this for a while.

xabbuh added the Intl label Aug 13, 2019

@ro0NL
Contributor

ro0NL commented Aug 13, 2019

I'll give it a try soon :)

alpha2 = all 2 letter codes from https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt

alpha3 = all 3 letter codes from https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt + all 3 letter codes from https://github.com/unicode-org/icu/blob/master/icu4c/source/data/misc/metadata.txt where the replacement is a valid 2-letter code and the reason is "overlong"
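
The selection rule above can be sketched in Python as follows. This is only a sketch under assumptions: the two dictionaries are small in-memory stand-ins for en.txt and metadata.txt (no ICU file is parsed), their entries are illustrative rather than copied from ICU, the alias codes "xyz" and "abc" are made up, and `build_code_lists` is a hypothetical name. "Valid 2-letter code" is read here as "present in the alpha2 list".

```python
# Stand-in for the language names in en.txt (illustrative entries only).
LANGUAGE_NAMES = {
    "en": "English",
    "de": "German",
    "ace": "Achinese",
    "ach": "Acoli",
    "en_AU": "Australian English",
}

# Stand-in for the alias data in metadata.txt: alias -> (replacement, reason).
LANGUAGE_ALIASES = {
    "eng": ("en", "overlong"),
    "deu": ("de", "overlong"),
    "xyz": ("ace", "overlong"),   # made up: replacement is not a 2-letter code
    "abc": ("en", "deprecated"),  # made up: reason is not "overlong"
}


def build_code_lists(names, aliases):
    """Build (alpha2, alpha3) code lists following the rule described above."""
    # alpha2: every two-letter code that has a name.
    alpha2 = sorted(c for c in names if len(c) == 2 and c.isalpha())
    # alpha3: every three-letter code that has a name ...
    alpha3 = {c for c in names if len(c) == 3 and c.isalpha()}
    # ... plus "overlong" three-letter aliases of a valid two-letter code.
    for alias, (replacement, reason) in aliases.items():
        if len(alias) == 3 and reason == "overlong" and replacement in alpha2:
            alpha3.add(alias)
    return alpha2, sorted(alpha3)


alpha2_codes, alpha3_codes = build_code_lists(LANGUAGE_NAMES, LANGUAGE_ALIASES)
```

Locale-like entries such as en_AU end up in neither list, which matches the idea that they are not plain language codes.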

@TerjeBr
Author

TerjeBr commented Aug 13, 2019

Just some more clarification on Norwegian.

In Norway more than 85% of the people write Bokmål (Source: https://en.wikipedia.org/wiki/Norwegian_language)

I guess that is why someone decided to deprecate the ISO codes "no" and "nor" in favour of "nb" and "nob".

fabpot closed this as completed Aug 18, 2019
fabpot added a commit that referenced this issue Aug 18, 2019
This PR was merged into the 4.4 branch.

Discussion
----------

[Intl] Full alpha3 language support

| Q             | A
| ------------- | ---
| Branch?       | 4.4
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no     <!-- see https://symfony.com/bc -->
| Deprecations? | no
| Tests pass?   | yes    <!-- please add some, will be required by reviewers -->
| Fixed tickets | #33136
| License       | MIT
| Doc PR        | symfony/symfony-docs#... <!-- required for new features -->

I'll validate some more cases with tests.

Commits
-------

29aee2d [Intl] Full alpha3 language support