[DETECTION] Finnish in UTF-8 detected as Latin-1 when mistaken html meta element present

**Notice**
I hereby announce that my raw input is not:
- Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
- Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

**Provide the file**
An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.

https://jouniseppanen.fi/tmp/finnish-utf-8-latin-1-confusion.html

(Note that the web server adds a content type of text/html; charset=utf-8 which is correct, so your browser will likely show the text correctly.)

**Verbose output**

```
2024-10-02 08:40:59,849 | Level 5 | Detected declarative mark in sequence. Priority +1 given for latin_1.
2024-10-02 08:40:59,852 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 0.533000 %
2024-10-02 08:40:59,852 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-10-02 08:40:59,857 | Level 5 | We detected language [('English', 0.656), ('Hungarian', 0.5849), ('French', 0.578), ('Spanish', 0.5486), ('Norwegian', 0.5294), ('Dutch', 0.5243), ('Finnish', 0.5221), ('Indonesian', 0.5191), ('Italian', 0.5174), ('Estonian', 0.5152), ('Danish', 0.5047), ('Swedish', 0.4706), ('Slovene', 0.4669), ('Croatian', 0.4662), ('Portuguese', 0.4648), ('Czech', 0.4546), ('Romanian', 0.4492), ('German', 0.4409), ('Slovak', 0.4296), ('Turkish', 0.4224), ('Polish', 0.3995), ('Lithuanian', 0.3933), ('Vietnamese', 0.3714)] using latin_1
2024-10-02 08:40:59,857 | DEBUG | Encoding detection: latin_1 is most likely the one.
{
    "path": "/tmp/finnish-utf-8-latin-1-confusion.html",
    "encoding": "latin_1",
    "encoding_aliases": [
        "8859",
        "cp819",
        "csisolatin1",
        "ibm819",
        "iso8859",
        "iso8859_1",
        "iso_8859_1",
        "iso_8859_1_1987",
        "iso_ir_100",
        "l1",
        "latin",
        "latin1"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.533,
    "coherence": 65.6,
    "unicode_path": null,
    "is_preferred": true
}
```

**Expected encoding**

This should be UTF-8. One clue is that the output includes the word `PÃ¤Ã¤tÃ¶sehdotus` which is a mangled version of `Päätösehdotus`.
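The mangling can be reproduced with the standard library alone. This is only a minimal sketch of the round trip, not a description of what the detector does internally; the variable names are mine.

```python
# The UTF-8 bytes of a Finnish word, wrongly decoded as Latin-1.
original = "Päätösehdotus"
mangled = original.encode("utf-8").decode("latin_1")
print(mangled)  # PÃ¤Ã¤tÃ¶sehdotus

# Reversing the wrong decoding recovers the text, which suggests
# that the underlying bytes really were UTF-8.
repaired = mangled.encode("latin_1").decode("utf-8")
print(repaired == original)  # True
```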

Most nontrivial Finnish text will include several instances of the character `ä` and possibly `ö`. Upper-case versions `Ä` and `Ö` are possible but less common. When UTF-8 is interpreted as Latin-1 or Windows-1252, these become

- ä → \xc3\xa4 → Ã¤
- ö → \xc3\xb6 → Ã¶
- Ä → \xc3\x84 → Ã and a control character, or Ã„
- Ö → \xc3\x96 → Ã and a control character, or Ã–

The characters `Ã`, `¤`, `¶` and `„` do not appear in normal Finnish text. `Ã` could possibly appear in foreign names, but even then it would be very unlikely in the middle of a word. `¤` is the obscure "currency sign" character, whose code point Latin-9 (ISO-8859-15) reassigned to the euro sign; the euro sign does occur in Finnish text, but the combination `Ã€` would still be very unlikely. (The pilcrow might appear in text about typography, and the low quotation mark might appear in old-fashioned literature. The en dash is normal.)
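As a rough illustration of the mapping and the "unlikely pair" argument, here is a sketch using only the standard library. The helper names `mojibake_table` and `count_suspicious_pairs` are hypothetical and are not part of charset_normalizer; the pattern only covers the four letters discussed here.

```python
import re

# "Ã" followed by what the second UTF-8 byte of ä, ö, Ä, Ö turns into
# under Latin-1 (¤, ¶, two C1 control characters) or Windows-1252 („, –).
SUSPICIOUS = re.compile("\u00c3[\u00a4\u00b6\u0084\u0096\u201e\u2013]")


def mojibake_table(letters="äöÄÖ"):
    """Map each letter to (UTF-8 bytes, Latin-1 misreading, cp1252 misreading)."""
    rows = {}
    for ch in letters:
        raw = ch.encode("utf-8")
        rows[ch] = (raw, raw.decode("latin_1"), raw.decode("cp1252"))
    return rows


def count_suspicious_pairs(text):
    """Count pairs that look like UTF-8 Finnish letters read as a single-byte charset."""
    return len(SUSPICIOUS.findall(text))


for letter, (raw, as_latin1, as_cp1252) in mojibake_table().items():
    print(letter, raw, repr(as_latin1), repr(as_cp1252))

print(count_suspicious_pairs("PÃ¤Ã¤tÃ¶sehdotus"))  # 3
print(count_suspicious_pairs("Päätösehdotus"))     # 0
```

A nonzero count in text decoded as Latin-1 would be a hint, under these assumptions, that the bytes were UTF-8 all along; whether and how such a signal should feed into the chaos measure is up to the maintainers.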

**Desktop (please complete the following information):**
 - OS: macOS 14.7
 - Python version 3.12.6
 - Package version 3.3.2

**Additional context**

My guess is that this kind of thing happens when someone set up a CMS in the 1990s when Finnish text was commonly encoded in Latin-1 or Windows-1252, and later the data store was changed to use UTF-8 but the meta tags were neglected.

[DETECTION] Finnish in UTF-8 detected as Latin-1 when mistaken html meta element present #537
