2024-10-02 08:40:59,849 | Level 5 | Detected declarative mark in sequence. Priority +1 given for latin_1.
2024-10-02 08:40:59,852 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 0.533000 %
2024-10-02 08:40:59,852 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-10-02 08:40:59,857 | Level 5 | We detected language [('English', 0.656), ('Hungarian', 0.5849), ('French', 0.578), ('Spanish', 0.5486), ('Norwegian', 0.5294), ('Dutch', 0.5243), ('Finnish', 0.5221), ('Indonesian', 0.5191), ('Italian', 0.5174), ('Estonian', 0.5152), ('Danish', 0.5047), ('Swedish', 0.4706), ('Slovene', 0.4669), ('Croatian', 0.4662), ('Portuguese', 0.4648), ('Czech', 0.4546), ('Romanian', 0.4492), ('German', 0.4409), ('Slovak', 0.4296), ('Turkish', 0.4224), ('Polish', 0.3995), ('Lithuanian', 0.3933), ('Vietnamese', 0.3714)] using latin_1
2024-10-02 08:40:59,857 | DEBUG | Encoding detection: latin_1 is most likely the one.
{
    "path": "/tmp/finnish-utf-8-latin-1-confusion.html",
    "encoding": "latin_1",
    "encoding_aliases": [
        "8859",
        "cp819",
        "csisolatin1",
        "ibm819",
        "iso8859",
        "iso8859_1",
        "iso_8859_1",
        "iso_8859_1_1987",
        "iso_ir_100",
        "l1",
        "latin",
        "latin1"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.533,
    "coherence": 65.6,
    "unicode_path": null,
    "is_preferred": true
}
Notice
I hereby announce that my raw input is not :
Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.
https://jouniseppanen.fi/tmp/finnish-utf-8-latin-1-confusion.html
(Note that the web server adds a content type of text/html; charset=utf-8 which is correct, so your browser will likely show the text correctly.)
Verbose output
Expected encoding
This should be UTF-8. One clue is that the output includes the word PÃ¤Ã¤tÃ¶sehdotus, which is a mangled version of Päätösehdotus.
Most nontrivial Finnish text will include several instances of the character ä and possibly ö. Upper-case versions Ä and Ö are possible but less common. When UTF-8 is interpreted as Latin-1 or Windows-1252, these become Ã¤, Ã¶, Ã„ and Ã–.
The characters Ã, ¤, ¶ and „ do not appear in normal Finnish text. Ã could possibly appear in foreign names, but even then would be very unlikely in the middle of a word. ¤ is an obscure "currency sign" character whose code point Latin-9 (ISO-8859-15) reassigned to the euro sign, which does occur in Finnish text but would still be very unlikely in the combination Ã€. (The pilcrow might appear in some typography text and the lowered quote might appear in old-fashioned literature. The en dash is normal.)
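The mangling described above can be reproduced with Python's standard codecs. This is only an illustrative sketch: the sample word is taken from the file, while the variable names are mine and nothing here reflects how the detector works internally.

```python
# Encode Finnish text as UTF-8, then decode the bytes with the wrong codec.
word = "Päätösehdotus"
raw = word.encode("utf-8")

# Windows-1252 misreading: every non-ASCII letter becomes two characters.
mangled = raw.decode("cp1252")
print(mangled)

# Show what each of the four Finnish letters turns into.
for ch in "äöÄÖ":
    print(ch, "->", ch.encode("utf-8").decode("cp1252"))

# ISO-8859-15 (Latin-9) maps byte 0xA4 to the euro sign instead of the currency sign.
latin9 = "ä".encode("utf-8").decode("iso8859_15")
print(latin9)

# The misreading is reversible as long as no bytes were lost along the way.
restored = mangled.encode("cp1252").decode("utf-8")
print(restored == word)
```

The round trip in the last step is what makes this kind of confusion recognizable: text that survives re-encoding with the wrong codec and then decodes cleanly as UTF-8 was very likely UTF-8 to begin with.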
Desktop (please complete the following information):
Additional context
My guess is that this kind of thing happens when someone set up a CMS in the 1990s when Finnish text was commonly encoded in Latin-1 or Windows-1252, and later the data store was changed to use UTF-8 but the meta tags were neglected.
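A rough way to spot this kind of mismatch is to compare the charset declared in the markup with whether the bytes decode strictly as UTF-8. The sketch below uses only the standard library; the helper names, the simplified regular expression and the sample page are all hypothetical, and the actual meta tag in the linked file may look different.

```python
import re

def declared_charset(raw: bytes):
    """Return the charset named in a <meta> tag, or None (simplified regex, not a full HTML parser)."""
    match = re.search(rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_-]+)", raw, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None

def is_valid_utf8(raw: bytes) -> bool:
    """Strict UTF-8 decoding fails quickly on typical Latin-1 encoded text."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical page: the markup still claims Latin-1, but the bytes are UTF-8.
html = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><p>Päätösehdotus</p>'
page = html.encode("utf-8")

print(declared_charset(page))   # charset named in the markup
print(is_valid_utf8(page))      # whether the bytes are well-formed UTF-8

# The same text genuinely encoded as Latin-1 is not valid UTF-8.
print(is_valid_utf8("Päätösehdotus".encode("latin-1")))
```

When the declaration says Latin-1 but the bytes are well-formed UTF-8 containing non-ASCII characters, the declaration is the more likely culprit, since Latin-1 text with accented letters rarely happens to form valid UTF-8 sequences.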