Description
The is_text function gives false negatives (text input is flagged as non-text/binary) when given UTF-8 text that contains multibyte characters.
Setup
- on-prem machine
- VM (Virtualbox, KVM, etc. please specify)
- VM running on a cloud service, please be explicit and add details
- [ X ] container (Kubernetes, Docker, containerd, etc. please specify) — LXC running in Proxmox VE
- or a combination, please be explicit
- jails if it is FreeBSD
- [ X ] classic packaging — RPM install
- onedir packaging
- used bootstrap to install
Steps to Reproduce the behavior
One consequence is that the file.* states cannot show a diff of our config file changes:
{{sls}}__files:
  file.recurse:
    - name: /config/bird/
    - source: salt://modules/router-int/files/
salt-call state.apply …
Changes:
----------
/config/bird/protocol-static4:
----------
diff:
Replace text file with binary file
Expected behavior
is_text function should return True ("this is text" result) for all multibyte UTF-8 text files.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
Salt: 3007.1
Python Version:
Python: 3.10.14 (main, Apr 3 2024, 21:30:09) [GCC 11.2.0]
Dependency Versions:
cffi: 1.16.0
cherrypy: 18.8.0
dateutil: 2.8.2
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.4
libgit2: Not Installed
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.16.0
smmap: Not Installed
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4
Salt Package Information:
Package Type: onedir
System Versions:
dist: centos 9
locale: utf-8
machine: x86_64
release: 6.5.13-1-pve
system: Linux
version: CentOS Stream 9
Additional context
My take on what's happening:
Non-ASCII UTF-8 characters (Cyrillic, for example) are multibyte. Example: capital Cyrillic "A" (А) is 0xD0 0x90.
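The byte values above can be checked in a Python shell. This is only a quick illustration using the standard library; the variable names are made up for the example:

```python
# Capital Cyrillic "А" (U+0410) takes two bytes in UTF-8.
cyrillic_a = "\u0410"
encoded = cyrillic_a.encode("utf-8")
print(encoded.hex(" "))  # d0 90

# 0xD0 on its own is a lead byte that still needs one continuation byte,
# so decoding only the first half of the character fails.
try:
    encoded[:1].decode("utf-8")
    half_decodes = True
except UnicodeDecodeError:
    half_decodes = False

print(half_decodes)
```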
"is_text" function gets its input, snips 512 bytes, then feeds it to the "decode" function:
Line 642 in bfc78d7
642: def is_text(fp_, blocksize=512):
655: block = fp_.read(blocksize)
or
661: block = fp2_.read(blocksize)
672: block.decode("utf-8")
674: except UnicodeDecodeError:
678: return float(len(nontext)) / len(block) <= 0.30
If we're out of luck, the 512-byte cut splits a multibyte UTF-8 character in half and leaves only its first byte (0xD0, for example) at the end of the block. The block is then invalid UTF-8 (see lines 672/674), so the function falls through to the ratio check on line 678, which for text that is mostly non-ASCII returns a false "this is not text"/"binary" result.
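The failure described above can be reproduced outside Salt with a small sketch. It is not Salt's code: the helper names are hypothetical, and the set of "text" bytes (printable ASCII plus common whitespace) is an assumption made for the illustration. The second helper shows one possible way to tolerate a character cut off at the end of the block, using the standard library's incremental decoder.

```python
import codecs

BLOCKSIZE = 512  # default from is_text(fp_, blocksize=512)

# One ASCII byte followed by two-byte Cyrillic letters: every 0xD0 lead byte
# sits at an odd offset, so offset 511 (the last byte of the block) is a
# lead byte whose continuation byte falls outside the block.
data = b"#" + ("\u0410" * 300).encode("utf-8")
block = data[:BLOCKSIZE]


def strict_decode_ok(block: bytes) -> bool:
    """Hypothetical helper: a plain decode of the whole block."""
    try:
        block.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def tolerant_decode_ok(block: bytes) -> bool:
    """Hypothetical helper: accept a block whose only defect is a
    multibyte sequence truncated at the very end."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False keeps an incomplete trailing sequence buffered instead
        # of raising; invalid bytes anywhere else still raise.
        decoder.decode(block, final=False)
        return True
    except UnicodeDecodeError:
        return False


# Assumed definition of "text" bytes for the ratio check.
text_characters = bytes(range(32, 127)) + b"\n\r\t\f\b"
nontext = block.translate(None, text_characters)
ratio = len(nontext) / len(block)

print(strict_decode_ok(block))    # the strict decode rejects the block
print(ratio <= 0.30)              # the ratio check then rejects it as well
print(tolerant_decode_ok(block))  # the incremental decoder accepts it
```

Trimming up to three trailing bytes before decoding, or reading a few bytes past the block, would be alternative approaches; which one suits is_text best is left open here.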
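The first 512 bytes of the attached file (protocol-static4.txt) end with 0xD0. Note that hexdump prints 16-bit little-endian words by default, so the final word d0b4 stands for the bytes 0xB4 0xD0: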
dd if=protocol-static4.txt bs=1 count=512 | hexdump
…
00001f0 d0be d0bb d0b3 d0be d0b3 20be d028 d0b4
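To read the last dump line in file order, the words can be byte-swapped. A minimal sketch, assuming the dump was produced on a little-endian machine (x86_64 here) with hexdump's default output format:

```python
# Last line of the hexdump output quoted above.
line = "00001f0 d0be d0bb d0b3 d0be d0b3 20be d028 d0b4"
offset, *words = line.split()

# Each word is a 16-bit value printed in host order; on a little-endian
# machine the low byte comes first in the file.
raw = b"".join(int(word, 16).to_bytes(2, "little") for word in words)

print(raw.hex(" "))
print(int(offset, 16) + len(raw))  # offset of the byte just past this line
print(hex(raw[-1]))                # final byte of the 512-byte block
```

The line starts at offset 0x1f0 (496) and holds 16 bytes, so its final byte is byte number 512 of the file, which is exactly where the block read by is_text ends.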