salt/utils/files.py is_text — false "is not text" results with UTF-8

**Description**
The **is_text** function returns false negatives (text input is flagged as non-text/binary) when given UTF-8 text that contains multibyte characters.


**Setup**
- [ ] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [x] container (Kubernetes, Docker, containerd, etc. please specify): LXC running in Proxmox VE
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [x] classic packaging: RPM install
- [ ] onedir packaging
- [ ] used bootstrap to install


**Steps to Reproduce the behavior**

One consequence is that the file.* states cannot show a diff of our config file changes. For example, with this state:

```
{{sls}}__files:
  file.recurse:
    - name: /config/bird/
    - source: salt://modules/router-int/files/
```

[protocol-static4.txt](https://github.com/user-attachments/files/16194411/protocol-static4.txt)

```
salt-call state.apply …

     Changes:
              ----------
              /config/bird/protocol-static4:
                  ----------
                  diff:
                      Replace text file with binary file
```

**Expected behavior**
The is_text function should return True ("this is text") for all multibyte UTF-8 text files.

**Versions Report**
<details><summary>salt --versions-report</summary>

```yaml
Salt Version:
          Salt: 3007.1

Python Version:
        Python: 3.10.14 (main, Apr  3 2024, 21:30:09) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: 18.8.0
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.16.0
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: centos 9
        locale: utf-8
       machine: x86_64
       release: 6.5.13-1-pve
        system: Linux
       version: CentOS Stream 9
```
</details>

**Additional context**
My take on what's happening:

Non-ASCII UTF-8 characters (Cyrillic, for example) are multibyte. Example: capital Cyrillic "A" (А) is 0xD0 0x90.
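
As a quick check of the byte values above, this minimal Python snippet (not part of Salt) confirms the two-byte encoding and shows that the lead byte alone cannot be decoded:

```python
# Capital Cyrillic "А" (U+0410) encodes to two bytes in UTF-8.
encoded = "\u0410".encode("utf-8")
print(encoded)       # b'\xd0\x90'
print(len(encoded))  # 2

# The lead byte on its own is an incomplete sequence and cannot be decoded.
try:
    encoded[:1].decode("utf-8")
    lead_byte_decodes = True
except UnicodeDecodeError:
    lead_byte_decodes = False
print(lead_byte_decodes)  # False
```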

The "is_text" function reads the first 512 bytes (the default blocksize) of its input, then calls "decode" on that block:

https://github.com/saltstack/salt/blob/bfc78d7646fd12443337d5840dfb2927dd889f37/salt/utils/files.py#L642

```
642: def is_text(fp_, blocksize=512):

655: block = fp_.read(blocksize)
or
661: block = fp2_.read(blocksize)

672: block.decode("utf-8")

674: except UnicodeDecodeError:

678: return float(len(nontext)) / len(block) <= 0.30
```

If we're out of luck, the 512-byte cut splits a multibyte UTF-8 character in half, leaving only its first byte (0xD0, for example) at the end of the block. The block is then invalid UTF-8 (see lines 672/674), and the fallback ratio check (see line 678) may, with some probability, produce a false "this is not text"/"binary" result.
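
The scenario above can be reproduced with a simplified stand-in for the quoted logic. This is a sketch, not the verbatim Salt source: the name `is_text_sketch`, the set of bytes treated as "text", and the sample data are all assumptions made for illustration.

```python
def is_text_sketch(block):
    """Simplified stand-in for the quoted logic; NOT the verbatim Salt source."""
    if not block:
        return True
    try:
        block.decode("utf-8")
        return True
    except UnicodeDecodeError:
        pass
    # Assumption: "nontext" means bytes outside printable ASCII and common whitespace.
    text_bytes = bytes(range(32, 127)) + b"\n\r\t\f\b"
    nontext = block.translate(None, text_bytes)
    return len(nontext) / len(block) <= 0.30


# Hypothetical sample: one ASCII byte shifts every two-byte Cyrillic character
# to an odd offset, so byte 512 is a lead byte (0xD0) with no continuation byte.
data = ("#" + "\u0410" * 400).encode("utf-8")
block = data[:512]

print(hex(block[-1]))         # 0xd0
print(is_text_sketch(data))   # True: the whole content is valid UTF-8
print(is_text_sketch(block))  # False: decode fails, then the ratio check fails
```

With mostly Cyrillic content nearly every byte counts as non-text, so once the strict decode fails the 0.30 threshold is exceeded by a wide margin.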

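The first 512 bytes of the attached file (protocol-static4.txt) end with 0xD0. Note that hexdump prints byte-swapped 16-bit words, so the trailing `d0b4` is the byte pair 0xB4 0xD0: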

```
dd if=protocol-static4.txt bs=1 count=512 | hexdump
…
00001f0 d0be d0bb d0b3 d0be d0b3 20be d028 d0b4
```
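
One possible way to tolerate a cut at the block boundary, offered as a sketch and not as the fix adopted by Salt, is an incremental decoder. With `final=False` it buffers an incomplete trailing sequence instead of raising, while invalid bytes elsewhere still raise `UnicodeDecodeError`. The helper name `decodes_as_utf8_prefix` is hypothetical.

```python
import codecs


def decodes_as_utf8_prefix(block):
    """Return True if block is valid UTF-8 apart from a possibly truncated final character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: an incomplete sequence at the end is buffered, not an error.
        decoder.decode(block, final=False)
        return True
    except UnicodeDecodeError:
        return False


# Ends like the block in the hexdump above: ... d0 b4 d0
truncated = "\u0434".encode("utf-8") + b"\xd0"
print(decodes_as_utf8_prefix(truncated))         # True
print(decodes_as_utf8_prefix(b"\xff\xfe\x00a"))  # False
```

This only relaxes the strict decode step; blocks that still raise would fall through to the existing ratio check.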
