salt/utils/files.py is_text — false "is not text" results with UTF-8 #66706

@NMi-ru

Description

The is_text function returns false negatives (text input is flagged as non-text/binary) when given UTF-8 text containing multibyte characters.

Setup

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • [ X ] container (Kubernetes, Docker, containerd, etc. please specify) — LXC running in Proxmox VE
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • [ X ] classic packaging — RPM install
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior

One consequence is that the diff produced by the file.* states cannot show the changes made to our config files:

{{sls}}__files:
  file.recurse:
    - name: /config/bird/
    - source: salt://modules/router-int/files/

protocol-static4.txt

salt-call state.apply …

     Changes:
              ----------
              /config/bird/protocol-static4:
                  ----------
                  diff:
                      Replace text file with binary file

Expected behavior
The is_text function should return True ("this is text") for all multibyte UTF-8 text files.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3007.1

Python Version:
        Python: 3.10.14 (main, Apr  3 2024, 21:30:09) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: 18.8.0
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.16.0
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: centos 9
        locale: utf-8
       machine: x86_64
       release: 6.5.13-1-pve
        system: Linux
       version: CentOS Stream 9

Additional context
My take on what's happening:

Non-ASCII UTF-8 characters (Cyrillic, for example) are multibyte. Example: capital Cyrillic "A" (А) is 0xD0 0x90.
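To make the byte layout concrete, here is a small Python sketch (not part of Salt; the variable names are only illustrative) that encodes the example character and shows that its lead byte on its own is not valid UTF-8:

```python
# Capital Cyrillic "А" is code point U+0410; UTF-8 encodes it as two bytes.
capital_a = "\u0410"
encoded = capital_a.encode("utf-8")
print(encoded.hex(" "))  # d0 90

# Keeping only the lead byte (0xD0) leaves an incomplete sequence,
# which a strict UTF-8 decode rejects.
lead_byte_only = encoded[:1]
try:
    lead_byte_only.decode("utf-8")
    lead_byte_decodes = True
except UnicodeDecodeError:
    lead_byte_decodes = False

print(lead_byte_decodes)  # False
```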

The is_text function reads the first 512 bytes of its input and then tries to decode that block as UTF-8:

642: def is_text(fp_, blocksize=512):

655: block = fp_.read(blocksize)
or
661: block = fp2_.read(blocksize)

672: block.decode("utf-8")

674: except UnicodeDecodeError:

678: return float(len(nontext)) / len(block) <= 0.30

If we're out of luck, the 512-byte cut splits a multibyte UTF-8 character in half, leaving only its first byte (0xD0, for example) at the end of the block. The block is then invalid UTF-8 (see lines 672/674), which in turn may lead (with some probability, see line 678) to a false "this is not text"/"binary" result.
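The behavior described above can be reproduced with a simplified stand-in that follows only the steps quoted from the source (read a block, try to decode, fall back to a ratio check). This is a sketch, not Salt's actual code: `is_text_sketch` and `TEXT_BYTES` are hypothetical names, and the set of bytes counted as "text" is an assumption.

```python
import io

# Assumption: bytes treated as "text" are printable ASCII plus a few
# common control characters; everything else counts as non-text.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\f\b"


def is_text_sketch(fp_, blocksize=512):
    """Simplified stand-in for the steps quoted in this report."""
    block = fp_.read(blocksize)
    if not block:
        return True
    try:
        block.decode("utf-8")
        return True
    except UnicodeDecodeError:
        pass
    # Fallback: share of bytes that are not in TEXT_BYTES.
    nontext = block.translate(None, TEXT_BYTES)
    return len(nontext) / len(block) <= 0.30


# 300 two-byte characters: the 512-byte cut falls on a character boundary.
aligned = ("\u0410" * 300).encode("utf-8")
# One ASCII byte in front shifts every character by one, so the cut
# lands between 0xD0 and 0x90.
shifted = ("a" + "\u0410" * 300).encode("utf-8")
# Mostly ASCII: the cut still splits a character, but the ratio check passes.
mostly_ascii = ("a" * 511 + "\u0410").encode("utf-8")

print(is_text_sketch(io.BytesIO(aligned)))       # True
print(is_text_sketch(io.BytesIO(shifted)))       # False
print(is_text_sketch(io.BytesIO(mostly_ascii)))  # True
```

All three inputs are valid UTF-8 as a whole; only the position of the cut and the share of non-ASCII bytes differ, which matches the "with some probability" remark above.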

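The first 512 bytes of the attached file (protocol-static4.txt) end with 0xD0. Note that hexdump prints 16-bit words in host byte order by default, so on this little-endian machine the two bytes of each word appear swapped: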

dd if=protocol-static4.txt bs=1 count=512 | hexdump
…
00001f0 d0be d0bb d0b3 d0be d0b3 20be d028 d0b4
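As a cross-check of the hexdump row above, the following sketch rebuilds the 16 bytes from the printed words (assuming little-endian 16-bit words, as on x86_64) and decodes them. It also shows that Python's standard incremental UTF-8 decoder, called with `final=False`, accepts a block whose only defect is an unfinished trailing character. This is an illustration of the idea, not a proposed patch for Salt.

```python
import codecs

# Words exactly as printed by hexdump for offset 0x1f0.
hexdump_words = "d0be d0bb d0b3 d0be d0b3 20be d028 d0b4".split()

# Swap the two bytes of each word to recover the on-disk byte order.
row = b"".join(bytes.fromhex(word)[::-1] for word in hexdump_words)
print(hex(row[-1]))  # 0xd0, the last byte of the 512-byte block

# The first byte (0xBE) continues a character started in the previous row,
# so skip it to begin on a character boundary.
tail = row[1:]

# A strict one-shot decode fails because of the dangling 0xD0.
try:
    tail.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# An incremental decoder with final=False decodes the complete characters
# and buffers the incomplete trailing byte instead of raising.
decoder = codecs.getincrementaldecoder("utf-8")()
decoded = decoder.decode(tail, final=False)

print(strict_ok)  # False
print(decoded)    # лгого (д
```

Genuinely invalid bytes (for example 0xFF) still raise UnicodeDecodeError with the incremental decoder, so this approach only forgives a sequence that was cut off at the end of the block.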


Labels: bug (broken, incorrect, or confusing behavior)
