Fix #70091: Phar does not mark UTF-8 filenames in ZIP archives #6630
cmb69 wants to merge 3 commits into php:PHP-7.4
The default encoding of filenames in a ZIP archive is IBM Code Page 437. Phar, however, only supports UTF-8 filenames. Therefore we have to mark non-ASCII filenames as being stored in UTF-8 by setting general purpose bit 11 (the language encoding flag). The effect of not setting this bit for non-ASCII filenames can be seen in popular tools like 7-Zip and UnZip, but not when extracting the archives via ext/phar (which is agnostic to the filename encoding), or via ext/zip (which guesses the encoding). Thus we add a somewhat brittle low-level test case.
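As a rough illustration of the approach described above, the following sketch checks whether a filename is pure ASCII and, if not, sets bit 11 in a two-byte little-endian flags field. The names `name_is_ascii`, `set_utf8_flag`, `mark_name_encoding`, and `ZIP_FLAG_UTF8` are hypothetical and chosen for this example; they are not taken from ext/phar, whose actual code uses its own macros and structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit 11 of the ZIP general purpose bit flag: filename and comment are UTF-8. */
#define ZIP_FLAG_UTF8 0x0800u

/* Hypothetical helper: returns 1 if every byte of the name is 7-bit ASCII. */
static int name_is_ascii(const char *name, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)name[i] > 0x7F) {
            return 0;
        }
    }
    return 1;
}

/* Set the UTF-8 flag in a 2-byte little-endian flags field, preserving other bits. */
static void set_utf8_flag(unsigned char flags[2])
{
    uint16_t v = (uint16_t)(flags[0] | (flags[1] << 8));
    v |= ZIP_FLAG_UTF8;
    flags[0] = (unsigned char)(v & 0xFF);
    flags[1] = (unsigned char)(v >> 8);
}

/* Conditional variant: only names containing non-ASCII bytes get the flag. */
static void mark_name_encoding(unsigned char flags[2], const char *name, size_t len)
{
    if (!name_is_ascii(name, len)) {
        set_utf8_flag(flags);
    }
}
```

In a real archive the same flag value would need to be written to both the local file header and the central directory entry for each file.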
ext/phar/zip.c
Outdated
memcpy(central.datestamp, local.datestamp, sizeof(local.datestamp));
PHAR_SET_16(central.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
PHAR_SET_16(local.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
if (!is_ascii(entry->filename, entry->filename_len)) {
Just wondering, would just unconditionally setting the flag be fine? ASCII and UTF-8 are the same when only ASCII characters are used.
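The point about ASCII and UTF-8 coinciding can be checked with a small encoder. This is a minimal sketch, not code from the pull request: `utf8_encode` is a hypothetical helper that writes the UTF-8 bytes for one code point. It shows that code points below 0x80 encode to a single byte equal to the ASCII value, whereas anything higher becomes a multi-byte sequence that a CP437 reader would misinterpret.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is out of range.
 * For brevity, surrogate code points are not rejected.
 */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        /* ASCII range: the single byte is identical to the ASCII code. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Because an ASCII-only name has the same bytes with or without the UTF-8 flag, setting the flag on such a name should not change how a conforming tool displays it.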
I don't see a real problem doing this unconditionally; if a ZIP tool doesn't cater to that flag, there still shouldn't be a difference regarding ASCII-only filenames. OTOH, setting the flag conditionally wouldn't cause any behavioral change for ASCII-only filenames.
I pushed a commit which would set the flag unconditionally. I'm fine with either solution.
Always setting the flag is less code, so if that works, let's go for it :)