dEajL3kA commented Jun 14, 2021

Until now, curl_getenv() has internally used GetEnvironmentVariableA() on Windows, even in Unicode builds. This does not work correctly if the environment variable contains characters that cannot be represented in the system's ANSI codepage. A typical example is the HOME environment variable, which may contain a string like C:\Users\Βαρουφάκης, depending on the user name.
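
A minimal sketch of a Unicode-aware lookup of the kind described above, not the code from this PR. It assumes the wide-character Win32 call GetEnvironmentVariableW() followed by a conversion to UTF-8 with WideCharToMultiByte(). The helper name `getenv_utf8` is hypothetical, and treating an empty value as unset is an assumption of the sketch. Off Windows it simply copies what getenv() returns, so the block builds on either platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not curl's API): returns a malloc'ed copy of the
 * variable's value, UTF-8 encoded on Windows, or NULL if it is unset or
 * empty. The caller frees the result. */
char *getenv_utf8(const char *name)
{
#ifdef _WIN32
  wchar_t *wname = NULL;
  wchar_t *wval = NULL;
  char *result = NULL;
  DWORD need, got;
  int n;

  assert(name != NULL);

  /* Assumption: the variable name is UTF-8 (plain ASCII in practice). */
  n = MultiByteToWideChar(CP_UTF8, 0, name, -1, NULL, 0);
  if(n <= 0)
    return NULL;
  wname = malloc((size_t)n * sizeof(wchar_t));
  if(!wname || !MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, n))
    goto done;

  /* With a zero-sized buffer the call reports the size needed, in wide
   * characters including the terminating NUL; 0 means "not found". */
  need = GetEnvironmentVariableW(wname, NULL, 0);
  if(!need)
    goto done;
  wval = malloc((size_t)need * sizeof(wchar_t));
  if(!wval)
    goto done;
  got = GetEnvironmentVariableW(wname, wval, need);
  if(!got || got >= need) /* empty, or changed between the two calls */
    goto done;

  /* Convert UTF-16 to UTF-8; -1 makes the length include the NUL. */
  n = WideCharToMultiByte(CP_UTF8, 0, wval, -1, NULL, 0, NULL, NULL);
  if(n <= 0)
    goto done;
  result = malloc((size_t)n);
  if(result &&
     !WideCharToMultiByte(CP_UTF8, 0, wval, -1, result, n, NULL, NULL)) {
    free(result);
    result = NULL;
  }

done:
  free(wname);
  free(wval);
  return result;
#else
  /* Off Windows, getenv() already returns the bytes as stored. */
  const char *val;
  size_t len;
  char *copy;

  assert(name != NULL);
  val = getenv(name);
  if(!val || !val[0])
    return NULL;
  len = strlen(val) + 1;
  copy = malloc(len);
  if(copy)
    memcpy(copy, val, len);
  return copy;
#endif
}
```

Because the value never passes through the ANSI codepage, a name such as Βαρουφάκης should survive intact. Whether callers are prepared to receive UTF-8 is a separate question.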

@bagder bagder added the Windows Windows-specific label Jun 14, 2021
bagder (Member) commented Jun 14, 2021

libtool: compile:  x86_64-w64-mingw32-gcc -DHAVE_CONFIG_H -I../include -I../lib -I../lib -DBUILDING_LIBCURL -DCURL_STATICLIB -isystem C:/msys64/mingw64/include -isystem C:/msys64/mingw64/include -DWINVER=0x0600 -Werror-implicit-function-declaration -O2 -Wno-system-headers -Wenum-conversion -Werror -pedantic-errors -MT libcurl_la-getenv.lo -MD -MP -MF .deps/libcurl_la-getenv.Tpo -c getenv.c -o libcurl_la-getenv.o
getenv.c: In function 'GetEnvWin32':
getenv.c:45:14: error: returning 'void *' from a function with return type 'TCHAR' {aka 'char'} makes integer from pointer without a cast [-Wint-conversion]
   45 |       return NULL;
      |              ^~~~
getenv.c:56:14: error: returning 'void *' from a function with return type 'TCHAR' {aka 'char'} makes integer from pointer without a cast [-Wint-conversion]
   56 |       return NULL;
      |              ^~~~
getenv.c:61:14: error: returning 'TCHAR *' {aka 'char *'} from a function with return type 'TCHAR' {aka 'char'} makes integer from pointer without a cast [-Wint-conversion]
   61 |       return buf;
      |              ^~~
getenv.c: In function 'GetEnv':
getenv.c:87:10: error: returning 'TCHAR' {aka 'char'} from a function with return type 'char *' makes pointer from integer without a cast [-Wint-conversion]
   87 |   return GetEnvWin32(variable);
      |          ^~~~~~~~~~~~~~~~~~~~~
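
The four diagnostics above are consistent with the helper being declared to return `TCHAR` rather than `TCHAR *` (a missing `*`). The PR's source is not shown in this thread, so that is an inference. A minimal sketch of the corrected signature follows. The function name is taken from the log, but the body is hypothetical, and the `typedef` is a stand-in so the block also compiles off Windows.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#else
typedef char TCHAR; /* stand-in so the sketch also builds off Windows */
#endif

/* Note the '*' in the return type: the function hands back a heap buffer
 * (or NULL), not a single TCHAR. The body is a hypothetical illustration. */
TCHAR *GetEnvWin32(const TCHAR *variable)
{
#ifdef _WIN32
  DWORD size, rc;
  TCHAR *buf;

  assert(variable != NULL);
  /* Size needed in TCHARs, including the terminating NUL; 0 if not found. */
  size = GetEnvironmentVariable(variable, NULL, 0);
  if(!size)
    return NULL;
  buf = malloc((size_t)size * sizeof(TCHAR));
  if(!buf)
    return NULL;
  rc = GetEnvironmentVariable(variable, buf, size);
  if(!rc || rc >= size) { /* empty, or changed between the two calls */
    free(buf);
    return NULL;
  }
  return buf;
#else
  const char *env;
  size_t len;
  TCHAR *buf;

  assert(variable != NULL);
  env = getenv(variable);
  if(!env || !env[0])
    return NULL;
  len = strlen(env) + 1;
  buf = malloc(len);
  if(buf)
    memcpy(buf, env, len);
  return buf;
#endif
}
```

With a pointer return type, `return NULL;` and `return buf;` are both well-typed. In a non-Unicode build, where `TCHAR` is `char` as in the log, the result can also be returned from a function declared to return `char *`.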

jay (Member) commented Jun 14, 2021

curl_getenv is a public API function that returns the value in the local codepage. I think there was some discussion about whether to return UTF-8 here. We'd break existing legacy clients. (On second thought, I don't know if I should refer to this as legacy behavior. This is, as of now, expected behavior. It functions like the user's getenv.)

dEajL3kA (Author) commented Jun 14, 2021

> We'd break existing legacy clients.

Maybe that would be the lesser of two evils 😈

Current behaviour, with the "official" curl binary (Unicode), is as follows (not good):

[Screenshot: image not preserved]

Or, maybe, internally use a different function that is Unicode aware?

vszakats (Member) commented Jun 15, 2021

Given that this function is already deprecated (since 2004), my initial thought was that we should rather just document the unexpected behaviour in UNICODE builds and leave it as-is.

But, the issue here is that this function is actively used inside curl (and libcurl), so the values it returns unexpectedly introduce 8-bit strings into UNICODE builds internally, instead of UTF-8. That's most likely not what those callers (or users) expect. E.g. it may cause .netrc and SSH keys to be read from the wrong path for users whose username contains non-ASCII characters, or a proxy to be misread if the URL has non-ASCII characters.

So to operate on UTF-8 as intended, either this function should be fixed (by breaking compatibility), or a new (internal-only) version should be introduced while leaving this one as-is, with curl/libcurl upgraded to use the new (UTF-8 capable) version internally.

jay (Member) commented Jun 16, 2021

The function should not be changed or broken. I'll add a GetEnv in the curl tool that returns UTF-8 for Windows Unicode builds.

jay added a commit to jay/curl that referenced this pull request Jul 10, 2021
- For Windows Unicode builds add tool_getenv_utf8 to get Unicode UTF-8
  encoded environment variable values.

- Add tool_getenv_local to getenv in the current locale encoding. This
  is the equivalent of curl_getenv which is now banned in favor of
  tool_getenv.

- Map tool_getenv macro to tool_getenv_utf8 for Windows Unicode and
  tool_getenv_local otherwise.

- Similar to above, split homedir into homedir_utf8, homedir_local
  and a homedir macro which maps to one of the first two.

Background:

Windows does not support UTF-8 locale (or, not really). Because of that
our Windows Unicode builds continue to use the current locale, but
expect Unicode UTF-8 encoded paths for internal use. In other words file
operations by curl or libcurl to open / access / stat a file are
expected to have a Unicode path.

Complicating this is that dependencies, which may or may not be Unicode
builds, may or may not expect UTF-8 encoded Unicode for char * string
pathnames. For example, libcurl can use libssh or libssh2 as an SSH
library and pathnames need to be passed in the local encoding AFAICS.

Prior to this change the curl tool made a best effort to convert
pathnames from the command line to Unicode UTF-8 encoding, but did not
do so for those pathnames retrieved by environment variables via
curl_getenv, which is in the local encoding. This worked without much
incident but was incorrect.

Essentially, for Windows Unicode builds we need access to both local
encoded and UTF-8 encoded pathnames from the environment. Therefore,
curl_getenv is not sufficient. In order to make this obvious in the
code I've banned curl_getenv (via checksrc) in favor of new
tool_getenv_local, tool_getenv_utf8 and tool_getenv and made similar
changes to homedir, as described.

The end result is something like CURLOPT_SSH_KNOWNHOSTS which is passed
to the SSH library and always needs the local encoding (in other words,
even if it's a Windows Unicode build) can now be made by calling
homedir_local (which calls tool_getenv_local). But, if it's a path that
is used internally only then just homedir can be used (which calls
tool_getenv_utf8 for Windows Unicode, and tool_getenv_local otherwise).

Also of note is 765e060, which removed local encoding fallbacks and
is another reason to make these changes.

Fixes curl#7252
Closes curl#7281
jay added a commit to jay/curl that referenced this pull request Jul 16, 2021
- For Windows Unicode builds add Curl_getenv_utf8 to get Unicode UTF-8
  encoded environment variable values.

- Add Curl_getenv_local to get the environment variable in the current
  locale encoding.

- Map Curl_getenv macro to Curl_getenv_utf8 for Windows Unicode and
  Curl_getenv_local otherwise.

- Ban public getenv, curlx_getenv, and curl_getenv from being called in
  curl/libcurl in favor of Curl_getenv.

- Add functions to check if an environment variable exists:
  Curl_env_exist_utf8, Curl_env_exist_local and macro Curl_env_exist
  which maps to one of the first two.

- Similar to above, split homedir into homedir_utf8, homedir_local
  and a homedir macro which maps to one of the first two.

The Curl_getenv functions are curlx functions that are compiled in both
the lib and tool.

Background:

Windows does not use or support UTF-8 as the current locale (or, not
really). The Windows Unicode builds of curl/libcurl use the current
locale, but expect Unicode UTF-8 encoded paths for internal use such as
open, access and stat.

Prior to this change Windows Unicode builds of the curl tool made a best
effort to convert pathnames from the command line to Unicode UTF-8
encoding, but curl/libcurl did not do so for those pathnames retrieved
by environment variables via getenv/curl_getenv, which returns the local
encoding. This worked without much incident but was incorrect if the
path was going to be opened by curl/libcurl. However, it was correct for
dependencies that always expect paths in the local encoding.

Basically, for Windows Unicode builds we need access to environment
variables in both UTF-8 and current locale encoding. There is more
detail about this issue in the getenv.h comment block.

Also of note is 765e060, which removed local encoding fallbacks and
is another reason to make these changes.

Fixes curl#7252
Closes curl#7281
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Unicode support in libcurl does not
seem to be ready for production. Existing support extended certain Windows
interfaces to use the Unicode flavour of the Windows API, but that also
meant that the expected encoding/codepage of strings (e.g. local filenames,
URLs) exchanged via the libcurl API became ambiguous and undefined.
Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API/option, certain dynamic and static "fallback" logic inside
libcurl and even in OpenSSL, while some parts of libcurl kept using 8-bit
strings internally. From the user's perspective this poses an unreasonably
difficult task in finding out how to pass a certain non-ASCII string to a
specific API without unwanted or accidental (possibly lossy) conversions or
other side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, finding different files,
accessing the wrong URL or passing a corrupt username or password.

Note that these issues may _only_ affect strings with _non-ASCII_ content.
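
As a rough illustration of that note, a caller could check whether a given string is pure 7-bit ASCII, and therefore byte-identical in UTF-8 and in the usual 8-bit Windows codepages, before worrying about which encoding an API expects. This is a minimal sketch; the helper name `is_7bit_ascii` is hypothetical and not part of libcurl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte of s is in the 7-bit ASCII
 * range (0x00-0x7F), 0 otherwise. */
int is_7bit_ascii(const char *s)
{
  const unsigned char *p;

  assert(s != NULL);
  for(p = (const unsigned char *)s; *p; p++) {
    if(*p > 0x7F)
      return 0; /* byte outside ASCII: the encoding now matters */
  }
  return 1;
}
```

A path such as C:\Users\alice passes this check, while one containing Βαρουφάκης does not, and only the latter is exposed to the ambiguity described here.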

For now the best solution seems to be to revert back to how libcurl/curl
worked for most of its existence and only re-enable Unicode once the
remaining parts of Windows Unicode support are well-understood, ironed out
and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Windows Unicode support in libcurl does
not seem to be ready for production. Existing support extended certain
Windows interfaces to use the Unicode flavour of the Windows API, but that
also meant that the expected encoding/codepage of strings (e.g. local
filenames, URLs) exchanged via the libcurl API became ambiguous and
undefined.

Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API, build options/dependencies, internal fallback logic based on
runtime auto-detection of passed string, and the result of file operations
(scheduled for removal in 7.78.0), while some parts of libcurl kept using
8-bit strings internally, e.g. when reading the environment.

From the user's perspective this poses an unreasonably complex task in
finding out how to pass (or read) a certain non-ASCII string to (from) a
specific API without unwanted or accidental conversions or other
side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, reading/writing a
different file, accessing the wrong URL or passing a corrupt username or
password.

Note that these issues may only affect strings with _non-7-bit-ASCII_
content.

For now the least bad solution seems to be to revert back to how
libcurl/curl worked for most of its existence and only re-enable Unicode
once the remaining parts of Windows Unicode support are well-understood,
ironed out and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
jay added a commit to jay/curl that referenced this pull request Aug 13, 2022
jay added a commit that referenced this pull request Aug 16, 2022
jay (Member) commented Aug 16, 2022

Closing, we aren't going to change curl_getenv. An attempt to add a different getenv for the curl tool that returns a UTF-8 encoded path in Windows Unicode builds has died, and this is now known bug 5.14, "Windows Unicode builds use homedir in current locale".

@jay jay closed this Aug 16, 2022
jquepi pushed a commit to jquepi/curl.1.555 that referenced this pull request Oct 24, 2022
@vszakats vszakats added the Unicode Unicode, code page, character encoding label Feb 16, 2023