l10n: implement status language builtin #12106
danielrainer wants to merge 1 commit into fish-shell:master
Force-pushed from a747278 to 9c4d8dd.
| { | ||
| streams.err.append(L!("fish was built with the `localize-messages` feature disabled. The `status locale` command is unavailable.\n")); | ||
| return Err(STATUS_CMD_ERROR); | ||
| } else { |
There was a problem hiding this comment.
The else after return is a bit ugly.
If it's implied by cfg_if, that's another (superficial) reason for moving
from #[cfg(not(feature = "localize-messages"))] to if !cfg!(feature = "localize-messages").
There was a problem hiding this comment.
As it is implemented now, it wouldn't compile otherwise, because some of the gettext-related functions are only defined if the localize-messages feature is active. We could change that, but I think it is more robust to completely remove these functions when localization is disabled. Otherwise, someone could unknowingly use the useless variant in a context where localize-messages is disabled.
I agree that the else after return is ugly, but I don't know a better way to implement this.
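To illustrate the compile-time point above, here is a minimal sketch. Only the feature name `localize-messages` comes from the discussion; `localized_status` and `status_language_summary` are hypothetical stand-ins, not functions from the fish code base.

```rust
/// Hypothetical helper that only exists when localization is compiled in,
/// standing in for the gettext-related functions mentioned above.
#[cfg(feature = "localize-messages")]
fn localized_status() -> String {
    String::from("localization is active")
}

/// Hypothetical stand-in for the builtin's entry point.
fn status_language_summary() -> Result<String, String> {
    // Attribute form: the block that is not selected is removed before name
    // resolution, so a missing `localized_status` is never looked up.
    #[cfg(not(feature = "localize-messages"))]
    {
        return Err(String::from(
            "fish was built with the `localize-messages` feature disabled.",
        ));
    }
    #[cfg(feature = "localize-messages")]
    {
        Ok(localized_status())
    }
    // By contrast, `if !cfg!(feature = "localize-messages") { .. } else { .. }`
    // keeps both branches in the compiled code, so the call to
    // `localized_status()` would fail to resolve when the feature is off.
}
```

With the attribute form, the disabled build never sees the call to the missing function, which is why the two `#[cfg]` blocks cannot simply be merged into one runtime-style `if`.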
src/wutil/gettext.rs
Outdated
| } | ||
| wgettext_fmt!( | ||
| "Language specifiers appear repeatedly: %s\n", | ||
| format!("{:?}", self.duplicates) |
There was a problem hiding this comment.
:? uses Debug so it uses Rust syntax ([x, y] for slices, Rust string escape sequences etc).
We do that for FLOG output but for user-facing things it would probably make more sense to use shell-like syntax (space separated).
I think that's crate::common::escape().
(For things that might be legacy-encoded keys, we use char_to_symbol and its extension for byte slices (DisplayBytes))
Not sure if we have a convenient way to call it.
I guess it's not super important, but maybe we should add something like join_escaped_strings next to join_strings
(or lift it to an iterator-based interface)
There was a problem hiding this comment.
I primarily chose Debug syntax here because it is a simple way of showing a Vec. However, it also has an important advantage over plain, space-separated strings: it makes it possible to distinguish between the following two variants:
set LANGUAGE a b
set LANGUAGE 'a b'
This is especially important for the malformed case. Maybe we could use space-separation + quoting?
There was a problem hiding this comment.
space separating escaped strings is unambiguous for all possible input values,
but of course the fact that it's space-separation doesn't become obvious until there's at least two elements.
I think in this particular case, it's already obvious from the left-hand-side that it can be multiple values.
So as long as all of them are like this, shell syntax seems more appropriate.
If people want to parse it, they can even use read -lat tokens
Historically we haven't had a lot of need to output real machine-readable (JSON/TOML) data,
but one related thing that feels icky is the parsing of status build-info output in share/functions/__fish_posix_shell.fish (especially the spaces in the key name).
I'm sure we can find a more robust solution without needing full JSON.
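As a rough illustration of the space-separated, escaped format versus Debug output, consider the sketch below. `escape_simple` is a deliberately simplified, hypothetical stand-in and not fish's `crate::common::escape`; `join_escaped_strings` is the helper name floated above, implemented here only for illustration.

```rust
/// Hypothetical, simplified escaping: quote a string only when it contains
/// characters outside a conservative safe set (or when it is empty).
fn escape_simple(s: &str) -> String {
    let is_safe =
        |c: char| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.' | '@' | '/');
    if !s.is_empty() && s.chars().all(is_safe) {
        return s.to_owned();
    }
    // Inside single quotes, backslashes and single quotes are backslash-escaped.
    let mut out = String::from("'");
    for c in s.chars() {
        if c == '\\' || c == '\'' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push('\'');
    out
}

/// Join list elements with spaces after escaping each one individually.
fn join_escaped_strings(items: &[&str]) -> String {
    items
        .iter()
        .map(|s| escape_simple(s))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Under these assumptions, the two-element list `a b` and the single element `'a b'` stay distinguishable, while Debug formatting would instead produce Rust syntax such as `["a", "b"]`.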
There was a problem hiding this comment.
> space separating escaped strings is unambiguous for all possible input values
So a b vs a\ b for my examples? That should work, although it does have the problem that it's less obvious that it's a list, even if there are multiple elements, IMO, and I don't find it particularly aesthetically pleasing. Having a format that supports automated parsing is certainly nice, though.
Another option would be putting each entry on its own line. That would make it obvious that there are multiple entries if there is more than one line. A single line would still be ambiguous without quoting/escaping. Parsing individual lines would be easier, but automatically determining which lines to parse would be harder.
> status build-info
That's something where a machine-readable format would certainly help. If we can't find a format that's good enough for machine and human consumption we could also add something like a --json flag to the relevant commands.
There was a problem hiding this comment.
by default, string escape a\ b outputs 'a b' which looks a bit nicer I guess.
I don't have super strong opinions but we should be consistent across builtins/functions.
I haven't looked at many related things; functions for example uses comma-separation if stdout is a TTY,
and newline-separation if it isn't.
For status locale, being machine-readable is probably not a priority.
Separate lines would be okay if it looks better in the TTY (and we have enough vertical space).
We could make it automatically output JSON if stdout is not a TTY so people are heavily discouraged from parsing the human-readable output.. but JSON doesn't really sound like fish. Might be better to add more "get" subcommands that print one item per line (if we ever need them).
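The TTY-dependent behavior described for `functions` could be sketched as follows. This is an illustration of the idea only, not fish's implementation; `format_names` and `print_names` are hypothetical names.

```rust
use std::io::IsTerminal;

/// Comma-separated for interactive use, one item per line otherwise.
fn format_names(names: &[&str], stdout_is_tty: bool) -> String {
    let separator = if stdout_is_tty { ", " } else { "\n" };
    names.join(separator)
}

/// Pick the format based on whether stdout is attached to a terminal.
#[allow(dead_code)]
fn print_names(names: &[&str]) {
    let is_tty = std::io::stdout().is_terminal();
    println!("{}", format_names(names, is_tty));
}
```

Keeping the formatting separate from the terminal check makes both output shapes easy to test, and a one-item-per-line "get" subcommand would correspond to always taking the non-TTY branch.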
There was a problem hiding this comment.
> Might be better to add more "get" subcommands that print one item per line
That's probably the best approach. Maybe one of
status locale get language-precedence
status locale messages get language-precedence
status locale get messages language-precedence
Depending on whether we want to keep options open for supporting other locale categories. Also not sure if we want several levels of subcommands or compress them into a single level by hyphenating.
For the default, not-necessarily-machine-readable format, if we go with space-separated escaped strings, should we use src/common.rs:escape?
There was a problem hiding this comment.
> That's probably the best approach. Maybe one of
> status locale get language-precedence
> status locale messages get language-precedence
> status locale get messages language-precedence
I realized that status message-locale sounds a bit outdated, the modern term is probably
UI language, so we could call it status language?
The numeric thing could go into status number-format, if ever.
Then we would have status language precedence-list and status language list,
though I don't think we have a need for those, so I wouldn't add them until we do.
> Depending on whether we want to keep options open for supporting other locale categories. Also
> not sure if we want several levels of subcommands or compress them into a single level by
> hyphenating.
Both could work depending on how many subcommands there will be,
but it sounds like we don't need either yet.
Nested subcommands are new for us, so we'd want to make sure we don't break things
like completion and error messages. Maybe we can introduce a proper data structure for a subcommand.
In future, we should maybe switch to clap but I don't know how much work it would be to migrate to that without breaking relevant things.
> For the default, not-necessarily-machine-readable format, if we go with space-separated
> escaped strings, should we use src/common.rs:escape?
yes. So we'll need that for displaying SetLocaleLints,
and for having status language print the precedence list,
(and the list of all available languages? I think that would be fine to add for discoverability,
and it should be obvious that it's not really meant to be parsed)
There was a problem hiding this comment.
I made several changes in the version I just pushed:
- Lists of languages are now formatted as space-separated lists of shell-escaped strings as you suggested. I put a util function for this into `src/wutil/gettext.rs`, but maybe there is a more suitable place for it, or the functionality already exists somewhere else.
- The command is renamed to `language`. I decided to stick with the multiple levels of subcommands, because I think it makes sense to group this, especially because it allows showing a smaller, more relevant set of completions.
- Completions are added. These introduce some custom logic, which I think improves on the logic used for other `status` completions. We might want to extract these functions into global functions such that they are more widely available. The intended behavior is that once `status language` has been entered, completions only suggest the relevant subcommands, and if `status language set` has been entered, only the available languages are suggested. Ideally we would filter out languages which have already been specified, but that's not implemented in the current version.
- The `status language list-available` command is added. Primarily useful for completions.
- We could add another subcommand for listing the active language precedence in a machine-readable format, but I don't see much use for that now, so as you say, I don't think we should add it now.
If you're ok with the current interface, I'll write some docs for it.
src/wutil/gettext.rs
Outdated
| )), | ||
| } | ||
| localizable_consts!( | ||
| LANGUAGE_LIST_VARIABLE_ORIGIN "The language list is set based on the %s environment variable.\n" |
There was a problem hiding this comment.
so today this is only about message locale,
but it could be used for other locale-related features (today that's only LC_NUMERIC AFAIK)?
I suppose "status locale" is probably the appropriate name even if we don't end up adding more than messages.
There was a problem hiding this comment.
Yes, that's something I also thought about. For getting the status, the naming is not that important IMO, but for setting it we should decide now if we want to use this builtin for localization stuff other than the message locale. If so, maybe we should be more specific in the subcommand names now, using e.g. status locale set-messages (or override-messages as you suggested). Maybe also status locale get-messages in addition to status locale where the former would then show only message-related locale info, whereas the latter would show all locale information.
Force-pushed from f5ee93c to e36bf34.
in general, moving from variables to dedicated commands is a good idea.
Commands can print errors, their docs are easier to find and the serialization format is "shell commands" rather than a custom DSL.
"complete" has always worked that way, "abbr" does too since we added options.
With upcoming color variable changes (3e17b96) it would probably make sense to do it for those variables too.
Remove the format validation and fallback handling. Instead, only
check string equality and ignore items which don't have a catalog
with the exact same name.
This simplifies the implementation and is easy to understand for
users. The new approach also does not depend on our naming scheme
for catalogs (when using the builtin command).
so "status language-override de_DE" will fail but "status language-override de" will work?
I'm not sure I get it.
I agree that possibly changing the builtin command syntax later (even if breaking) is fine, especially since we can always add extra logic for full backwards compatibility if we want to.
Exactly. Since we can give proper feedback via warnings/error messages for commands, and we can add a way to list the available options, I think it makes sense to go with the simplest possible option, with no fallback logic. That should be easy to understand and use, even if my explanation in the comment might not have been clear. Adding completions for the command should also help.
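The exact-match rule without fallback can be sketched like this. The types and names (`Selection`, `select_languages`) are hypothetical and only illustrate the behavior described in the thread: arguments are accepted only if a catalog with exactly that name exists, and everything else is reported back so that a warning can be shown.

```rust
use std::collections::BTreeSet;

/// Outcome of interpreting the arguments of a hypothetical `set` operation.
#[derive(Debug, Default, PartialEq)]
struct Selection {
    /// Arguments with a catalog of exactly the same name, in the order given.
    accepted: Vec<String>,
    /// Arguments without a matching catalog; these would trigger warnings.
    unavailable: Vec<String>,
    /// Arguments that repeat an already accepted language.
    duplicates: Vec<String>,
}

fn select_languages(requested: &[&str], available: &BTreeSet<&str>) -> Selection {
    let mut selection = Selection::default();
    for &lang in requested {
        if selection.accepted.iter().any(|a| a == lang) {
            selection.duplicates.push(lang.to_owned());
        } else if available.contains(lang) {
            // Plain string equality: no stripping of suffixes, no `ll_CC` to `ll` fallback.
            selection.accepted.push(lang.to_owned());
        } else {
            selection.unavailable.push(lang.to_owned());
        }
    }
    selection
}
```

Assuming a `de` catalog exists but no `de_DE` catalog, requesting `de_DE` lands in `unavailable` while `de` is accepted, matching the behavior confirmed above.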
Force-pushed from e36bf34 to d00db3c.
Force-pushed from d00db3c to 135ed67.
Title changed: status locale builtin → status language builtin
|
|
||
| **unset**: | ||
| Undoes the effects of the **set** subcommand. | ||
| Language settings will be taken from environment variables again. |
There was a problem hiding this comment.
I wonder if a user would expect "unset" to exit with status 1 if there was no override.
I guess that might make sense, but I don't think it's important.
We have such smart failure returns in many builtins (set -e somevar, string join \n),
which can be very surprising when writing a nontrivial amount of fish script (fortunately few users actually need to do that)
There was a problem hiding this comment.
Hm, I'm not sure I'd like the unset command to exit non-zero if it doesn't actually fail. If we want to provide a way to check if a language override is active, we should provide a dedicated command for that which does not modify any state. Is there any use to knowing whether the language was set previously with no way of accessing the value it was set to? With a command like this, I think seeing a non-zero exit status could be quite confusing.
I'd also argue that the other commands you mention shouldn't exit non-zero, but changing that has the potential to break things.
| fn is_c_locale(locale: &str) -> bool { | ||
| locale.starts_with('C') | ||
| } | ||
| if is_c_locale(locale) { |
There was a problem hiding this comment.
Maybe inline this function?
BTW I always get confused how this avoids false positives,
until I remember that valid locale specifiers start with lowercase letters.
So we don't need to check that there's a word boundary after C.
Maybe worth a comment. Looks like there exists a POSIX locale, I wonder if we should support it.
// Locale specifiers start lowercase, only known exceptions are 'C' and 'POSIX'.
if locale.starts_with('C') {
There was a problem hiding this comment.
The reason I haven't inlined it is to provide some context why we check the first character, but I guess a comment could do the same and be even more explicit.
According to https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html#tag_07_02, the POSIX and C locales should be synonyms. I think supporting POSIX is a good idea, but I'm not sure if it's better to add support in this commit, or do it separately. From a quick regex search, I haven't found any other places which read locale variables and check if it's the C locale. If there are indeed no other instances of this, I think adding support for the POSIX locale in this commit would be fine.
There was a problem hiding this comment.
I added a comment to explain this, and added support for the POSIX locale. Still haven't found any other place in the code base which reads locale variables to check if the C locale is active, so I think it makes sense to include this change here.
src/wutil/gettext.rs
Outdated
| let localization_state = gettext_impl::status_language(); | ||
| let mut result = WString::new(); | ||
| localizable_consts!( | ||
| LANGUAGE_LIST_VARIABLE_ORIGIN "The language list is set based on the %s environment variable.\n" |
There was a problem hiding this comment.
technically the variable needn't be part of the environment, I was wondering if we should drop "environment".
Though in practice it will and should be 99% of the time, so it's probably better to leave this.
There was a problem hiding this comment.
Fair point. I don't have much of a preference here.
src/wutil/gettext.rs
Outdated
| } | ||
| }; | ||
| result.push_utfstr(&wgettext!( | ||
| "The language list is set to the following value:" |
There was a problem hiding this comment.
maybe a brief Language list: or Active languages: would sound better?
This would imply a change to the "origin" line, maybe even collapsing them into one line:
# if no variable is set
Active languages (default):
# if LC_ALL=C
Active languages (from $LC_ALL):
# etc.
Active languages (from $LC_ALL): de
Active languages (from $LC_MESSAGES): de
Active languages (from $LANG): de
Active languages (from $LANGUAGE): de fr
Active languages (from `status language set`): de fr
Maybe that's too concise but you get the idea.
The fact that English is implicit is weird, especially in the first two cases.
But that's unrelated to the decision on wording, and probably not surprising.
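A minimal sketch of the proposed one-line format follows. `Origin` and `active_languages_line` are hypothetical names chosen for illustration; the output strings mirror the examples in this comment, not necessarily the wording that was eventually merged.

```rust
/// Where the language precedence list came from (hypothetical type).
enum Origin {
    /// No relevant variable is set.
    Default,
    /// Derived from the named variable, e.g. "LANG" or "LANGUAGE".
    Variable(&'static str),
    /// Set explicitly via `status language set`.
    StatusLanguageSet,
}

/// Build the single summary line, appending languages separated by spaces.
fn active_languages_line(origin: &Origin, languages: &[&str]) -> String {
    let origin_text = match origin {
        Origin::Default => String::from("default"),
        Origin::Variable(name) => format!("from ${name}"),
        Origin::StatusLanguageSet => String::from("from `status language set`"),
    };
    let mut line = format!("Active languages ({origin_text}):");
    for lang in languages {
        line.push(' ');
        line.push_str(lang);
    }
    line
}
```

An empty list simply leaves nothing after the colon, which is how the "default" and `LC_ALL=C` cases above end up with implicit English.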
There was a problem hiding this comment.
Yes, I like having this more concise as well.
Regarding implicit English, I agree that it's somewhat strange. With gettext, we also have the issue that there is a difference between messages taken from the source vs taken from the en catalog. The difference is fairly minor, mainly some fancy quotes in the en catalog IIRC. When we switch to Fluent, we need to decide how to handle this. I think using the msgids for English would be fine, and using the msgstrs where we have them would also be fine if we are not worried about some characters not rendering correctly on certain systems. If we really want to have a separate English catalog, we could also introduce something like default.ftl in addition to en.ftl, but I prefer the other options.
Force-pushed from 4be3219 to 8431fe3.
Based on the discussion in fish-shell#11967

Introduce a `status language` builtin, which has subcommands for controlling and inspecting fish's message localization status.

The motivation for this is that using only the established environment variables `LANGUAGE`, `LC_ALL`, `LC_MESSAGES`, and `LANG` can cause problems when fish interprets them differently from GNU gettext. In addition, these are not well-suited for users who want to override their normal localization settings only for fish, since fish would propagate the values of these variables to its child processes.

Configuration via these variables still works as before, but now there is the `status language set` command, which allows overriding the localization configuration. If `status language set` is used, the language precedence list will be taken from its remaining arguments. Warnings will be shown for invalid arguments. Once this command was used, the localization related environment variables are ignored.

To go back to taking the configuration from the environment variables after `status language set` was executed, users can run `status language unset`.

Running `status language` without arguments shows information about the current message localization status, allowing users to better understand how their settings are interpreted by fish.

The `status language list-available` command shows which languages are available to choose from, which is used for completions.

This commit eliminates dependencies from the `gettext_impl` module to code in fish's main crate, allowing for extraction of this module into its own crate in a future commit.

Closes fish-shell#12106
Force-pushed from 8431fe3 to 607f469.
| // locale name. | ||
| // https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html#tag_07_02 | ||
| fn is_c_locale(locale: &str) -> bool { | ||
| locale.starts_with('C') || locale.starts_with("POSIX") |
There was a problem hiding this comment.
(this should technically have been in a separate commit. Same for the change to stop canonicalizing LANGUAGE entries)
There was a problem hiding this comment.
Fair point about the POSIX locale, but the LANGUAGE variable entries, as well as the locale variables are still being canonicalized (in the sense that we strip off suffixes for locale variables and fall back from ll_CC to ll). Only the values set via status language set require exact matches.
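The canonicalization described here (stripping suffixes and falling back from ll_CC to ll) might look roughly like the sketch below. `is_c_locale` follows the diff above; `catalog_candidates` is a hypothetical helper, and the exact rules fish applies may differ.

```rust
/// Locale specifiers start lowercase; the only known exceptions are "C" and
/// "POSIX", so checking the prefix is enough (this mirrors the diff above).
fn is_c_locale(locale: &str) -> bool {
    locale.starts_with('C') || locale.starts_with("POSIX")
}

/// Hypothetical helper: catalog names to try for a locale value, most
/// specific first. Returns an empty list for the C/POSIX locale.
fn catalog_candidates(locale: &str) -> Vec<String> {
    if locale.is_empty() || is_c_locale(locale) {
        return Vec::new();
    }
    // Strip codeset (".UTF-8") and modifier ("@euro") suffixes.
    let base = locale
        .split(|c: char| c == '.' || c == '@')
        .next()
        .unwrap_or("");
    let mut candidates = vec![base.to_owned()];
    // Fall back from `ll_CC` to `ll`.
    if let Some((language, _territory)) = base.split_once('_') {
        candidates.push(language.to_owned());
    }
    candidates
}
```

Values passed to `status language set` would bypass this step entirely, since they require exact matches.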
|
|
||
| status language unset | ||
| status language | ||
| # CHECK: Active languages (from variable LC_MESSAGES): |
There was a problem hiding this comment.
the "from" phrasing is maybe not perfect, maybe something like "source: " would work better.
I'll queue this now since this is not worse than my suggestion, and I guess we don't have further changes planned.
There was a problem hiding this comment.
Yes, I also prefer source: . I can push the relevant changes if you'd like.
There was a problem hiding this comment.
maybe like this (also hiding the command from translators)
diff --git a/src/localization/mod.rs b/src/localization/mod.rs
index 814f600e9c..eac6ee206c 100644
--- a/src/localization/mod.rs
+++ b/src/localization/mod.rs
@@ -142,7 +142,7 @@
let localization_state = fish_gettext::status_language();
let mut result = WString::new();
localizable_consts!(
- LANGUAGE_LIST_VARIABLE_ORIGIN "from variable %s"
+ LANGUAGE_LIST_VARIABLE_ORIGIN "%s variable"
);
let origin_string = match localization_state.precedence_origin {
LanguagePrecedenceOrigin::Default => wgettext!("default").to_owned(),
@@ -153,10 +153,13 @@
wgettext_fmt!(LANGUAGE_LIST_VARIABLE_ORIGIN, "LANGUAGE")
}
LanguagePrecedenceOrigin::StatusLanguage => {
- wgettext!("from command `status language set`").to_owned()
+ wgettext_fmt!("%s command", "`status language set`")
}
};
- result.push_utfstr(&wgettext_fmt!("Active languages (%s):", origin_string));
+ result.push_utfstr(&wgettext_fmt!(
+ "Active languages (source: %s):",
+ origin_string
+ ));
append_space_separated_list(&mut result, &localization_state.language_precedence);
result.push('\n');
There was a problem hiding this comment.
Yes, that seems better. Are you going to apply the patch yourself?