Call for Testing: Unicode email addresses.

Last month a call for input was sent concerning the introduction of Unicode email addresses for WordPress accounts (#31992). Initial support was merged in [62482]. Here is what you need to know in order to test this change on your sites and in your plugins and themes.

  • is_email() and sanitize_email() accept non-ASCII email addresses like grå@grå.org if the site database’s charset is utf8mb4.
  • Support is added as an enhancement, which can be disabled by removing the wp_is_unicode_email and wp_sanitize_unicode_email callbacks from the is_email and sanitize_email filters, respectively.
  • A new class, WP_Email_Address, provides a structural view into email addresses so your code doesn’t have to guess. It exposes the local part and the domain part, and it decodes Punycode in the domain part.

It should be possible, therefore, to create WordPress accounts with email addresses not previously allowed. In addition, email validation is updated to match the WHATWG email specification so that WordPress and an <input type=email> element will agree on what is and what isn’t allowable.
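
For reference, the HTML specification publishes a regular expression for what an <input type=email> element treats as a valid address. The sketch below applies that published pattern in plain PHP. The helper name sketch_matches_whatwg_ascii_pattern() is hypothetical, the code is not WordPress’ is_email(), and it does not attempt to reproduce how [62482] handles non-ASCII input; the pattern itself only covers ASCII.

```php
<?php
// Hypothetical helper for illustration; this is NOT WordPress' is_email().
// It applies the ASCII pattern published in the HTML specification for
// "valid e-mail address". The "D" modifier stops "$" from matching before
// a trailing newline, which mirrors the JavaScript behavior of the pattern.
function sketch_matches_whatwg_ascii_pattern( string $address ): bool {
    $pattern = '/^[a-zA-Z0-9.!#$%&\'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/D';

    return 1 === preg_match( $pattern, $address );
}

// Example addresses are made up for this sketch.
$results = array(
    'user@example.org'  => sketch_matches_whatwg_ascii_pattern( 'user@example.org' ),
    'user@localhost'    => sketch_matches_whatwg_ascii_pattern( 'user@localhost' ),
    'user@@example.org' => sketch_matches_whatwg_ascii_pattern( 'user@@example.org' ),
    'user@-example.org' => sketch_matches_whatwg_ascii_pattern( 'user@-example.org' ),
);
```

Note that the pattern does not require a dot in the domain, so an address such as user@localhost matches. That is one of the cases worth testing against your own validation code.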

The term “Unicode email address” may be a bit ambiguous because there are two ways emails can be considered Unicode:

  • Unicode domains have been supported for many years through Punycode encoding. This is an ASCII-encoded form of a Unicode domain in which encoded labels start with xn--, like xn--uist2j67d64zv30b.xn--ses554g as a stand-in for 慕田峪长城.网址. Because the encoding is all ASCII, WordPress has implicitly supported Unicode domains without recognizing them. The change in [62482] decodes the domain parts so that WordPress and its plugins and themes can access either the ASCII representation (for circumstances like HTML attributes, where software will read the value) or the Unicode representation (for circumstances like text nodes, where humans will read the value).
  • Unicode local part (mailbox) support was largely absent from specifications and software until recently, when most major email hosts started routing mail with UTF-8 mailboxes. WordPress previously rejected all addresses containing non-ASCII characters. It now accepts valid UTF-8 local parts. There has never been an ASCII encoding of this part of the email address.
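
To make the two parts concrete, here is a minimal sketch in plain PHP that splits an address and reports what kind of bytes each side contains. It is independent of WordPress: sketch_describe_email() is a hypothetical helper, splitting at the last "@" is a simplification chosen for this sketch, and it only detects Punycode labels rather than decoding them.

```php
<?php
// Hypothetical helper for illustration only; it is not part of WordPress
// and does not validate the address.
function sketch_describe_email( string $address ): ?array {
    // Treat the last "@" as the separator between local part and domain.
    $at = strrpos( $address, '@' );
    if ( false === $at || 0 === $at || strlen( $address ) - 1 === $at ) {
        return null; // No "@", or an empty local part or domain.
    }

    $local  = substr( $address, 0, $at );
    $domain = substr( $address, $at + 1 );

    // Punycode-encoded labels are plain ASCII and start with "xn--".
    $punycode_labels = array();
    foreach ( explode( '.', $domain ) as $label ) {
        if ( 0 === strncasecmp( $label, 'xn--', 4 ) ) {
            $punycode_labels[] = $label;
        }
    }

    return array(
        'local'           => $local,
        'domain'          => $domain,
        // True when the local part has no bytes outside the ASCII range.
        'local_is_ascii'  => 1 !== preg_match( '/[\x80-\xFF]/', $local ),
        // An empty pattern with the "u" flag only matches valid UTF-8.
        'local_is_utf8'   => 1 === preg_match( '//u', $local ),
        'punycode_labels' => $punycode_labels,
    );
}

$parts = sketch_describe_email( "gr\u{E5}@xn--uist2j67d64zv30b.xn--ses554g" );
```

For this input, the local part is valid UTF-8 but not ASCII, while the domain is entirely ASCII and both of its labels are Punycode-encoded.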

If your extension code expects email addresses to contain only ASCII bytes, it will need updating for WordPress’ new Unicode email support. The easiest way to account for this is to call the new WP_Email_Address::from_string() and then use the getter methods on the object it returns.

// Generate an author link.

$email = WP_Email_Address::from_string( $provided_email );
if ( null === $email ) {
    return '';
}

$processor = new WP_HTML_Tag_Processor( '<a> </a>' );
$processor->next_tag();
$processor->set_attribute( 'href', "mailto:{$email->get_ascii_address()}" );
$processor->next_token();
$processor->set_modifiable_text( $email->get_unicode_address() );
return $processor->get_updated_html();

If your plugin connects with a third-party service using email addresses from WordPress, now is a good time to ensure that the third party also properly supports Unicode email addresses. If not, you can disable Unicode email support with the following snippet.

// Disable Unicode email support until the third-party integration supports it.

remove_filter( 'is_email', 'wp_is_unicode_email', 10 );
remove_filter( 'sanitize_email', 'wp_sanitize_unicode_email', 10 );
add_filter( 'is_email', 'wp_is_ascii_email', 10, 3 );
add_filter( 'sanitize_email', 'wp_sanitize_ascii_email', 10, 3 );

Thank you!

This change updates existing email validation and sanitization code and introduces new behaviors for an unbounded set of potential email addresses. It’s likely that unanticipated cases will arise, and with your feedback in these cases, this feature can be a successful part of WordPress 7.1.

Props

Thanks to @amykamala and @jorbin for reviewing this post!

#call-for-testing, #email, #unicode

Extending Unicode support in email addresses.

Eleven years ago, in Core-31992, someone proposed supporting non-US-ASCII email addresses in WordPress. The software world has changed considerably since then: internationalized domain names and paths are uniformly handled in browsers, email systems support the wide range of Unicode characters as raw UTF-8, and UTF-8 is the only recommended text encoding for interchange between systems. This means that people are free to use their own names when communicating with others, whether they are Jake, Klára, আরিয়া, അമൽ, or any other name containing letters outside the A-Z range. Unfortunately, WordPress has not kept up with these changes, and that’s what this post is all about.

This post is a request for comment on adding that support. There are a number of complications with potentially far-reaching implications.

TL;DR

  • WordPress’ email sanitization is based on US-ASCII characters and needs to be relaxed to allow for valid UTF-8, but this introduces new risks, including but not limited to: confusable characters, equivalence through normalization, and non-visible characters.
  • Sites whose databases cannot store full UTF-8 may fail to save valid email addresses. This could be confusing to the site owner and to people attempting to sign up on the site unless properly communicated.
  • Any additional code that assumes emails are encoded as single-byte US-ASCII will need updating, because until now it was an invariant that email addresses would not contain multi-byte Unicode characters. Filters may start seeing characters they believed were impossible to receive.
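
The ASCII assumption and the normalization risk can both be seen in a few lines of plain PHP. The sketch below is only an illustration: the two addresses and the helper sketch_count_code_points() are made up for this example, and nothing here reflects how WordPress itself compares or stores addresses.

```php
<?php
// Two spellings that render as the same visible address. Both values are
// assumptions invented for this illustration.
$precomposed = "gr\u{00E5}@example.org";  // "å" as a single code point (U+00E5).
$decomposed  = "gra\u{030A}@example.org"; // "a" plus combining ring above (U+030A).

// Hypothetical helper: counts Unicode code points using only PCRE,
// and returns null when the input is not valid UTF-8.
function sketch_count_code_points( string $text ): ?int {
    $count = preg_match_all( '/./us', $text );

    return false === $count ? null : $count;
}

$report = array(
    // strlen() counts bytes, so it no longer equals the character count.
    'bytes_precomposed'       => strlen( $precomposed ),
    'code_points_precomposed' => sketch_count_code_points( $precomposed ),
    'bytes_decomposed'        => strlen( $decomposed ),
    'code_points_decomposed'  => sketch_count_code_points( $decomposed ),
    // The two spellings are different byte sequences.
    'byte_for_byte_equal'     => $precomposed === $decomposed,
);
```

A length limit, a strlen() check, or a === comparison written with ASCII addresses in mind may behave differently for each of these two values, even though a person reading them would see the same address.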

If you have experience with email issues, deploy email services, or know about certain critical aspects of this proposal, please share your thoughts here or in Core-31992.


#charset, #email, #unicode