Reimplement idna on top of ICU4X #923
Merged
Changes from 25 commits
40 commits
a8977a8 Reimplement idna on top of ICU4X (hsivonen)
09765af Add an even faster lower-case ASCII letter path to avoid regressing p… (hsivonen)
7e929ce Comments and verify_dns_length tweak (hsivonen)
f413387 Parametrize internal vs. external Punycode caller; restore external A… (hsivonen)
71c03b9 Add bench for to_ascii on an already-Punycode name (hsivonen)
9af00cb Avoid re-encoding Punycode when possible (hsivonen)
dc8f301 Pass through the input slice in many more cases (hsivonen)
41e0192 Add testing for the simultaneous mode (hsivonen)
41f2107 Omit the invalid domain character check on the url side (hsivonen)
4d7d41a Document that Punycode labels must result in non-ASCII (hsivonen)
98ca752 Rename files called uts46.rs to deprecated.rs (hsivonen)
4bbabe9 Rename uts46bis to uts46 (hsivonen)
7dc0082 Tweak docs (hsivonen)
f8eb96e Avoid useless copying and useless UTF-8 decode (hsivonen)
eb6e3d5 Use inline(never) to optimize binary size (hsivonen)
ce3d4d1 Split CheckHyphens into a separate concern form the ASCII deny list (hsivonen)
6672161 Make the ASCII deny list customizable (hsivonen)
90fe4b3 Better docs and top-level functions (hsivonen)
50381ff Parameter for VerifyDNSLength (hsivonen)
8268c5a Restore support for transitional processing to minimize breakage (hsivonen)
999bef4 In the deprecated API, use empty deny list with use_std3_ascii_rules=… (hsivonen)
b277c85 Tweak docs (hsivonen)
980348c Docs, rename AsciiDenyList::WHATWG to ::URL, tweak top-level functions (hsivonen)
4efd589 Use idna crate top-level function in the url crate to dogfood the top… (hsivonen)
da6cf50 Add an Usage section to the README (hsivonen)
d938024 Add an early return to map_transitional for readability (hsivonen)
679edb9 Document internal vs. external Punycode caller differences (hsivonen)
4f605c9 Per discussion with Valentin, revert deprecated API to the old behavi… (hsivonen)
bbf4308 Add comments about not fixing deprecated API (hsivonen)
e842dae Merge branch 'main' into icu4x (hsivonen)
6690c49 Add a comment explaining FailFast in deprecated.rs (hsivonen)
38cedad For future-proofing, add compiled_data cargo feature (currently alway… (hsivonen)
52137e7 Remove remark about spec violation by making root dot permissibility … (hsivonen)
081f44b Clarify README about IDNA 2003/2008 (hsivonen)
aaa7a40 Add a historical remark to the README (hsivonen)
8b03034 Fix typo (hsivonen)
c8a4bd3 Depend on crates.io versions of icu_normalizer and icu_properties (hsivonen)
be3db8e Address clippy lints (hsivonen)
6020673 Update versions (hsivonen)
245c514 Increment dependency versions (hsivonen)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# `idna` | ||
|
||
IDNA library for Rust implementing [UTS 46: Unicode IDNA Compatibility Processing](https://www.unicode.org/reports/tr46/) as parametrized by the [WHATWG URL Standard](https://url.spec.whatwg.org/#idna). | ||
|
||
## What it does | ||
|
||
* An implementation of UTS 46 is provided, with a configurable ASCII deny list (e.g. STD3 or WHATWG rules). | ||
* A callback mechanism is provided for pluggable logic for deciding if a label is deemed potentially too misleading to render as Unicode in a user interface. | ||
* Errors are marked as U+FFFD REPLACEMENT CHARACTERs in Unicode output so that locations of errors may be illustrated to the user. | ||
|
||
## What it does not do | ||
|
||
* There is no default/sample policy provided for the callback mechanism mentioned above. | ||
* Earlier variants of IDNA (2003, 2008) are not implemented—only UTS 46. | ||
* There is no API for categorizing errors beyond there being an error. | ||
* Checks that are configurable in UTS 46 but that the WHATWG URL Standard always sets a particular way (regardless of the _beStrict_ flag in the URL Standard) cannot be configured (with the exception of the old deprecated API supporting transitional processing). | ||
|
||
## Usage | ||
|
||
Apps that need to prepare a hostname for usage in protocols are likely to only need the top-level function `domain_to_ascii_cow` with `AsciiDenyList::URL` as the second argument. Note that this rejects IPv6 addresses, so before this, you need to check if the first byte of the input is `b'['` and, if it is, treat the input as an IPv6 address instead. | ||
|
||
Apps that need to display host names to the user should use `uts46::Uts46::to_user_interface`. The _ToUnicode_ operation is rarely appropriate for direct application usage. | ||
|
||
## Known spec violations | ||
|
||
* The `verify_dns_length` behavior that this crate implements allows a trailing dot in the input as required by the UTS 46 test suite despite the UTS 46 spec saying that this isn't allowed. | ||
|
||
## Breaking changes since 0.5.0 | ||
|
||
* IDNA 2008 rules are no longer supported. Attempting to enable them panics immediately. | ||
* `check_hyphens` now also rejects the hyphen in the third and fourth position in a label. | ||
* `domain_to_ascii_strict` now performs the _CheckHyphens_ check (matching previous documentation). | ||
* The ContextJ rules are now implemented and always enabled, so input that fails those rules is rejected. | ||
* The `Idna::to_ascii_inner` method has been removed. It didn't make sense as a public method, since callers were unable to figure out if there were errors. (A GitHub search found no callers for this method.) | ||
* Punycode labels whose decoding does not yield any non-ASCII characters are now treated as being in error. |
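
The Usage section of the README above says to check for a leading `b'['` before calling `domain_to_ascii_cow`, because that function rejects IPv6 addresses. A minimal sketch of that dispatch could look like the following. `HostKind` and `classify_host` are hypothetical names introduced only for illustration and are not part of the `idna` crate; the call mentioned in the comment assumes that `domain_to_ascii_cow` takes the input as a byte slice.

```rust
/// Hypothetical classification of a host string, decided before any
/// IDNA processing happens.
#[derive(Debug, PartialEq)]
enum HostKind<'a> {
    /// The input starts with `[`; hand it to an IPv6 parser instead.
    Ipv6Literal(&'a [u8]),
    /// Anything else; a candidate for IDNA processing, e.g. (assumed
    /// call shape) `idna::domain_to_ascii_cow(bytes, AsciiDenyList::URL)`.
    Domain(&'a [u8]),
}

/// Looks only at the first byte, as the README suggests.
/// An empty input is classified as `Domain` and left for the
/// domain-processing step to accept or reject.
fn classify_host(input: &[u8]) -> HostKind<'_> {
    if input.first() == Some(&b'[') {
        HostKind::Ipv6Literal(input)
    } else {
        HostKind::Domain(input)
    }
}
```

A caller would match on the result and only pass the `Domain` bytes on to the IDNA step, so that bracketed literals never reach it.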
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,246 @@ | ||
// Copyright 2013-2014 The rust-url developers. | ||
// | ||
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or | ||
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license | ||
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your | ||
// option. This file may not be copied, modified, or distributed | ||
// except according to those terms. | ||
|
||
//! Deprecated API for [*Unicode IDNA Compatibility Processing* | ||
//! (Unicode Technical Standard #46)](http://www.unicode.org/reports/tr46/) | ||
#![allow(deprecated)] | ||
|
||
use alloc::borrow::Cow; | ||
use alloc::string::String; | ||
|
||
use crate::uts46::*; | ||
use crate::Errors; | ||
|
||
/// Performs preprocessing equivalent to UTS 46 transitional processing | ||
/// if `transitional` is `true`. If `transitional` is `false`, merely | ||
/// lets the input pass through as-is. | ||
/// | ||
/// The output of this function is to be passed to [`Uts46::process`]. | ||
/// | ||
/// Deprecated, since this functionality is deprecated in UTS 46 itself, | ||
/// and none of Firefox, Safari, or Chrome use transitional processing. | ||
#[deprecated] | ||
fn map_transitional(domain: &str, transitional: bool) -> Cow<'_, str> { | ||
if transitional { | ||
let mut chars = domain.chars(); | ||
loop { | ||
let prev = chars.clone(); | ||
if let Some(c) = chars.next() { | ||
match c { | ||
'ß' | 'ẞ' | 'ς' | '\u{200C}' | '\u{200D}' => { | ||
let mut s = String::with_capacity(domain.len()); | ||
let tail = prev.as_str(); | ||
let head = &domain[..domain.len() - tail.len()]; | ||
s.push_str(head); | ||
for c in tail.chars() { | ||
match c { | ||
'ß' | 'ẞ' => { | ||
s.push_str("ss"); | ||
} | ||
'ς' => { | ||
s.push('σ'); | ||
} | ||
'\u{200C}' | '\u{200D}' => {} | ||
_ => { | ||
s.push(c); | ||
} | ||
} | ||
} | ||
return Cow::Owned(s); | ||
} | ||
_ => {} | ||
} | ||
} else { | ||
break; | ||
} | ||
} | ||
} | ||
Cow::Borrowed(domain) | ||
} | ||
|
||
/// Deprecated. Use the crate-top-level functions or [`Uts46`]. | ||
#[derive(Default)] | ||
#[deprecated] | ||
pub struct Idna { | ||
config: Config, | ||
} | ||
|
||
impl Idna { | ||
pub fn new(config: Config) -> Self { | ||
Self { config } | ||
} | ||
|
||
/// [UTS 46 ToASCII](http://www.unicode.org/reports/tr46/#ToASCII) | ||
#[allow(clippy::wrong_self_convention)] | ||
pub fn to_ascii(&mut self, domain: &str, out: &mut String) -> Result<(), Errors> { | ||
let mapped = map_transitional(domain, self.config.transitional_processing); | ||
match Uts46::new().process( | ||
mapped.as_bytes(), | ||
self.config.deny_list(), | ||
self.config.hyphens(), | ||
ErrorPolicy::FailFast, | ||
|_, _, _| false, | ||
out, | ||
None, | ||
) { | ||
Ok(ProcessingSuccess::Passthrough) => { | ||
if self.config.verify_dns_length && !verify_dns_length(&mapped) { | ||
return Err(crate::Errors::default()); | ||
} | ||
out.push_str(&mapped); | ||
Ok(()) | ||
} | ||
Ok(ProcessingSuccess::WroteToSink) => { | ||
if self.config.verify_dns_length && !verify_dns_length(out) { | ||
return Err(crate::Errors::default()); | ||
} | ||
Ok(()) | ||
} | ||
Err(ProcessingError::ValidityError) => Err(crate::Errors::default()), | ||
Err(ProcessingError::SinkError) => unreachable!(), | ||
} | ||
} | ||
|
||
/// [UTS 46 ToUnicode](http://www.unicode.org/reports/tr46/#ToUnicode) | ||
#[allow(clippy::wrong_self_convention)] | ||
pub fn to_unicode(&mut self, domain: &str, out: &mut String) -> Result<(), Errors> { | ||
let mapped = map_transitional(domain, self.config.transitional_processing); | ||
match Uts46::new().process( | ||
mapped.as_bytes(), | ||
self.config.deny_list(), | ||
self.config.hyphens(), | ||
ErrorPolicy::MarkErrors, | ||
|_, _, _| true, | ||
out, | ||
None, | ||
) { | ||
Ok(ProcessingSuccess::Passthrough) => { | ||
out.push_str(&mapped); | ||
Ok(()) | ||
} | ||
Ok(ProcessingSuccess::WroteToSink) => Ok(()), | ||
Err(ProcessingError::ValidityError) => Err(crate::Errors::default()), | ||
Err(ProcessingError::SinkError) => unreachable!(), | ||
} | ||
} | ||
} | ||
|
||
/// Deprecated configuration API. | ||
#[derive(Clone, Copy)] | ||
#[must_use] | ||
#[deprecated] | ||
pub struct Config { | ||
use_std3_ascii_rules: bool, | ||
transitional_processing: bool, | ||
verify_dns_length: bool, | ||
check_hyphens: bool, | ||
} | ||
|
||
/// The defaults are those of _beStrict=false_ in the [WHATWG URL Standard](https://url.spec.whatwg.org/#idna) | ||
impl Default for Config { | ||
fn default() -> Self { | ||
Config { | ||
use_std3_ascii_rules: false, | ||
transitional_processing: false, | ||
check_hyphens: false, | ||
// Only used by to_ascii, not to_unicode | ||
verify_dns_length: false, | ||
} | ||
} | ||
} | ||
|
||
impl Config { | ||
/// Whether to enforce the STD3 ASCII deny list. | ||
/// | ||
/// `true` for STD3, `false` for no deny list. | ||
/// | ||
/// Note that `true` rejects pseudo-hosts used by various TXT record-based protocols. | ||
#[inline] | ||
pub fn use_std3_ascii_rules(mut self, value: bool) -> Self { | ||
self.use_std3_ascii_rules = value; | ||
self | ||
} | ||
|
||
/// Whether to enable (deprecated) transitional processing. | ||
/// | ||
/// Note that Firefox, Safari, and Chrome do not use transitional | ||
/// processing. | ||
#[inline] | ||
#[allow(unused_mut)] | ||
pub fn transitional_processing(mut self, value: bool) -> Self { | ||
self.transitional_processing = value; | ||
self | ||
} | ||
|
||
/// Whether the _VerifyDNSLength_ operation should be performed | ||
/// by `to_ascii`. | ||
#[inline] | ||
pub fn verify_dns_length(mut self, value: bool) -> Self { | ||
self.verify_dns_length = value; | ||
self | ||
} | ||
|
||
/// Whether to enforce IETF rules for hyphen placement. | ||
/// | ||
/// `true` to deny hyphens in the first, last, third, and fourth | ||
/// position of a label. `false` to not enforce. | ||
/// | ||
/// Note that `true` rejects real-world names, including YouTube CDN nodes | ||
/// and some GitHub user pages. | ||
#[inline] | ||
pub fn check_hyphens(mut self, value: bool) -> Self { | ||
self.check_hyphens = value; | ||
self | ||
} | ||
|
||
/// Obsolete method retained to ease migration. The argument must be `false`. | ||
/// | ||
/// # Panics | ||
/// | ||
/// If the argument is `true`. | ||
#[inline] | ||
#[allow(unused_mut)] | ||
pub fn use_idna_2008_rules(mut self, value: bool) -> Self { | ||
assert!(!value, "IDNA 2008 rules are no longer supported"); | ||
self | ||
} | ||
|
||
/// Compute the deny list | ||
fn deny_list(&self) -> AsciiDenyList { | ||
if self.use_std3_ascii_rules { | ||
AsciiDenyList::STD3 | ||
} else { | ||
AsciiDenyList::EMPTY | ||
} | ||
} | ||
|
||
/// Compute the hyphen mode | ||
fn hyphens(&self) -> Hyphens { | ||
if self.check_hyphens { | ||
Hyphens::Check | ||
} else { | ||
Hyphens::Allow | ||
} | ||
} | ||
|
||
/// [UTS 46 ToASCII](http://www.unicode.org/reports/tr46/#ToASCII) | ||
pub fn to_ascii(self, domain: &str) -> Result<String, Errors> { | ||
let mut result = String::with_capacity(domain.len()); | ||
let mut codec = Idna::new(self); | ||
codec.to_ascii(domain, &mut result).map(|()| result) | ||
} | ||
|
||
/// [UTS 46 ToUnicode](http://www.unicode.org/reports/tr46/#ToUnicode) | ||
pub fn to_unicode(self, domain: &str) -> (String, Result<(), Errors>) { | ||
let mut codec = Idna::new(self); | ||
let mut out = String::with_capacity(domain.len()); | ||
let result = codec.to_unicode(domain, &mut out); | ||
(out, result) | ||
} | ||
} |
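
The deprecated `Config` above toggles two label-level checks, `check_hyphens` and `verify_dns_length`, whose implementations live elsewhere in the crate and are not shown in this diff. As a rough illustration only, the rules could be sketched as below. `hyphens_ok` and `dns_length_ok` are hypothetical helpers, not the crate's functions. The limits of 63 bytes per label and 253 bytes per name are taken from the _VerifyDnsLength_ step in UTS 46, the tolerated trailing dot follows the README's "Known spec violations" note, and the hyphen rule assumes the UTS 46 wording that a label fails when both its third and fourth characters are hyphens.

```rust
/// Sketch of _CheckHyphens_ for a single, already-decoded label:
/// no hyphen in the first or last position, and not hyphens in both
/// the third and fourth positions (counted in characters).
fn hyphens_ok(label: &str) -> bool {
    if label.starts_with('-') || label.ends_with('-') {
        return false;
    }
    let mut rest = label.chars().skip(2);
    // Short-circuits: the fourth character is only inspected when
    // the third one is a hyphen.
    !(rest.next() == Some('-') && rest.next() == Some('-'))
}

/// Sketch of _VerifyDnsLength_ for an ASCII (post-ToASCII) name.
/// One trailing dot is tolerated and not counted.
fn dns_length_ok(domain: &str) -> bool {
    let name = domain.strip_suffix('.').unwrap_or(domain);
    if name.is_empty() || name.len() > 253 {
        return false;
    }
    // Every label must be 1 to 63 bytes long.
    name.split('.').all(|label| (1..=63).contains(&label.len()))
}
```

Both helpers return `true` when the input passes, so a caller in the style of `Idna::to_ascii` above would turn a `false` result into an error.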