Merged
Changes from 25 commits
40 commits
a8977a8
Reimplement idna on top of ICU4X
hsivonen Feb 14, 2024
09765af
Add an even faster lower-case ASCII letter path to avoid regressing p…
hsivonen Mar 20, 2024
7e929ce
Comments and verify_dns_length tweak
hsivonen Mar 21, 2024
f413387
Parametrize internal vs. external Punycode caller; restore external A…
hsivonen Mar 21, 2024
71c03b9
Add bench for to_ascii on an already-Punycode name
hsivonen Mar 21, 2024
9af00cb
Avoid re-encoding Punycode when possible
hsivonen Mar 21, 2024
dc8f301
Pass through the input slice in many more cases
hsivonen Mar 21, 2024
41e0192
Add testing for the simultaneous mode
hsivonen Mar 21, 2024
41f2107
Omit the invalid domain character check on the url side
hsivonen Mar 21, 2024
4d7d41a
Document that Punycode labels must result in non-ASCII
hsivonen Mar 21, 2024
98ca752
Rename files called uts46.rs to deprecated.rs
hsivonen Mar 21, 2024
4bbabe9
Rename uts46bis to uts46
hsivonen Mar 21, 2024
7dc0082
Tweak docs
hsivonen Mar 21, 2024
f8eb96e
Avoid useless copying and useless UTF-8 decode
hsivonen Apr 11, 2024
eb6e3d5
Use inline(never) to optimize binary size
hsivonen Apr 15, 2024
ce3d4d1
Split CheckHyphens into a separate concern form the ASCII deny list
hsivonen Apr 16, 2024
6672161
Make the ASCII deny list customizable
hsivonen Apr 18, 2024
90fe4b3
Better docs and top-level functions
hsivonen Apr 18, 2024
50381ff
Parameter for VerifyDNSLength
hsivonen Apr 18, 2024
8268c5a
Restore support for transitional processing to minimize breakage
hsivonen Apr 18, 2024
999bef4
In the deprecated API, use empty deny list with use_std3_ascii_rules=…
hsivonen Apr 18, 2024
b277c85
Tweak docs
hsivonen Apr 18, 2024
980348c
Docs, rename AsciiDenyList::WHATWG to ::URL, tweak top-level functions
hsivonen Apr 19, 2024
4efd589
Use idna crate top-level function in the url crate to dogfood the top…
hsivonen Apr 22, 2024
da6cf50
Add an Usage section to the README
hsivonen Apr 24, 2024
d938024
Add an early return to map_transitional for readability
hsivonen Apr 26, 2024
679edb9
Document internal vs. external Punycode caller differences
hsivonen Apr 26, 2024
4f605c9
Per discussion with Valentin, revert deprecated API to the old behavi…
hsivonen May 3, 2024
bbf4308
Add comments about not fixing deprecated API
hsivonen May 3, 2024
e842dae
Merge branch 'main' into icu4x
hsivonen May 3, 2024
6690c49
Add a comment explaining FailFast in deprecated.rs
hsivonen May 3, 2024
38cedad
For future-proofing, add compiled_data cargo feature (currently alway…
hsivonen May 3, 2024
52137e7
Remove remark about spec violation by making root dot permissibility …
hsivonen May 20, 2024
081f44b
Clarify README about IDNA 2003/2008
hsivonen May 20, 2024
aaa7a40
Add a historical remark to the README
hsivonen May 20, 2024
8b03034
Fix typo
hsivonen May 20, 2024
c8a4bd3
Depend on crates.io versions of icu_normalizer and icu_properties
hsivonen May 23, 2024
be3db8e
Address clippy lints
hsivonen May 23, 2024
6020673
Update versions
hsivonen May 23, 2024
245c514
Increment dependency versions
hsivonen May 29, 2024
11 changes: 8 additions & 3 deletions idna/Cargo.toml
@@ -15,7 +15,7 @@ doctest = false

 [features]
 default = ["std"]
-std = ["alloc", "unicode-bidi/std", "unicode-normalization/std"]
+std = ["alloc"]
 alloc = []

 [[test]]
@@ -25,15 +25,20 @@ harness = false
 [[test]]
 name = "unit"

+[[test]]
+name = "unitbis"
+
 [dev-dependencies]
 assert_matches = "1.3"
 bencher = "0.1"
 tester = "0.9"
 serde_json = "1.0"

 [dependencies]
-unicode-bidi = { version = "0.3.10", default-features = false, features = ["hardcoded-data"] }
-unicode-normalization = { version = "0.1.22", default-features = false }
+icu_normalizer = { path = "../../icu4x/components/normalizer", features = ["compiled_data"] }
+icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] }
+utf8_iter = "1.0.4"
+smallvec = { version = "1.13.1", features = ["const_generics"]}

 [[bench]]
 name = "all"
35 changes: 35 additions & 0 deletions idna/README.md
@@ -0,0 +1,35 @@
# `idna`

IDNA library for Rust implementing [UTS 46: Unicode IDNA Compatibility Processing](https://www.unicode.org/reports/tr46/) as parametrized by the [WHATWG URL Standard](https://url.spec.whatwg.org/#idna).

## What it does

* An implementation of UTS 46 is provided, with configurable ASCII deny list (e.g. STD3 or WHATWG rules).
* A callback mechanism is provided for pluggable logic for deciding if a label is deemed potentially too misleading to render as Unicode in a user interface.
* Errors are marked as U+FFFD REPLACEMENT CHARACTERs in Unicode output so that locations of errors may be illustrated to the user.

## What it does not do

* There is no default/sample policy provided for the callback mechanism mentioned above.
* Earlier variants of IDNA (2003, 2008) are not implemented—only UTS 46.
* There is no API for categorizing errors beyond there being an error.
* Checks that are configurable in UTS 46 but that the WHATWG URL Standard always sets a particular way (regardless of the _beStrict_ flag in the URL Standard) cannot be configured (with the exception of the old deprecated API supporting transitional processing).

## Usage

Apps that need to prepare a hostname for usage in protocols are likely to only need the top-level function `domain_to_ascii_cow` with `AsciiDenyList::URL` as the second argument. Note that this rejects IPv6 addresses, so before this, you need to check if the first byte of the input is `b'['` and, if it is, treat the input as an IPv6 address instead.

Apps that need to display host names to the user should use `uts46::Uts46::to_user_interface`. The _ToUnicode_ operation is rarely appropriate for direct application usage.
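
The IPv6 pre-check described above can be sketched as follows. This is a minimal illustration, not part of the crate: `HostInput` and `classify_host` are hypothetical names, and the domain branch is where a caller would go on to use `domain_to_ascii_cow` with `AsciiDenyList::URL` (that call is omitted because its exact signature is not shown here).

```rust
/// Hypothetical classification of raw host input (not an `idna` type).
#[derive(Debug, PartialEq, Eq)]
enum HostInput<'a> {
    /// Input starting with `[`; to be parsed as an IPv6 literal elsewhere.
    Ipv6Literal(&'a [u8]),
    /// Anything else; a candidate for IDNA processing.
    DomainCandidate(&'a [u8]),
}

/// Checks only the first byte, as the usage note suggests.
/// It does not validate the IPv6 address itself.
fn classify_host(input: &[u8]) -> HostInput<'_> {
    if input.first() == Some(&b'[') {
        HostInput::Ipv6Literal(input)
    } else {
        HostInput::DomainCandidate(input)
    }
}
```

Doing the check on the first byte keeps the IPv6 decision out of the IDNA path entirely, so bracketed input never reaches the ASCII deny list.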

## Known spec violations

* The `verify_dns_length` behavior that this crate implements allows a trailing dot in the input as required by the UTS 46 test suite despite the UTS 46 spec saying that this isn't allowed.
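
To make the trailing-dot behavior concrete, here is a rough standalone sketch of a DNS length check that tolerates one trailing root dot. It assumes the usual _VerifyDnsLength_ limits from UTS 46 (1 to 253 bytes for the name without the root dot, 1 to 63 bytes per label) and ASCII input; `verify_dns_length_sketch` is a hypothetical name, not the crate's function, whose exact behavior may differ.

```rust
/// Sketch of a DNS length check on an already-ASCII domain name.
/// Assumption: a single trailing dot (the root label) is accepted and ignored.
fn verify_dns_length_sketch(domain: &str) -> bool {
    // Drop at most one trailing root dot before measuring.
    let domain = domain.strip_suffix('.').unwrap_or(domain);
    if domain.is_empty() || domain.len() > 253 {
        return false;
    }
    // Every remaining label must be 1..=63 bytes long.
    domain
        .split('.')
        .all(|label| !label.is_empty() && label.len() <= 63)
}
```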

## Breaking changes since 0.5.0

* IDNA 2008 rules are no longer supported. Attempting to enable them panics immediately.
* `check_hyphens` now also rejects the hyphen in the third and fourth position in a label.
* `domain_to_ascii_strict` now performs the _CheckHyphens_ check (matching previous documentation).
* The ContextJ rules are now implemented and always enabled, so input that fails those rules is rejected.
* The `Idna::to_ascii_inner` method has been removed. It didn't make sense as a public method, since callers were unable to figure out if there were errors. (A GitHub search found no callers for this method.)
* Punycode labels whose decoding does not yield any non-ASCII characters are now treated as being in error.
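
As an illustration of the stricter `check_hyphens` rule listed above, the following standalone sketch checks a single label. It assumes the UTS 46 _CheckHyphens_ wording, under which a label fails if it begins or ends with a hyphen or has hyphens in both the third and fourth positions (counted in characters). `passes_hyphen_check` is a hypothetical helper, not the crate's implementation.

```rust
/// Sketch of a per-label hyphen check in the spirit of UTS 46 CheckHyphens.
fn passes_hyphen_check(label: &str) -> bool {
    if label.starts_with('-') || label.ends_with('-') {
        return false;
    }
    // Positions are counted in characters, not bytes.
    let mut rest = label.chars().skip(2);
    let third_and_fourth = rest.next() == Some('-') && rest.next() == Some('-');
    !third_and_fourth
}
```

Under this rule a label shaped like `r3---sn-example` (a made-up name) is rejected, which is the kind of real-world breakage the deprecated `Config::check_hyphens` documentation warns about.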
7 changes: 7 additions & 0 deletions idna/benches/all.rs
@@ -11,6 +11,12 @@ fn to_unicode_puny_label(bench: &mut Bencher) {
     bench.iter(|| config.to_unicode(black_box(encoded)));
 }

+fn to_ascii_already_puny_label(bench: &mut Bencher) {
+    let encoded = "abc.xn--mgbcm";
+    let config = Config::default();
+    bench.iter(|| config.to_ascii(black_box(encoded)));
+}
+
 fn to_unicode_ascii(bench: &mut Bencher) {
     let encoded = "example.com";
     let config = Config::default();
@@ -47,6 +53,7 @@ benchmark_group!(
     to_unicode_ascii,
     to_unicode_merged_label,
     to_ascii_puny_label,
+    to_ascii_already_puny_label,
     to_ascii_simple,
     to_ascii_merged,
 );
8,727 changes: 0 additions & 8,727 deletions idna/src/IdnaMappingTable.txt

This file was deleted.

246 changes: 246 additions & 0 deletions idna/src/deprecated.rs
@@ -0,0 +1,246 @@
// Copyright 2013-2014 The rust-url developers.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

//! Deprecated API for [*Unicode IDNA Compatibility Processing*
//! (Unicode Technical Standard #46)](http://www.unicode.org/reports/tr46/)
#![allow(deprecated)]

use alloc::borrow::Cow;
use alloc::string::String;

use crate::uts46::*;
use crate::Errors;

/// Performs preprocessing equivalent to UTS 46 transitional processing
/// if `transitional` is `true`. If `transitional` is `false`, merely
/// lets the input pass through as-is.
///
/// The output of this function is to be passed to [`Uts46::process`].
///
/// Deprecated, since this functionality is deprecated in UTS 46 itself,
/// and none of Firefox, Safari, or Chrome use transitional processing.
#[deprecated]
fn map_transitional(domain: &str, transitional: bool) -> Cow<'_, str> {
    if transitional {
        let mut chars = domain.chars();
        loop {
            let prev = chars.clone();
            if let Some(c) = chars.next() {
                match c {
                    'ß' | 'ẞ' | 'ς' | '\u{200C}' | '\u{200D}' => {
                        let mut s = String::with_capacity(domain.len());
                        let tail = prev.as_str();
                        let head = &domain[..domain.len() - tail.len()];
                        s.push_str(head);
                        for c in tail.chars() {
                            match c {
                                'ß' | 'ẞ' => {
                                    s.push_str("ss");
                                }
                                'ς' => {
                                    s.push('σ');
                                }
                                '\u{200C}' | '\u{200D}' => {}
                                _ => {
                                    s.push(c);
                                }
                            }
                        }
                        return Cow::Owned(s);
                    }
                    _ => {}
                }
            } else {
                break;
            }
        }
    }
    Cow::Borrowed(domain)
}

/// Deprecated. Use the crate-top-level functions or [`Uts46`].
#[derive(Default)]
#[deprecated]
pub struct Idna {
    config: Config,
}

impl Idna {
    pub fn new(config: Config) -> Self {
        Self { config }
    }

    /// [UTS 46 ToASCII](http://www.unicode.org/reports/tr46/#ToASCII)
    #[allow(clippy::wrong_self_convention)]
    pub fn to_ascii(&mut self, domain: &str, out: &mut String) -> Result<(), Errors> {
        let mapped = map_transitional(domain, self.config.transitional_processing);
        match Uts46::new().process(
            mapped.as_bytes(),
            self.config.deny_list(),
            self.config.hyphens(),
            ErrorPolicy::FailFast,
            |_, _, _| false,
            out,
            None,
        ) {
            Ok(ProcessingSuccess::Passthrough) => {
                if self.config.verify_dns_length && !verify_dns_length(&mapped) {
                    return Err(crate::Errors::default());
                }
                out.push_str(&mapped);
                Ok(())
            }
            Ok(ProcessingSuccess::WroteToSink) => {
                if self.config.verify_dns_length && !verify_dns_length(out) {
                    return Err(crate::Errors::default());
                }
                Ok(())
            }
            Err(ProcessingError::ValidityError) => Err(crate::Errors::default()),
            Err(ProcessingError::SinkError) => unreachable!(),
        }
    }

    /// [UTS 46 ToUnicode](http://www.unicode.org/reports/tr46/#ToUnicode)
    #[allow(clippy::wrong_self_convention)]
    pub fn to_unicode(&mut self, domain: &str, out: &mut String) -> Result<(), Errors> {
        let mapped = map_transitional(domain, self.config.transitional_processing);
        match Uts46::new().process(
            mapped.as_bytes(),
            self.config.deny_list(),
            self.config.hyphens(),
            ErrorPolicy::MarkErrors,
            |_, _, _| true,
            out,
            None,
        ) {
            Ok(ProcessingSuccess::Passthrough) => {
                out.push_str(&mapped);
                Ok(())
            }
            Ok(ProcessingSuccess::WroteToSink) => Ok(()),
            Err(ProcessingError::ValidityError) => Err(crate::Errors::default()),
            Err(ProcessingError::SinkError) => unreachable!(),
        }
    }
}

/// Deprecated configuration API.
#[derive(Clone, Copy)]
#[must_use]
#[deprecated]
pub struct Config {
    use_std3_ascii_rules: bool,
    transitional_processing: bool,
    verify_dns_length: bool,
    check_hyphens: bool,
}

/// The defaults are those of _beStrict=false_ in the [WHATWG URL Standard](https://url.spec.whatwg.org/#idna)
impl Default for Config {
    fn default() -> Self {
        Config {
            use_std3_ascii_rules: false,
            transitional_processing: false,
            check_hyphens: false,
            // Only use for to_ascii, not to_unicode
            verify_dns_length: false,
        }
    }
}

impl Config {
    /// Whether to enforce the STD3 ASCII deny list.
    ///
    /// `true` for STD3, `false` for no deny list.
    ///
    /// Note that `true` rejects pseudo-hosts used by various TXT record-based protocols.
    #[inline]
    pub fn use_std3_ascii_rules(mut self, value: bool) -> Self {
        self.use_std3_ascii_rules = value;
        self
    }

    /// Whether to enable (deprecated) transitional processing.
    ///
    /// Note that Firefox, Safari, and Chrome do not use transitional
    /// processing.
    #[inline]
    #[allow(unused_mut)]
    pub fn transitional_processing(mut self, value: bool) -> Self {
        self.transitional_processing = value;
        self
    }

    /// Whether the _VerifyDNSLength_ operation should be performed
    /// by `to_ascii`.
    #[inline]
    pub fn verify_dns_length(mut self, value: bool) -> Self {
        self.verify_dns_length = value;
        self
    }

    /// Whether to enforce IETF rules for hyphen placement.
    ///
    /// `true` to deny hyphens in the first, last, third, and fourth
    /// position of a label. `false` to not enforce.
    ///
    /// Note that `true` rejects real-world names, including YouTube CDN nodes
    /// and some GitHub user pages.
    #[inline]
    pub fn check_hyphens(mut self, value: bool) -> Self {
        self.check_hyphens = value;
        self
    }

    /// Obsolete method retained to ease migration. The argument must be `false`.
    ///
    /// # Panics
    ///
    /// If the argument is `true`.
    #[inline]
    #[allow(unused_mut)]
    pub fn use_idna_2008_rules(mut self, value: bool) -> Self {
        assert!(!value, "IDNA 2008 rules are no longer supported");
        self
    }

    /// Compute the deny list
    fn deny_list(&self) -> AsciiDenyList {
        if self.use_std3_ascii_rules {
            AsciiDenyList::STD3
        } else {
            AsciiDenyList::EMPTY
        }
    }

    /// Compute the hyphen mode
    fn hyphens(&self) -> Hyphens {
        if self.check_hyphens {
            Hyphens::Check
        } else {
            Hyphens::Allow
        }
    }

    /// [UTS 46 ToASCII](http://www.unicode.org/reports/tr46/#ToASCII)
    pub fn to_ascii(self, domain: &str) -> Result<String, Errors> {
        let mut result = String::with_capacity(domain.len());
        let mut codec = Idna::new(self);
        codec.to_ascii(domain, &mut result).map(|()| result)
    }

    /// [UTS 46 ToUnicode](http://www.unicode.org/reports/tr46/#ToUnicode)
    pub fn to_unicode(self, domain: &str) -> (String, Result<(), Errors>) {
        let mut codec = Idna::new(self);
        let mut out = String::with_capacity(domain.len());
        let result = codec.to_unicode(domain, &mut out);
        (out, result)
    }
}
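
For readers who want to experiment with the transitional pre-mapping outside the crate, here is a simplified standalone restatement of what `map_transitional` in `deprecated.rs` does. It is a sketch rather than the PR's code: it uses `std` instead of `alloc`, the name `map_transitional_sketch` is made up, and it locates the first affected character with `str::find` instead of the iterator-cloning loop. The mapping itself (ß and ẞ to `ss`, ς to σ, U+200C and U+200D dropped) is taken from the function above.

```rust
use std::borrow::Cow;

/// Simplified restatement of the transitional pre-mapping.
/// Borrows the input when nothing needs to change.
fn map_transitional_sketch(domain: &str, transitional: bool) -> Cow<'_, str> {
    if !transitional {
        return Cow::Borrowed(domain);
    }
    // Byte offset of the first character that transitional processing maps.
    let first = match domain
        .find(|c: char| matches!(c, 'ß' | 'ẞ' | 'ς' | '\u{200C}' | '\u{200D}'))
    {
        Some(i) => i,
        None => return Cow::Borrowed(domain),
    };
    let mut s = String::with_capacity(domain.len());
    // Copy the untouched prefix, then map the rest character by character.
    s.push_str(&domain[..first]);
    for c in domain[first..].chars() {
        match c {
            'ß' | 'ẞ' => s.push_str("ss"),
            'ς' => s.push('σ'),
            '\u{200C}' | '\u{200D}' => {}
            _ => s.push(c),
        }
    }
    Cow::Owned(s)
}
```

Returning `Cow` means the common case, a name with none of these characters, costs no allocation, which matches the passthrough-oriented design of the rest of the PR.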