herniqeu/whatwgpy

whatwgpy

A pure Python implementation of the WHATWG URL Standard that achieves 100% conformance with the Web Platform Tests.

This parser follows the living standard at https://url.spec.whatwg.org/ and correctly handles all 869 URL parsing test cases and all 87 ToASCII domain encoding tests from the official WPT suite.

Test Results

WPT URL Parsing:   869/869  (100.0%)
ToASCII Domains:    87/87   (100.0%)

Installation

pip install -r requirements.txt

The only dependency is idna for internationalized domain name processing.

Usage

from url_parser import parse_url

url = parse_url("https://user:pass@example.com:8080/path?query#fragment")
print(url.href)            # https://user:pass@example.com:8080/path?query#fragment
print(url.hostname_str)    # example.com
print(url.pathname)        # /path

# Relative URL resolution
url = parse_url("/new/path", base="https://example.com/old/path")
print(url.href)        # https://example.com/new/path

URLRecord Properties

The parse_url function returns a URLRecord containing:

scheme          URL scheme without colon (e.g., "https")
username        Username component
password        Password component  
host            Domain string, IPv4Address, or IPv6Address
port            Port number or None for default ports
path            List of path segments or string for opaque paths
query           Query string without leading ?
fragment        Fragment without leading #

Formatted output properties:

href            Complete serialized URL
origin          Serialized origin (scheme, host, port), as used for CORS checks
protocol        Scheme with colon (e.g., "https:")
host_str        Host:port (port omitted if default)
hostname_str    Host without port
port_str        Port as string or empty
pathname        Serialized path with leading /
search          Query with leading ? or empty
hash            Fragment with leading # or empty
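
The relationship between the record fields and the formatted properties can be sketched with a small stand-in class. This is an illustration only: MiniRecord and normalize_port are hypothetical names, not the package's URLRecord, and the sketch assumes a plain domain host and a list path.

```python
from dataclasses import dataclass
from typing import List, Optional

# Default ports for the special schemes, as listed in the WHATWG URL Standard.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "ws": 80, "wss": 443}


def normalize_port(scheme: str, port: Optional[int]) -> Optional[int]:
    """Store None when the port equals the scheme's default port."""
    return None if DEFAULT_PORTS.get(scheme) == port else port


@dataclass
class MiniRecord:
    """Hypothetical stand-in for URLRecord (illustration only)."""
    scheme: str
    host: str
    port: Optional[int]            # None means "default port for the scheme"
    path: List[str]
    query: Optional[str] = None
    fragment: Optional[str] = None

    @property
    def protocol(self) -> str:
        return self.scheme + ":"

    @property
    def host_str(self) -> str:
        return self.host if self.port is None else f"{self.host}:{self.port}"

    @property
    def port_str(self) -> str:
        return "" if self.port is None else str(self.port)

    @property
    def pathname(self) -> str:
        # Each segment is serialized with a leading slash.
        return "".join("/" + segment for segment in self.path)

    @property
    def search(self) -> str:
        return "?" + self.query if self.query else ""

    @property
    def hash(self) -> str:
        return "#" + self.fragment if self.fragment else ""


record = MiniRecord("https", "example.com", normalize_port("https", 443), ["path"], "query")
print(record.host_str, record.pathname, record.search)
```

Because port 443 is the default for https, the stand-in stores None and host_str omits the port, which mirrors the "port omitted if default" rule described above.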

What This Parser Handles

The WHATWG URL Standard defines complex parsing behavior that browsers implement. This parser handles:

Special schemes         http, https, ftp, ws, wss (each with a default port), and file
IPv4 addresses          Decimal, octal (0-prefix), and hex (0x-prefix)
IPv6 addresses          Full form, compressed (::), and embedded IPv4
Domain encoding         IDNA2008 with UTS46 compatibility processing
Percent encoding        Context-specific encode sets per spec
Windows paths           Drive letter normalization in file URLs
Opaque paths            Non-hierarchical URLs like mailto: and javascript:
Relative resolution     Full base URL resolution algorithm
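
To make the IPv4 row concrete, here is a minimal sketch of the IPv4 number rules from the WHATWG URL Standard: a 0x prefix selects hex, a leading 0 selects octal, and the last part may absorb the remaining bytes. The function names are hypothetical and this is not the code in host.py. It also skips the standard's "ends in a number" check that decides whether a host is treated as IPv4 at all.

```python
from typing import Optional

_DIGITS = "0123456789abcdef"


def parse_ipv4_number(part: str) -> Optional[int]:
    """Parse one dotted part as decimal, octal (0-prefix), or hex (0x-prefix)."""
    if not part:
        return None
    radix = 10
    if len(part) >= 2 and part[:2].lower() == "0x":
        part, radix = part[2:], 16
    elif len(part) >= 2 and part[0] == "0":
        part, radix = part[1:], 8
    if not part:                      # "0x" alone parses as zero
        return 0
    allowed = _DIGITS[:radix]
    if any(ch not in allowed for ch in part.lower()):
        return None
    return int(part, radix)


def parse_ipv4(host: str) -> Optional[int]:
    """Return the address as a 32-bit integer, or None on failure."""
    parts = host.split(".")
    if parts[-1] == "" and len(parts) > 1:
        parts.pop()                   # a single trailing dot is tolerated
    if len(parts) > 4:
        return None
    numbers = []
    for part in parts:
        number = parse_ipv4_number(part)
        if number is None:
            return None
        numbers.append(number)
    if any(n > 255 for n in numbers[:-1]):
        return None
    # The last number fills all remaining bytes, so its limit depends on the count.
    if numbers[-1] >= 256 ** (5 - len(numbers)):
        return None
    value = numbers[-1]
    for index, n in enumerate(numbers[:-1]):
        value += n * 256 ** (3 - index)
    return value


def serialize_ipv4(value: int) -> str:
    """Serialize a 32-bit integer as dotted decimal."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(serialize_ipv4(parse_ipv4("0x7f.1")))          # 127.0.0.1
print(serialize_ipv4(parse_ipv4("0300.0250.0.01")))  # 192.168.0.1
```

The digit check is done by hand rather than relying on int() alone, because int() also accepts signs, underscores, and surrounding whitespace that the standard rejects.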

Running Tests

Verify conformance against the Web Platform Tests:

python wpt_runner.py
python toascii_runner.py

Results are saved to the results/ directory.
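
For readers curious what such a conformance check involves, the following is a rough sketch of a pass/fail counter. It assumes the test data follows the layout of WPT's urltestdata.json (comment strings interleaved with objects that carry input, base, and either href or failure: true). The names run_cases and toy_parse are hypothetical, and signalling failure with ValueError is an assumption; wpt_runner.py may work differently.

```python
import json
from typing import Callable, Optional, Tuple

# One comment entry and two test objects, mimicking the urltestdata.json layout.
SAMPLE = json.loads("""
[
  "# comment entries are plain strings",
  {"input": "http://example.com/", "base": null, "href": "http://example.com/"},
  {"input": "http://[", "base": null, "failure": true}
]
""")


def run_cases(cases: list, parse: Callable[[str, Optional[str]], str]) -> Tuple[int, int]:
    """Return (passed, total) for a parser that returns an href or raises ValueError."""
    passed = total = 0
    for case in cases:
        if isinstance(case, str):     # skip comment entries
            continue
        total += 1
        expected = None if case.get("failure") else case["href"]
        try:
            actual = parse(case["input"], case.get("base"))
        except ValueError:            # assumed failure signal
            actual = None
        if actual == expected:
            passed += 1
    return passed, total


def toy_parse(url: str, base: Optional[str] = None) -> str:
    """Placeholder parser for the demo only; a real run would call the package's parser."""
    if "[" in url:
        raise ValueError("invalid host")
    return url


print(run_cases(SAMPLE, toy_parse))   # (2, 2)
```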

Project Structure

url_parser/
    __init__.py         Package exports
    parser.py           State machine with 21 parsing states
    url_record.py       URL record data structure
    host.py             Host parsing, IPv4/IPv6, domain-to-ASCII
    encoding.py         Percent-encoding utilities

wpt_runner.py           WPT conformance test runner
toascii_runner.py       ToASCII test runner  
wpt/                    Web Platform Tests
results/                Test output
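
The 21 states mentioned for parser.py correspond to the states named in the WHATWG URL Standard. As a sketch only, they could be modelled as an enum like the one below. The identifiers are hypothetical and follow the standard's state names, not necessarily those used in parser.py.

```python
from enum import Enum, auto


class State(Enum):
    """Parser states, named after the WHATWG URL Standard (hypothetical identifiers)."""
    SCHEME_START = auto()
    SCHEME = auto()
    NO_SCHEME = auto()
    SPECIAL_RELATIVE_OR_AUTHORITY = auto()
    PATH_OR_AUTHORITY = auto()
    RELATIVE = auto()
    RELATIVE_SLASH = auto()
    SPECIAL_AUTHORITY_SLASHES = auto()
    SPECIAL_AUTHORITY_IGNORE_SLASHES = auto()
    AUTHORITY = auto()
    HOST = auto()
    HOSTNAME = auto()
    PORT = auto()
    FILE = auto()
    FILE_SLASH = auto()
    FILE_HOST = auto()
    PATH_START = auto()
    PATH = auto()
    OPAQUE_PATH = auto()
    QUERY = auto()
    FRAGMENT = auto()


def after_scheme_start(first_char: str) -> State:
    """Scheme start state: an ASCII letter begins a scheme, anything else means no scheme."""
    if first_char.isascii() and first_char.isalpha():
        return State.SCHEME
    return State.NO_SCHEME


print(len(State), after_scheme_start("h").name, after_scheme_start("/").name)
```

An input such as "/new/path" takes the no-scheme branch on its first character, which is what later routes it into relative resolution against the base URL.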
