Hypertext Abstract Syntax Tree format.
hast is a specification for representing HTML (and embedded SVG or MathML) as an abstract syntax tree. It implements the unist spec.
This document may not be released.
See releases for released documents.
The latest released version is 2.4.0.
- Introduction
- Types
- Nodes (abstract)
- Nodes
- Other types
- Glossary
- List of utilities
- Related HTML utilities
- References
- Security
- Related
- Contribute
- Acknowledgments
- License
This document defines a format for representing hypertext as an abstract syntax tree. Development of hast started in April 2016 for rehype. This specification is written in a Web IDL-like grammar.
hast extends unist, a format for syntax trees, to benefit from its ecosystem of utilities.
hast relates to JavaScript in that it has an ecosystem of utilities for working with compliant syntax trees in JavaScript. However, hast is not limited to JavaScript and can be used in other programming languages.
hast relates to the unified and rehype projects in that hast syntax trees are used throughout their ecosystems.
The reasons for introducing a new "virtual" DOM are primarily:
- The DOM is very heavy to implement outside of the browser; a lean and stripped-down virtual DOM can be used everywhere
- Most virtual DOMs do not focus on ease of use in transformations
- Other virtual DOMs cannot represent the syntax of HTML in its entirety (think comments and document types)
- Neither the DOM nor virtual DOMs focus on positional information
If you are using TypeScript, you can use the hast types by installing them with npm:
npm install @types/hast
interface Literal <: UnistLiteral {
  value: string
}
Literal (UnistLiteral) represents a node in hast containing a value.
interface Parent <: UnistParent {
  children: [Comment | Doctype | Element | Text]
}
Parent (UnistParent) represents a node in hast containing other nodes (said to be children).
Its content is limited to only other hast content.
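As a minimal sketch of how the Parent and Literal shapes fit together, the following TypeScript walks a tree depth first and joins the value of every text node. The `Hast…` type names and the `collectText` helper are assumptions made for this illustration (the real types come from `@types/hast`); they only mirror the fields described in this document.

```typescript
// Simplified local stand-ins for the hast node types (assumption: prefixed
// with `Hast` so they do not clash with the DOM types of the same name).
type HastText = {type: 'text'; value: string}
type HastComment = {type: 'comment'; value: string}
type HastDoctype = {type: 'doctype'}
type HastElement = {
  type: 'element'
  tagName: string
  properties: Record<string, unknown>
  children: Array<HastComment | HastElement | HastText>
}
type HastRoot = {
  type: 'root'
  children: Array<HastComment | HastDoctype | HastElement | HastText>
}
type HastNode = HastComment | HastDoctype | HastElement | HastRoot | HastText

// Hypothetical helper: concatenate the `value` of every text node, depth first.
function collectText(node: HastNode): string {
  if (node.type === 'text') return node.value
  if (node.type === 'element' || node.type === 'root') {
    // Parents hold their content in `children`.
    const children: HastNode[] = node.children
    return children.map((child) => collectText(child)).join('')
  }
  // Comments and doctypes contribute no text.
  return ''
}

const tree: HastRoot = {
  type: 'root',
  children: [
    {type: 'doctype'},
    {
      type: 'element',
      tagName: 'p',
      properties: {},
      children: [
        {type: 'text', value: 'Hello, '},
        {type: 'comment', value: 'Charlie'},
        {
          type: 'element',
          tagName: 'b',
          properties: {},
          children: [{type: 'text', value: 'world'}]
        }
      ]
    }
  ]
}

const collected = collectText(tree) // 'Hello, world'
```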
interface Comment <: Literal {
  type: 'comment'
}
Comment (Literal) represents a Comment ([DOM]).
For example, the following HTML:
<!--Charlie-->
Yields:
{type: 'comment', value: 'Charlie'}
interface Doctype <: Node {
  type: 'doctype'
}
Doctype (Node) represents a DocumentType ([DOM]).
For example, the following HTML:
<!doctype html>
Yields:
{type: 'doctype'}
interface Element <: Parent {
  type: 'element'
  tagName: string
  properties: Properties
  content: Root?
  children: [Comment | Element | Text]
}
Element (Parent) represents an Element ([DOM]).
A tagName field must be present.
It represents the element's local name
([DOM]).
The properties field represents information associated with the element.
The value of the properties field implements the
Properties interface.
If the tagName field is 'template',
a content field can be present.
The value of the content field implements the Root interface.
If the tagName field is 'template',
the element must be a leaf.
If the tagName field is 'noscript',
its children should be represented as if
scripting is disabled ([HTML]).
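The template rules can be illustrated with a small sketch. It shows one possible tree for `<template><b>Golf</b></template>`, with the inner markup stored in `content` and `children` left empty. The `Hast…` type names and the `followsTemplateRules` helper are hypothetical, and the check reads the rules above as "only a template may carry `content`", which is an interpretation rather than something the text states outright.

```typescript
// Simplified local stand-ins for the hast node types (assumption: the real
// definitions live in `@types/hast`).
type HastText = {type: 'text'; value: string}
type HastComment = {type: 'comment'; value: string}
type HastDoctype = {type: 'doctype'}
type HastRoot = {
  type: 'root'
  children: Array<HastComment | HastDoctype | HastElement | HastText>
}
type HastElement = {
  type: 'element'
  tagName: string
  properties: Record<string, unknown>
  content?: HastRoot
  children: Array<HastComment | HastElement | HastText>
}

// The markup inside the template lives in `content`;
// `children` stays empty because a template must be a leaf.
const template: HastElement = {
  type: 'element',
  tagName: 'template',
  properties: {},
  children: [],
  content: {
    type: 'root',
    children: [
      {
        type: 'element',
        tagName: 'b',
        properties: {},
        children: [{type: 'text', value: 'Golf'}]
      }
    ]
  }
}

// Hypothetical check: templates are leaves, other elements carry no `content`.
function followsTemplateRules(element: HastElement): boolean {
  if (element.tagName === 'template') return element.children.length === 0
  return element.content === undefined
}
```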
For example, the following HTML:
<a href="https://alpha.com" class="bravo" download></a>
Yields:
{
  type: 'element',
  tagName: 'a',
  properties: {
    href: 'https://alpha.com',
    className: ['bravo'],
    download: true
  },
  children: []
}
interface Root <: Parent {
  type: 'root'
}
Root (Parent) represents a document.
Root can be used as the root of a tree,
or as a value of the content field on a 'template'
Element,
never as a child.
interface Text <: Literal {
  type: 'text'
}
Text (Literal) represents a Text ([DOM]).
For example, the following HTML:
<span>Foxtrot</span>
Yields:
{
  type: 'element',
  tagName: 'span',
  properties: {},
  children: [{type: 'text', value: 'Foxtrot'}]
}
interface Properties {}
Properties represents information associated with an element.
Every field must be a PropertyName and every value a PropertyValue.
typedef string PropertyName
Property names are keys on Properties objects and reflect
HTML,
SVG,
ARIA,
XML,
XMLNS,
or XLink attribute names.
Often,
they have the same value as the corresponding attribute
(for example,
id is a property name reflecting the id attribute name),
but there are some notable differences.
These rules aren't simple. Use
hastscript (or property-information directly) to help.
The following rules are used to transform HTML attribute names to property names. These rules are based on how ARIA is reflected in the DOM ([ARIA]), and differ from how some (older) HTML attributes are reflected in the DOM.
- any name referencing a combination of multiple words (such as "stroke miter limit") becomes a camelcased property name capitalizing each word boundary; this includes combinations that are sometimes written as several words; for example, stroke-miterlimit becomes strokeMiterLimit, autocorrect becomes autoCorrect, and allowfullscreen becomes allowFullScreen
- any name that can be hyphenated becomes a camelcased property name capitalizing each boundary; for example, "read-only" becomes readOnly
- compound words that are not used with spaces or hyphens are treated as a normal word and the previous rules apply; for example, "placeholder", "strikethrough", and "playback" stay the same
- acronyms in names are treated as a normal word and the previous rules apply; for example, itemid becomes itemId and bgcolor becomes bgColor
Some jargon is seen as one word even though it may not be seen as such by
dictionaries.
For example,
nohref becomes noHref,
playsinline becomes playsInline,
and accept-charset becomes acceptCharset.
The HTML attributes class and for respectively become className and
htmlFor in alignment with the DOM.
No other attributes gain different names as properties,
other than a change in casing.
property-information lists all property names.
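To make the naming rules concrete, here is a rough sketch that maps a handful of attribute names to property names. The lookup table holds only pairs taken from the examples in this section, and the fallback simply camelcases at hyphens. Both the table and the `toPropertyName` helper are assumptions made for illustration; they are not a complete implementation, and real code should rely on property-information or hastscript.

```typescript
// A few attribute name → property name pairs copied from the examples above.
// Assumption: deliberately incomplete; `property-information` has the full list.
const knownNames: Record<string, string> = {
  class: 'className',
  for: 'htmlFor',
  allowfullscreen: 'allowFullScreen',
  autocorrect: 'autoCorrect',
  bgcolor: 'bgColor',
  itemid: 'itemId',
  nohref: 'noHref',
  playsinline: 'playsInline',
  readonly: 'readOnly',
  'stroke-miterlimit': 'strokeMiterLimit'
}

// Hypothetical helper: use the table when possible, otherwise capitalize the
// letter after each hyphen (`accept-charset` becomes `acceptCharset`).
function toPropertyName(attributeName: string): string {
  const lower = attributeName.toLowerCase()
  if (Object.prototype.hasOwnProperty.call(knownNames, lower)) {
    return knownNames[lower]
  }
  return lower.replace(/-([a-z])/g, (_match: string, letter: string) =>
    letter.toUpperCase()
  )
}

const examples = ['class', 'for', 'accept-charset', 'id', 'placeholder'].map(
  (attribute) => toPropertyName(attribute)
)
// ['className', 'htmlFor', 'acceptCharset', 'id', 'placeholder']
```

Names that need word-boundary knowledge, such as allowfullscreen, cannot be derived from the spelling alone, which is why the sketch falls back to a table for them.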
The property name rules differ from how HTML is reflected in the DOM for the following attributes:
- charoff becomes charOff (not chOff)
- char stays char (does not become ch)
- rel stays rel (does not become relList)
- checked stays checked (does not become defaultChecked)
- muted stays muted (does not become defaultMuted)
- value stays value (does not become defaultValue)
- selected stays selected (does not become defaultSelected)
- allowfullscreen becomes allowFullScreen (not allowFullscreen)
- hreflang becomes hrefLang (not hreflang)
- autoplay becomes autoPlay (not autoplay)
- autocomplete becomes autoComplete (not autocomplete)
- autofocus becomes autoFocus (not autofocus)
- enctype becomes encType (not enctype)
- formenctype becomes formEncType (not formEnctype)
- vspace becomes vSpace (not vspace)
- hspace becomes hSpace (not hspace)
- lowsrc becomes lowSrc (not lowsrc)
typedef any PropertyValue
Property values should reflect the data type determined by their property name.
For example,
the HTML <div hidden></div> has a hidden attribute,
which is reflected as a hidden property name set to the property value true,
and <input minlength="5">,
which has a minlength attribute,
is reflected as a minLength property name set to the property value 5.
In JSON, the value
null must be treated as if the property was not included. In JavaScript, both null and undefined must be similarly ignored.
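One way to honor this rule in JavaScript or TypeScript is to filter properties before using them. The sketch below assumes a loosely typed `Properties` record (hast itself types property values as `any`) and a hypothetical `presentProperties` helper; neither name comes from a hast package.

```typescript
// Assumption: a loose model of property values for this example only.
type PropertyValue =
  | boolean
  | number
  | string
  | Array<number | string>
  | null
  | undefined
type Properties = Record<string, PropertyValue>
type PresentValue = Exclude<PropertyValue, null | undefined>

// Hypothetical helper: return only the properties that count as included,
// treating `null` and `undefined` as if the property was never set.
function presentProperties(
  properties: Properties
): Array<[string, PresentValue]> {
  const result: Array<[string, PresentValue]> = []
  for (const key of Object.keys(properties)) {
    const value = properties[key]
    if (value !== null && value !== undefined) {
      result.push([key, value])
    }
  }
  return result
}

// `hidden` and `minLength` mirror the examples above; `title` and `id` are
// set to values that must be ignored.
const sample: Properties = {
  hidden: true,
  minLength: 5,
  title: null,
  id: undefined
}

const presentKeys = presentProperties(sample).map((entry) => entry[0])
// ['hidden', 'minLength']
```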
The DOM has strict rules on how it coerces HTML to expected values,
whereas hast is more lenient in how it reflects the source.
Where the DOM treats <div hidden="no"></div> as having a value of true and
<img width="yes"> as having a value of 0,
these should be reflected as 'no' and 'yes',
respectively,
in hast.
The reason for this is to allow plugins and utilities to inspect these non-standard values.
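The lenient behavior described above could be sketched as follows. The example assumes the caller already knows which kind of value a property expects (in practice that knowledge would come from property-information); the `Kind` type and `toPropertyValue` helper are hypothetical and show only one possible approach.

```typescript
// Assumption: the expected kind is supplied by the caller.
type Kind = 'boolean' | 'number' | 'string'

// Hypothetical helper: turn a raw attribute value into a property value,
// keeping non-standard input as the original string instead of coercing it.
function toPropertyValue(
  kind: Kind,
  attributeName: string,
  raw: string
): boolean | number | string {
  if (kind === 'boolean') {
    // `hidden` and `hidden="hidden"` both mean true; anything else is kept.
    if (raw === '' || raw.toLowerCase() === attributeName.toLowerCase()) {
      return true
    }
    return raw
  }

  if (kind === 'number') {
    const parsed = Number(raw)
    // Only accept input that is non-empty and really numeric.
    if (raw.trim() !== '' && !isNaN(parsed)) return parsed
    return raw
  }

  return raw
}

const hiddenBare = toPropertyValue('boolean', 'hidden', '') // true
const hiddenNo = toPropertyValue('boolean', 'hidden', 'no') // 'no'
const minLength = toPropertyValue('number', 'minlength', '5') // 5
const widthYes = toPropertyValue('number', 'width', 'yes') // 'yes'
```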
The DOM also specifies comma-separated and space-separated lists as attribute
values.
In hast, these should be treated as ordered lists.
For example,
<div class="alpha bravo"></div> is represented as ['alpha', 'bravo'].
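A minimal sketch of turning such attribute values into ordered lists is shown below. The two helper names are made up for this example and the parsing is deliberately simple; the space-separated-tokens and comma-separated-tokens packages listed later in this document are the place to look for real implementations.

```typescript
// Hypothetical helper: split on runs of ASCII whitespace, keeping order.
function parseSpaceSeparated(value: string): string[] {
  const trimmed = value.trim()
  return trimmed === '' ? [] : trimmed.split(/[ \t\n\f\r]+/)
}

// Hypothetical helper: split on commas and trim each token, keeping order.
function parseCommaSeparated(value: string): string[] {
  const tokens = value.split(',').map((token) => token.trim())
  // An empty input or a trailing comma leaves one empty token at the end.
  if (tokens.length > 0 && tokens[tokens.length - 1] === '') tokens.pop()
  return tokens
}

const classNames = parseSpaceSeparated('alpha bravo')
// ['alpha', 'bravo']
const accepted = parseCommaSeparated('image/png, image/jpeg')
// ['image/png', 'image/jpeg']
```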
There's no special format for the property value of the
style property name.
See § Glossary in syntax-tree/unist.
See § List of utilities in syntax-tree/unist
for more utilities.
- hastscript: create trees
- hast-util-assert: assert nodes
- hast-util-class-list: simulate the browser's classList API for hast nodes
- hast-util-classnames: merge class names together
- hast-util-embedded: check if a node is an embedded element
- hast-util-excerpt: truncate the tree to a comment
- hast-util-find-and-replace: find and replace text in a tree
- hast-util-format: format whitespace
- hast-util-from-dom: transform from DOM tree
- hast-util-from-html: parse from HTML
- hast-util-from-parse5: transform from Parse5's AST
- hast-util-from-selector: parse CSS selectors to nodes
- hast-util-from-string: set the plain-text value of a node (textContent)
- hast-util-from-text: set the plain-text value of a node (innerText)
- hast-util-from-webparser: transform Webparser's AST to hast
- hast-util-has-property: check if an element has a certain property
- hast-util-heading: check if a node is heading content
- hast-util-heading-rank: get the rank (also known as depth or level) of headings
- hast-util-interactive: check if a node is interactive
- hast-util-is-body-ok-link: check if a link element is "Body OK"
- hast-util-is-conditional-comment: check if node is a conditional comment
- hast-util-is-css-link: check if node is a CSS link
- hast-util-is-css-style: check if node is a CSS style
- hast-util-is-element: check if node is a (certain) element
- hast-util-is-event-handler: check if property is an event handler
- hast-util-is-javascript: check if node is a JavaScript script
- hast-util-labelable: check if node is labelable
- hast-util-minify-whitespace: minify whitespace between elements
- hast-util-parse-selector: create an element from a simple CSS selector
- hast-util-phrasing: check if a node is phrasing content
- hast-util-raw: parse a tree again
- hast-util-reading-time: estimate the reading time
- hast-util-sanitize: sanitize nodes
- hast-util-script-supporting: check if node is script-supporting content
- hast-util-select: querySelector, querySelectorAll, and matches
- hast-util-sectioning: check if node is sectioning content
- hast-util-shift-heading: change heading rank (depth, level)
- hast-util-table-cell-style: transform deprecated styling attributes on table cells to inline styles
- hast-util-to-dom: transform to a DOM tree
- hast-util-to-estree: transform to estree (JavaScript AST) JSX
- hast-util-to-html: serialize as HTML
- hast-util-to-jsx: transform hast to JSX
- hast-util-to-jsx-runtime: transform to preact, react, solid, svelte, vue, etc
- hast-util-to-mdast: transform to mdast (markdown)
- hast-util-to-nlcst: transform to nlcst (natural language)
- hast-util-to-parse5: transform to Parse5's AST
- hast-util-to-portable-text: transform to portable text
- hast-util-to-string: get the plain-text value of a node (textContent)
- hast-util-to-text: get the plain-text value of a node (innerText)
- hast-util-to-xast: transform to xast (xml)
- hast-util-transparent: check if node is transparent content
- hast-util-truncate: truncate the tree to a certain number of characters
- hast-util-whitespace: check if node is inter-element whitespace
- a-rel: List of link types for rel on a / area
- aria-attributes: List of ARIA attributes
- collapse-white-space: Replace multiple white-space characters with a single space
- comma-separated-tokens: Parse/stringify comma separated tokens
- html-tag-names: List of HTML tag names
- html-dangerous-encodings: List of dangerous HTML character encoding labels
- html-encodings: List of HTML character encoding labels
- html-element-attributes: Map of HTML attributes
- html-event-attributes: List of HTML event handler content attributes
- html-void-elements: List of void HTML tag names
- link-rel: List of link types for rel on link
- mathml-tag-names: List of MathML tag names
- meta-name: List of values for name on meta
- property-information: Information on HTML properties
- space-separated-tokens: Parse/stringify space separated tokens
- svg-tag-names: List of SVG tag names
- svg-element-attributes: Map of SVG attributes
- svg-event-attributes: List of SVG event handler content attributes
- web-namespaces: Map of web namespaces
- unist: Universal Syntax Tree. T. Wormer; et al.
- JavaScript: ECMAScript Language Specification. Ecma International.
- HTML: HTML Standard, A. van Kesteren; et al. WHATWG.
- DOM: DOM Standard, A. van Kesteren, A. Gregor, Ms2ger. WHATWG.
- SVG: Scalable Vector Graphics (SVG), N. Andronikos, R. Atanassov, T. Bah, B. Birtles, B. Brinza, C. Concolato, E. Dahlström, C. Lilley, C. McCormack, D. Schepers, R. Schwerdtfeger, D. Storey, S. Takagi, J. Watt. W3C.
- MathML: Mathematical Markup Language Standard, D. Carlisle, P. Ion, R. Miner. W3C.
- ARIA: Accessible Rich Internet Applications (WAI-ARIA), J. Diggs, J. Craig, S. McCarron, M. Cooper. W3C.
- JSON: The JavaScript Object Notation (JSON) Data Interchange Format, T. Bray. IETF.
- Web IDL: Web IDL, C. McCormack. W3C.
As hast represents HTML,
and improper use of HTML can open you up to a
cross-site scripting (XSS) attack,
improper use of hast is also unsafe.
Always be careful with user input and use
hast-util-sanitize to make the hast tree safe.
- mdast: Markdown Abstract Syntax Tree format
- nlcst: Natural Language Concrete Syntax Tree format
- xast: Extensible Abstract Syntax Tree
See contributing.md in
syntax-tree/.github for ways to get started.
See support.md for ways to get help.
A curated list of awesome syntax-tree, unist, mdast, hast, xast, and nlcst resources can be found in awesome syntax-tree.
This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.
The initial release of this project was authored by @wooorm.
Special thanks to @eush77 for their work, ideas, and incredibly valuable feedback!
Thanks to @andrewburgess, @arobase-che, @arystan-sw, @BarryThePenguin, @brechtcs, @ChristianMurphy, @ChristopherBiscardi, @craftzdog, @cupojoe, @davidtheclark, @derhuerst, @detj, @DxCx, @erquhart, @flurmbo, @Hamms, @Hypercubed, @inklesspen, @jeffal, @jlevy, @Justineo, @lfittl, @kgryte, @kmck, @kthjm, @KyleAMathews, @macklinu, @medfreeman, @Murderlon, @nevik, @nokome, @phiresky, @revolunet, @rhysd, @Rokt33r, @rubys, @s1n, @Sarah-Seo, @sethvincent, @simov, @StarpTech, @stefanprobst, @stuff, @subhero24, @tripodsan, @tunnckoCore, @vhf, @voischev, and @zjaml, for contributing to hast and related projects!