Version 1.0
Created by Danslav Slavenskoj
Date: May 2025
- JSON Schema - Full validation schema for LVTag format
- Classifier Definitions - Machine-readable classifier specifications
The Language Variant Tag (LVTag) format is a systematic approach to language classification that extends the BCP 47 standard using private-use subtags. It enables precise identification of language varieties across multiple dimensions including formality, politeness, domain, and orthography.
Classification Rigor: LVTag brings systematic organization to language tagging by providing clear, separate dimensions for different types of variation. Unlike existing subtags and systems that mix different categories at the same level, LVTag maintains strict separation between formality, politeness, domain, and other dimensions.
Standards Compatibility: LVTag is fully compliant with BCP 47 (RFC 5646) and works seamlessly with:
- IANA Language Subtag Registry
- ISO 639 language codes
- Unicode CLDR
- W3C language tags
- HTTP Accept-Language headers
- XML lang attributes
- HTML lang attributes
Technology Integration: LVTag tags can be used directly in:
- Natural Language Processing (NLP) pipelines
- Machine Translation systems
- Content Management Systems (CMS)
- Language detection libraries
- Search engines and information retrieval
- Web applications and APIs
- Localization workflows
Use Cases:
- Audience Targeting: Match content to appropriate audiences based on register and domain
- Translation Quality: Maintain appropriate formality and politeness levels in machine translation
- Language Learning: Teach learners appropriate register for different contexts
- Corpus Linguistics: Build precisely tagged corpora for research
- Social Media Analysis: Classify user-generated content by register and domain
- Customer Service: Route messages based on formality and domain to appropriate agents
While BCP 47 provides excellent support for identifying languages, scripts, and regions, it lacks standardized mechanisms for capturing sociolinguistic variation within a language. Current standards don't address:
- Register Variation: No way to distinguish between formal and informal varieties of the same language
- Politeness Levels: Critical for languages like Japanese, Korean, and Thai where politeness is grammatically encoded
- Domain-Specific Language: No standard for marking technical, medical, or legal language varieties
- Sociolects: No mechanism for identifying social group varieties (youth language, professional jargon)
- Historical Stages: Limited support for distinguishing classical from modern forms
- Formality Gradients: No numeric scale for computational processing of register
- Proto-Languages: Inconsistent encoding - some proto-languages have ISO codes (e.g., `ine` for PIE) while others don't, and ISO 639-5 family codes aren't valid in BCP 47 tags, creating a confusing landscape for historical linguistics
- Orthographic Variation: While BCP 47 handles scripts, it doesn't effectively capture variations within scripts (spelling reforms, romanization systems, competing standards) that fundamentally affect text processing, search, and spell-checking
LVTag fills these gaps using BCP 47's private-use extension mechanism (-x-), providing a systematic, machine-readable way to encode these critical dimensions of language variation while maintaining full backward compatibility.
The advent of large language models and sophisticated NLP tools has made precise language variety classification not just useful but essential. Modern systems need to:
- Generate text appropriate to specific contexts (formal vs. informal, polite vs. casual)
- Train on properly classified corpora to avoid mixing registers inappropriately
- Provide culturally and contextually appropriate responses
- Handle code-switching and mixed-language content accurately
- Preserve stylistic consistency when translating or transforming text
- Filter training data based on formality, domain, or other characteristics
- Adapt output to match user preferences or requirements
LVTag provides the granular metadata needed to understand not just what language is being used, but how it's being used, enabling more nuanced and appropriate language processing pipelines.
language-x-[classifier]-[value]-[classifier2]-[value2]...
Where:
- `language` is a valid BCP 47 primary language subtag (e.g., `en`, `ko`, `ja`)
- `x` indicates the beginning of private-use subtags
- `classifier` is a category identifier (see Magic Tags below)
- `value` is the specific classification within that category
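The pattern above can be taken apart mechanically. The following is a minimal sketch, not a reference implementation: the function name `split_lvtag` is hypothetical, and it assumes that everything after `x` strictly alternates between classifier and value.

```python
def split_lvtag(tag):
    """Split an LVTag into its BCP 47 prefix and (classifier, value) pairs.

    Assumption: subtags after "x" alternate classifier, value, classifier, ...
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []                      # plain BCP 47 tag, no LVTag part
    x_index = lowered.index("x")
    base = "-".join(subtags[:x_index])      # empty for tags such as x-proto-ine
    rest = lowered[x_index + 1:]
    if not rest or len(rest) % 2 != 0:
        raise ValueError(f"classifier without a value in {tag!r}")
    return base, list(zip(rest[0::2], rest[1::2]))


base, pairs = split_lvtag("ko-x-form-4-polite-2-domain-business")
print(base, pairs)
```

The prefix before `x` is returned unchanged so that script subtags such as `Latn` keep their conventional casing.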
LVTag supports both long-form and short-form "magic" classifiers for flexibility:
| Long Form | Short Form | Description |
|---|---|---|
| `ortho` | `w` | Orthographic variant |
| `form` | `f` | Formality level (1-5 scale) |
| `polite` | `p` | Politeness/respect level (1-5 scale) |
| `domain` | `d` | Specialized vocabulary or professional context |
| `geo` | `g` | Geographic or regional variety |
| `proto` | `a` | Proto-language or reconstructed language |
| `hist` | `h` | Historical period or stage of a language |
| `genre` | `e` | Text genre or literary style |
| `medium` | `m` | Communication medium (spoken, written, digital) |
| `socio` | `s` | Sociolect or social group variety |
| `modality` | `o` | Mode of language production |
| `register` | `r` | Linguistic register |
| `pragma` | `u` | Communicative function |
| `temporal` | `t` | Temporal marking |
| `evidence` | `v` | Information source |
| `affect` | `k` | Emotional tone |
| `age` | `n` | Age/generation variety |
| `gender` | `i` | Gender variety |
| `expert` | `b` | Expertise level |
| `interact` | `2` | Interactional structure |
| `prosody` | `y` | Prosodic features |
| `lexical` | `l` | Lexical density (0-100) |
| `syntax` | `z` | Syntactic complexity (0-100) |
| `start` | `0` | Start date (ISO 8601 without punctuation) |
| `end` | `1` | End date (ISO 8601 without punctuation) |
| `taboo` | `j` | Taboo/vulgar content level (0-5 scale) |
| `conf` | `c` | Confidence score (0-100) for previous tag |
| - | `q`, `3`-`9` | Reserved for future use |
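Because every classifier has exactly one long and one short form, tags can be normalized to either style before comparison. The sketch below is illustrative only: `convert_classifiers` and the two dictionaries are hypothetical names, and the code assumes classifiers sit at even offsets after `x` so that values such as `0` or `2` are never mistaken for short codes.

```python
# Short-to-long mapping transcribed from the table above.
SHORT_TO_LONG = {
    "w": "ortho", "f": "form", "p": "polite", "d": "domain", "g": "geo",
    "a": "proto", "h": "hist", "e": "genre", "m": "medium", "s": "socio",
    "o": "modality", "r": "register", "u": "pragma", "t": "temporal",
    "v": "evidence", "k": "affect", "n": "age", "i": "gender", "b": "expert",
    "2": "interact", "y": "prosody", "l": "lexical", "z": "syntax",
    "0": "start", "1": "end", "j": "taboo", "c": "conf",
}
LONG_TO_SHORT = {long: short for short, long in SHORT_TO_LONG.items()}


def convert_classifiers(tag, table):
    """Rewrite classifier names using `table`, leaving values untouched."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag                          # nothing to convert
    start = lowered.index("x") + 1
    head = subtags[:start]                  # BCP 47 prefix plus the "x" marker
    tail = lowered[start:]
    # Classifiers are at even offsets; odd offsets hold their values.
    tail[0::2] = [table.get(name, name) for name in tail[0::2]]
    return "-".join(head + tail)


print(convert_classifiers("ko-x-f-4-p-2-d-business", SHORT_TO_LONG))
print(convert_classifiers("en-x-taboo-0", LONG_TO_SHORT))
```

Unknown classifier names pass through unchanged, which keeps the function usable with mixed long and short forms.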
Identifies specific orthographic conventions or writing system variants beyond standard script tags.
Format:
- Long: `language-x-ortho-[variant]`
- Short: `language-x-w-[variant]`
Examples (combined with standard script tags):
- `az-Latn-x-ortho-new` or `az-Latn-x-w-new` - Azerbaijani Latin script, new orthography
- `de-Latn-x-ortho-1901` or `de-Latn-x-w-1901` - German Latin script, 1901 orthography
- `zh-Hans-x-ortho-pinyin` or `zh-Hans-x-w-pinyin` - Simplified Chinese with Pinyin
- `yi-Hebr-x-ortho-yivo` or `yi-Hebr-x-w-yivo` - Yiddish Hebrew script, YIVO orthography
Identifies the formality level of language use.
Format:
- Long: `language-x-form-[1-5]`
- Short: `language-x-f-[1-5]`
Formality scale:
- 1 = Most formal (written documents, official speeches)
- 2 = Formal (business meetings, academic writing)
- 3 = Neutral/standard (news, general conversation)
- 4 = Informal (casual conversation, emails to friends)
- 5 = Most casual (intimate conversation, slang)
Examples:
- `ko-x-form-1` or `ko-x-f-1` - Most formal Korean
- `en-x-form-3` or `en-x-f-3` - Neutral English
- `ja-x-form-5` or `ja-x-f-5` - Most casual Japanese
Identifies the politeness/respect level of language use.
Format:
- Long: `language-x-polite-[1-5]`
- Short: `language-x-p-[1-5]`
Politeness scale:
- 1 = Most respectful/deferential (royal address, religious contexts)
- 2 = Very polite (formal honorifics, respectful speech)
- 3 = Polite/neutral (standard politeness)
- 4 = Familiar (among equals, friends)
- 5 = Intimate/plain (family, very close friends)
Examples:
- `ko-x-polite-1` or `ko-x-p-1` - Highest respect Korean
- `ja-x-polite-2` or `ja-x-p-2` - Very polite Japanese
- `th-x-polite-3` or `th-x-p-3` - Standard polite Thai
Identifies specialized vocabulary or professional context.
Format:
- Long: `language-x-domain-[domain_type]`
- Short: `language-x-d-[domain_type]`
Examples:
- `en-x-domain-legal` or `en-x-d-legal` - Legal English
- `ja-x-domain-med` or `ja-x-d-med` - Medical Japanese
- `ko-x-domain-business` or `ko-x-d-business` - Business Korean
- `ja-x-domain-tech` or `ja-x-d-tech` - Technical Japanese
- `en-x-domain-fin` or `en-x-d-fin` - Financial English
Identifies regional or geographic language varieties.
Format:
- Long: `language-x-geo-[region]`
- Short: `language-x-g-[region]`
Examples:
- `ko-x-geo-gyeong` or `ko-x-g-gyeong` - Gyeongsang Korean (경상도)
- `ko-x-geo-jeolla` or `ko-x-g-jeolla` - Jeolla Korean (전라도)
- `es-x-geo-riopla` or `es-x-g-riopla` - Rioplatense Spanish
- `pt-x-geo-nordeste` or `pt-x-g-nordeste` - Northeastern Brazilian Portuguese
Identifies proto-languages or reconstructed historical languages.
Format:
- Long: `x-proto-[iso639-5_code if available]`
- Short: `x-a-[iso639-5_code if available]`
Rules:
- MUST use ISO 639-5 language family codes when available
- Use descriptive identifiers only when no ISO 639-5 code exists
Examples using ISO 639-5 codes:
- `x-proto-ine` or `x-a-ine` - Proto-Indo-European
- `x-proto-gem` or `x-a-gem` - Proto-Germanic
- `x-proto-sla` or `x-a-sla` - Proto-Slavic
- `x-proto-sem` or `x-a-sem` - Proto-Semitic
- `x-proto-cel` or `x-a-cel` - Proto-Celtic
- `x-proto-ira` or `x-a-ira` - Proto-Iranian
- `x-proto-inc` or `x-a-inc` - Proto-Indo-Aryan
- `x-proto-bat` or `x-a-bat` - Proto-Baltic
- `x-proto-roa` or `x-a-roa` - Proto-Romance
- `x-proto-trk` or `x-a-trk` - Proto-Turkic
Examples without ISO 639-5 codes (descriptive, longer than three characters):
- `x-proto-baltslav` or `x-a-baltslav` - Proto-Balto-Slavic (no ISO 639-5 code)
Note:
- Language family codes (ISO 639-5) are NOT valid as standard primary BCP 47 language tags, which is why we have implemented them using `x-proto`
- They are valid and preferred within private-use extensions (after `x-`)
- Therefore all proto-language tags must start with `x-` to comply with BCP 47
Identifies historical periods or stages of a language.
Format:
- Long: `language-x-hist-[period]`
- Short: `language-x-h-[period]`
Examples:
- `en-x-hist-old` or `en-x-h-old` - Old English period
- `en-x-hist-middle` or `en-x-h-middle` - Middle English period
- `ja-x-hist-kobun` or `ja-x-h-kobun` - Classical Japanese (古文)
- `ko-x-hist-hunmin` or `ko-x-h-hunmin` - Middle Korean (훈민정음 period)
- `el-x-hist-koine` or `el-x-h-koine` - Koine Greek (Κοινή)
- `sa-x-hist-vedic` or `sa-x-h-vedic` - Vedic Sanskrit (वैदिक)
Identifies text genre or literary style.
Format:
- Long: `language-x-genre-[genre_type]`
- Short: `language-x-e-[genre_type]`
Examples:
- `en-x-genre-news` or `en-x-e-news` - News English
- `ja-x-genre-manga` or `ja-x-e-manga` - Manga Japanese (漫画)
- `ko-x-genre-webtoon` or `ko-x-e-webtoon` - Korean webtoon (웹툰)
- `zh-x-genre-shi` or `zh-x-e-shi` - Chinese poetry (詩)
- `fr-x-genre-bd` or `fr-x-e-bd` - French comics (bande dessinée)
- `de-x-genre-marchen` or `de-x-e-marchen` - German fairy tales (Märchen)
Identifies the communication medium.
Format:
- Long: `language-x-medium-[medium_type]`
- Short: `language-x-m-[medium_type]`
Examples:
- `en-x-medium-spoken` or `en-x-m-spoken` - Spoken English
- `ko-x-medium-digital` or `ko-x-m-digital` - Digital/online Korean
- `ja-x-medium-written` or `ja-x-m-written` - Written Japanese
- `hi-x-medium-bcast` or `hi-x-m-bcast` - Broadcast Hindi
- `zh-x-medium-sms` or `zh-x-m-sms` - SMS/text message Chinese
Identifies sociolect or social group varieties.
Format:
- Long: `language-x-socio-[social_group]`
- Short: `language-x-s-[social_group]`
Examples:
- `en-x-socio-academic` or `en-x-s-academic` - Academic sociolect
- `en-x-socio-urban` or `en-x-s-urban` - Urban sociolect
- `es-x-socio-juvenil` or `es-x-s-juvenil` - Spanish youth sociolect (jerga juvenil)
- `fr-x-socio-jeune` or `fr-x-s-jeune` - French youth sociolect
- `de-x-socio-jugend` or `de-x-s-jugend` - German youth sociolect (Jugendsprache)
- `ko-x-socio-online` or `ko-x-s-online` - Korean online sociolect
Identifies the fundamental mode of language production.
Format:
- Long: `language-x-modality-[mode]`
- Short: `language-x-o-[mode]`
Examples:
- `en-x-modality-spoken` or `en-x-o-spoken` - Spoken English
- `en-x-modality-written` or `en-x-o-written` - Written English
- `ase-x-modality-signed` or `ase-x-o-signed` - American Sign Language (ISO 639-3 code `ase`)
- `en-x-modality-multi` or `en-x-o-multi` - Multimodal English (speech + gestures)
- `fr-x-modality-tactile` or `fr-x-o-tactile` - Tactile French (for deafblind)
Identifies the linguistic register or functional variety of language use.
Format:
- Long: `language-x-register-[register_type]`
- Short: `language-x-r-[register_type]`
Examples:
- `en-x-register-frozen` or `en-x-r-frozen` - Frozen register (prayers, pledges)
- `en-x-register-formal` or `en-x-r-formal` - Formal register (academic papers)
- `en-x-register-consult` or `en-x-r-consult` - Consultative register (professional)
- `en-x-register-casual` or `en-x-r-casual` - Casual register (friends)
- `en-x-register-intimate` or `en-x-r-intimate` - Intimate register (family)
Identifies the communicative function or speech act.
Format:
- Long: `language-x-pragma-[function]`
- Short: `language-x-u-[function]`
Examples:
- `en-x-pragma-request` or `en-x-u-request` - Request function
- `ja-x-pragma-apology` or `ja-x-u-apology` - Apology function
- `es-x-pragma-complmnt` or `es-x-u-complmnt` - Compliment function
- `ar-x-pragma-greeting` or `ar-x-u-greeting` - Greeting function
- `zh-x-pragma-refusal` or `zh-x-u-refusal` - Refusal function
Identifies temporal aspects or tense usage patterns.
Format:
- Long: `language-x-temporal-[aspect]`
- Short: `language-x-t-[aspect]`
Examples:
- `en-x-temporal-past` or `en-x-t-past` - Past-oriented discourse
- `ja-x-temporal-nonpast` or `ja-x-t-nonpast` - Non-past focus
- `id-x-temporal-atemprl` or `id-x-t-atemprl` - Timeless/atemporal
- `fr-x-temporal-future` or `fr-x-t-future` - Future-oriented
- `zh-x-temporal-aspect` or `zh-x-t-aspect` - Aspectual focus
Identifies information source marking.
Format:
- Long: `language-x-evidence-[source]`
- Short: `language-x-v-[source]`
Examples:
- `qu-x-evidence-direct` or `qu-x-v-direct` - Direct witness
- `tr-x-evidence-hearsay` or `tr-x-v-hearsay` - Hearsay/reported
- `ja-x-evidence-infer` or `ja-x-v-infer` - Inferential
- `en-x-evidence-assume` or `en-x-v-assume` - Assumed
- `de-x-evidence-quote` or `de-x-v-quote` - Quotative
Identifies emotional tone or affect.
Format:
- Long: `language-x-affect-[emotion]`
- Short: `language-x-k-[emotion]`
Examples:
- `en-x-affect-angry` or `en-x-k-angry` - Angry tone
- `ja-x-affect-humble` or `ja-x-k-humble` - Humble affect
- `es-x-affect-joyful` or `es-x-k-joyful` - Joyful expression
- `ko-x-affect-sad` or `ko-x-k-sad` - Sad/melancholic
- `fr-x-affect-neutral` or `fr-x-k-neutral` - Neutral affect
Identifies age-related or generational language varieties.
Format:
- Long: `language-x-age-[generation]`
- Short: `language-x-n-[generation]`
Examples:
- `en-x-age-child` or `en-x-n-child` - Child speech
- `ja-x-age-teen` or `ja-x-n-teen` - Teenager language
- `ko-x-age-elder` or `ko-x-n-elder` - Elder speech
- `es-x-age-genz` or `es-x-n-genz` - Generation Z
- `zh-x-age-millenl` or `zh-x-n-millenl` - Millennial speech
Identifies gender-related language varieties.
Format:
- Long: `language-x-gender-[identity]`
- Short: `language-x-i-[identity]`
Examples: (Examples removed)
Identifies level of domain expertise on a 0-10 scale.
Format:
- Long: `language-x-expert-[0-10]`
- Short: `language-x-b-[0-10]`
Expertise scale:
- 0 = No knowledge
- 1-2 = Beginner
- 3-4 = Intermediate
- 5-6 = Advanced
- 7-8 = Expert
- 9-10 = Master/Authority
Examples:
- `en-x-expert-0` or `en-x-b-0` - No expertise
- `de-x-expert-3` or `de-x-b-3` - Intermediate level
- `ja-x-expert-7` or `ja-x-b-7` - Expert level
- `es-x-expert-9` or `es-x-b-9` - Master level
- `zh-x-expert-5` or `zh-x-b-5` - Advanced level
Identifies conversational or interactional patterns.
Format:
- Long: `language-x-interact-[structure]`
- Short: `language-x-2-[structure]`
Examples:
- `en-x-interact-turn` or `en-x-2-turn` - Turn-taking
- `ja-x-interact-overlap` or `ja-x-2-overlap` - Overlapping speech
- `es-x-interact-monolog` or `es-x-2-monolog` - Monologic
- `ar-x-interact-dialog` or `ar-x-2-dialog` - Dialogic
- `zh-x-interact-multi` or `zh-x-2-multi` - Multi-party
Identifies prosodic or suprasegmental features.
Format:
- Long: `language-x-prosody-[feature]`
- Short: `language-x-y-[feature]`
Examples:
- `en-x-prosody-stress` or `en-x-y-stress` - Stress-timed
- `ja-x-prosody-pitch` or `ja-x-y-pitch` - Pitch-accent
- `fr-x-prosody-syllable` or `fr-x-y-syllable` - Syllable-timed
- `zh-x-prosody-tone` or `zh-x-y-tone` - Tonal patterns
- `es-x-prosody-rhythm` or `es-x-y-rhythm` - Rhythmic patterns
Identifies lexical density as a numeric value (0-100).
Format:
- Long: `language-x-lexical-[0-100]`
- Short: `language-x-l-[0-100]`
Examples:
- `en-x-lexical-20` or `en-x-l-20` - Very low density (20%)
- `de-x-lexical-55` or `de-x-l-55` - Moderate density (55%)
- `ja-x-lexical-75` or `ja-x-l-75` - High density (75%)
- `es-x-lexical-40` or `es-x-l-40` - Low density (40%)
- `zh-x-lexical-85` or `zh-x-l-85` - Very high density (85%)
Identifies syntactic complexity as a numeric value (0-100).
Format:
- Long: `language-x-syntax-[0-100]`
- Short: `language-x-z-[0-100]`
Examples:
- `en-x-syntax-15` or `en-x-z-15` - Very simple syntax (15%)
- `de-x-syntax-70` or `de-x-z-70` - Complex syntax (70%)
- `ja-x-syntax-45` or `ja-x-z-45` - Moderate complexity (45%)
- `es-x-syntax-30` or `es-x-z-30` - Simple syntax (30%)
- `zh-x-syntax-60` or `zh-x-z-60` - Moderate complexity (60%)
Identifies the start date of language use (ISO 8601 format without punctuation).
Format:
- Long: `language-x-start-[YYYYMMDD]`
- Short: `language-x-0-[YYYYMMDD]`
Date formats:
- Full date: YYYYMMDD
- Year-month: YYYYMM
- Year only: YYYY
Examples:
- `en-x-start-20240315` or `en-x-0-20240315` - English starting March 15, 2024
- `ja-x-start-19890108` or `ja-x-0-19890108` - Japanese starting January 8, 1989
- `es-x-start-202403` or `es-x-0-202403` - Spanish starting March 2024
Identifies the end date of language use (ISO 8601 format without punctuation).
Format:
- Long: `language-x-end-[YYYYMMDD]`
- Short: `language-x-1-[YYYYMMDD]`
Date formats:
- Full date: YYYYMMDD
- Year-month: YYYYMM
- Year only: YYYY
Examples:
- `en-x-end-20240415` or `en-x-1-20240415` - English ending April 15, 2024
- `ja-x-end-20190430` or `ja-x-1-20190430` - Japanese ending April 30, 2019
- `es-x-end-202412` or `es-x-1-202412` - Spanish ending December 2024
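The three date precisions used by the start and end classifiers can be told apart by length alone. The following sketch shows one way to read such a value; `parse_date_value` is a hypothetical helper, and using `datetime.date` for the calendar check is a choice made here, not something the specification prescribes.

```python
from datetime import date


def parse_date_value(value):
    """Parse a YYYY, YYYYMM, or YYYYMMDD value into (year, month, day).

    Missing components are returned as None. Raises ValueError for
    malformed input or impossible calendar dates.
    """
    if not (value.isascii() and value.isdigit()) or len(value) not in (4, 6, 8):
        raise ValueError(f"expected YYYY, YYYYMM, or YYYYMMDD, got {value!r}")
    year = int(value[:4])
    month = int(value[4:6]) if len(value) >= 6 else None
    day = int(value[6:8]) if len(value) == 8 else None
    # Let the standard library reject values such as month 13 or February 30.
    date(year, month or 1, day or 1)
    return year, month, day


print(parse_date_value("20240315"))
print(parse_date_value("202403"))
print(parse_date_value("1989"))
```

All three forms fit within the 8-character subtag limit, which is why the punctuation-free ISO 8601 basic format works here.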
Identifies level of taboo, vulgar, or offensive content.
Format:
- Long: `language-x-taboo-[0-5]`
- Short: `language-x-j-[0-5]`
Examples:
- `en-x-taboo-0` or `en-x-j-0` - No taboo content
- `en-x-taboo-3` or `en-x-j-3` - Moderate taboo level
- `ja-x-form-5-taboo-4` or `ja-x-f-5-j-4` - Very casual Japanese with high taboo level
Indicates confidence score for the immediately preceding classifier.
Format:
- Long: `language-x-[classifier]-[value]-conf-[0-100]`
- Short: `language-x-[classifier]-[value]-c-[0-100]`
Special behavior:
- The confidence score applies to the classifier immediately before it
- Multiple confidence scores can be used for different classifiers
- If no classifier precedes it, the confidence applies to the base language tag
Examples:
- `en-x-form-3-conf-95` or `en-x-f-3-c-95` - Neutral formality with 95% confidence
- `ko-x-polite-2-conf-80-domain-med-conf-60` or `ko-x-p-2-c-80-d-med-c-60` - Very polite (80% confidence) medical Korean (60% confidence)
- `ja-x-hist-kobun-conf-100` or `ja-x-h-kobun-c-100` - Classical Japanese with 100% confidence
- `x-proto-ine-conf-75` or `x-a-ine-c-75` - Proto-Indo-European with 75% confidence
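The "applies to the preceding classifier" rule is positional, so it can be resolved in a single left-to-right pass. The sketch below illustrates that reading of the rule; the function name `attach_confidence` and the returned structure are assumptions made for this example.

```python
def attach_confidence(tag):
    """Return (base, base_confidence, entries) for an LVTag.

    Each entry is a dict with "classifier", "value", and "confidence".
    A conf/c pair with no classifier before it sets base_confidence.
    """
    subtags = tag.lower().split("-")
    x_index = subtags.index("x")            # raises ValueError if "x" is absent
    base = "-".join(subtags[:x_index])
    rest = subtags[x_index + 1:]
    base_confidence = None
    entries = []
    for name, value in zip(rest[0::2], rest[1::2]):
        if name in ("conf", "c"):
            score = int(value)
            if not 0 <= score <= 100:
                raise ValueError(f"confidence {score} outside 0-100")
            if entries:
                entries[-1]["confidence"] = score   # bind to preceding classifier
            else:
                base_confidence = score             # nothing precedes it
        else:
            entries.append({"classifier": name, "value": value, "confidence": None})
    return base, base_confidence, entries


print(attach_confidence("ko-x-p-2-c-80-d-med-c-60"))
```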
LVTag supports multiple classifiers in a single tag to provide precise language identification. Both long and short forms can be mixed:
ko-x-form-4-domain-business
ko-x-f-4-d-business
ko-x-form-4-polite-2-domain-business
ko-x-f-4-p-2-d-business
The first pair of examples tags Korean with informal formality (4) in a business context; the second pair adds very polite speech (2). Each pair shows the same tag in long and short form.
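Combined tags like these can also be assembled programmatically. This is a minimal sketch under stated assumptions: `build_lvtag` is a hypothetical helper that accepts classifier names in either form, and it checks only the BCP 47 constraint that each private-use subtag is 1 to 8 ASCII alphanumeric characters.

```python
def build_lvtag(base, pairs):
    """Join a BCP 47 prefix and (classifier, value) pairs into one tag.

    `base` may be empty for private-use-only tags such as x-proto-ine.
    """
    parts = [base] if base else []
    parts.append("x")
    for classifier, value in pairs:
        for subtag in (classifier, str(value)):
            ok = 1 <= len(subtag) <= 8 and subtag.isascii() and subtag.isalnum()
            if not ok:
                raise ValueError(f"invalid private-use subtag {subtag!r}")
            parts.append(subtag.lower())
    return "-".join(parts)


# Long and short classifier names can be mixed freely.
print(build_lvtag("ko", [("form", 4), ("p", 2), ("domain", "business")]))
```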
Note: All values must be 8 characters or shorter to comply with BCP 47 subtag length restrictions. While specific values for many classifiers are to be established through expert usage and community consensus, the numeric scales, date formats, and basic values listed below are defined in this standard.
| Level | Description | Examples |
|---|---|---|
| 1 | Most formal | Legal documents, official ceremonies, academic papers |
| 2 | Formal | Business letters, news articles, presentations |
| 3 | Neutral | Standard conversation, email, general writing |
| 4 | Informal | Casual conversation, personal blogs, text messages |
| 5 | Most casual | Slang, intimate conversation, social media |
| Level | Description | Examples |
|---|---|---|
| 1 | Most respectful | Royal address, religious leaders, elderly respect |
| 2 | Very polite | Customer service, formal meetings, teachers |
| 3 | Polite/neutral | Standard interactions, colleagues |
| 4 | Familiar | Friends, peers, casual acquaintances |
| 5 | Intimate/plain | Close family, intimate partners |
| Level | Description |
|---|---|
| 0 | No knowledge |
| 1-2 | Beginner |
| 3-4 | Intermediate |
| 5-6 | Advanced |
| 7-8 | Expert |
| 9-10 | Master/Authority |
| Level | Description |
|---|---|
| 0 | No taboo content |
| 1 | Mild taboo |
| 2 | Light taboo |
| 3 | Moderate taboo |
| 4 | High taboo |
| 5 | Extreme taboo |
| Level | Description |
|---|---|
| 0-20 | Very low density |
| 21-40 | Low density |
| 41-60 | Moderate density |
| 61-80 | High density |
| 81-100 | Very high density |
| Level | Description |
|---|---|
| 0-20 | Very simple |
| 21-40 | Simple |
| 41-60 | Moderate complexity |
| 61-80 | Complex |
| 81-100 | Very complex |
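The lexical density and syntactic complexity tables share the same five 20-point bands, so a score can be mapped to its description with one lookup. The sketch below assumes integer scores; `band_label` and the list names are hypothetical.

```python
from bisect import bisect_left

# Upper bound of each band, shared by the two 0-100 scales above.
BAND_UPPER_BOUNDS = [20, 40, 60, 80, 100]
LEXICAL_LABELS = ["Very low density", "Low density", "Moderate density",
                  "High density", "Very high density"]
SYNTAX_LABELS = ["Very simple", "Simple", "Moderate complexity",
                 "Complex", "Very complex"]


def band_label(score, labels):
    """Return the band description for an integer score in 0-100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} outside 0-100")
    # bisect_left finds the first band whose upper bound is >= score.
    return labels[bisect_left(BAND_UPPER_BOUNDS, score)]


print(band_label(55, LEXICAL_LABELS))
print(band_label(70, SYNTAX_LABELS))
```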
| Value | Description |
|---|---|
| `legal` | Legal terminology |
| `med` | Medical terminology |
| `tech` | Technical/IT |
| `business` | Business/corporate |
| `fin` | Finance/banking |
| `acad` | Academic/scholarly |
| `sci` | Scientific/research |
# Most formal Korean
ko-x-form-1
# Very polite Japanese
ja-x-polite-2
# Legal English
en-x-domain-legal
# Gyeongsang Korean
ko-x-geo-gyeong
# Proto-Indo-European
x-proto-ine
# Most formal Korean
ko-x-f-1
# Very polite Japanese
ja-x-p-2
# Legal English
en-x-d-legal
# Gyeongsang Korean
ko-x-g-gyeong
# Proto-Indo-European
x-a-ine
# Informal but polite Korean business language
ko-x-form-4-polite-2-domain-business
ko-x-f-4-p-2-d-business
# Formal and respectful Japanese medical language
ja-x-form-1-polite-1-domain-med
ja-x-f-1-p-1-d-med
# Southern Vietnamese with neutral formality, polite speech, technical domain
vi-x-geo-southern-form-3-polite-2-domain-tech
vi-x-g-southern-f-3-p-2-d-tech
# Complex classification with multiple dimensions
en-x-h-middle-e-poetry-m-written-f-1
ja-x-f-2-p-1-d-med-h-kobun-m-written
# Language varieties showing formality/politeness distinction
ko-x-f-5-p-2 # Very casual but polite (to older friend)
ko-x-f-1-p-4 # Very formal but familiar (written to peer)
ja-x-f-4-p-1 # Casual formality but highest respect
en-x-f-5-j-4 # Very casual English with high taboo level
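The last group of examples relies on formality and politeness being independent scales. The sketch below makes that explicit by decoding each scale separately; `describe_register` is a hypothetical helper, and the labels are copied from the formality and politeness tables.

```python
FORMALITY = {1: "most formal", 2: "formal", 3: "neutral",
             4: "informal", 5: "most casual"}
POLITENESS = {1: "most respectful", 2: "very polite", 3: "polite/neutral",
              4: "familiar", 5: "intimate/plain"}


def describe_register(tag):
    """Return readable labels for the form/f and polite/p classifiers."""
    subtags = tag.lower().split("-")
    if "x" not in subtags:
        return {}
    rest = subtags[subtags.index("x") + 1:]
    labels = {}
    for name, value in zip(rest[0::2], rest[1::2]):
        if name in ("form", "f"):
            labels["formality"] = FORMALITY[int(value)]
        elif name in ("polite", "p"):
            labels["politeness"] = POLITENESS[int(value)]
    return labels


for tag in ("ko-x-f-5-p-2", "ko-x-f-1-p-4", "ja-x-f-4-p-1"):
    print(tag, describe_register(tag))
```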
- Language Learning Applications
  - Teach appropriate register for different social contexts
  - Provide domain-specific vocabulary training
- Machine Translation
  - Maintain register consistency in translations
  - Apply domain-specific terminology
- Content Classification
  - Automatically categorize text by formality and domain
  - Route content to appropriate reviewers or systems
- Corpus Linguistics
  - Build tagged corpora for linguistic research
  - Study register and domain variation
- Subtag Length: Each subtag after `x-` must be 8 characters or fewer
- Order: Classifiers can appear in any order after `x-`
- Uniqueness: Each classifier type should appear only once per tag (except `conf`, which can appear multiple times)
- Case: Tags should be lowercase (case-insensitive per BCP 47)
- Magic Tags: Short form tags are single characters; `q` and `3`-`9` are reserved for future use
- Mixing: Long and short forms can be mixed within the same tag
- Proto Tags: Must start with `x-` and SHOULD use ISO 639-5 codes when available (e.g., `x-proto-sla` not `x-proto-slavic`)
- Confidence: The `conf`/`c` classifier applies to the immediately preceding classifier
- Numeric Values: Must be within defined ranges (1-5 for formality and politeness, 0-5 for taboo, 0-10 for expertise, 0-100 for percentage values)
- Date Format: Dates use ISO 8601 without punctuation (YYYY, YYYYMM, or YYYYMMDD)
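Several of these rules can be checked mechanically. The following sketch covers subtag length, uniqueness, and numeric ranges only; `validate_lvtag` and its error messages are hypothetical, and for brevity it unifies long and short names only for the numeric classifiers, so a full checker would need the complete classifier table.

```python
# Short-to-long aliases for the numeric classifiers only (a simplification).
ALIASES = {"f": "form", "p": "polite", "j": "taboo", "b": "expert",
           "l": "lexical", "z": "syntax", "c": "conf"}
RANGES = {"form": (1, 5), "polite": (1, 5), "taboo": (0, 5),
          "expert": (0, 10), "lexical": (0, 100), "syntax": (0, 100),
          "conf": (0, 100)}


def validate_lvtag(tag):
    """Return a list of rule violations; an empty list means none were found."""
    subtags = tag.lower().split("-")
    if "x" not in subtags:
        return ["missing private-use marker x"]
    rest = subtags[subtags.index("x") + 1:]
    errors = []
    if len(rest) % 2 != 0:
        errors.append("classifier without a value")
    for subtag in rest:
        if not 1 <= len(subtag) <= 8:
            errors.append(f"subtag {subtag!r} is not 1-8 characters")
    seen = set()
    for name, value in zip(rest[0::2], rest[1::2]):
        canonical = ALIASES.get(name, name)
        if canonical != "conf":             # conf may repeat
            if canonical in seen:
                errors.append(f"duplicate classifier {canonical}")
            seen.add(canonical)
        if canonical in RANGES:
            low, high = RANGES[canonical]
            if not value.isdigit() or not low <= int(value) <= high:
                errors.append(f"{canonical} value {value} outside {low}-{high}")
    return errors


print(validate_lvtag("ko-x-f-4-p-2-d-business"))
print(validate_lvtag("en-x-form-7-f-3"))
```

Returning a list instead of raising on the first problem lets a caller report every violation in one pass.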
LVTag format is fully compatible with:
- BCP 47 (RFC 5646)
- ISO 639 language codes
- IANA Language Subtag Registry
- Unicode CLDR
- Precision: Enables fine-grained language variety identification
- Extensibility: New registers and domains can be added
- Standards-based: Built on established BCP 47 private-use mechanism
- Machine-readable: Systematic format enables automated processing
- Human-readable: Clear, descriptive subtags
- Flexibility: Support for both verbose long-form and concise short-form tags
- Brevity: Short magic tags enable compact representation while maintaining clarity
LVTag is designed to evolve with the needs of the language technology community. We welcome suggestions for new classifiers, improvements to existing ones, and real-world implementation feedback.
To propose extensions or contribute to the specification:
- Open an issue at github.com/lvtag/spec
- Join the discussion on existing proposals
- Share your implementation experiences
- Submit pull requests for documentation improvements
Reserved single-character codes (q, 3-9) are available for future standardized extensions.
This specification is released under the CC0 1.0 Universal (Public Domain Dedication).
Why CC0: To ensure maximum adoption and implementation freedom, LVTag is placed in the public domain. This means:
- No permission needed to use, implement, or modify
- No attribution required (though appreciated)
- No legal barriers for commercial or governmental use
- Compatible with all software licenses
- Used by major standards like Unicode CLDR
Patent Grant: Any patents covering the LVTag specification are hereby licensed royalty-free for any implementation that complies with this specification.
No Endorsement: Use of LVTag does not imply endorsement by the specification authors.
To the extent possible under law, Danslav Slavenskoj has waived all copyright and related or neighboring rights to the Language Variant Tag (LVTag) Format Specification. This work is published from: United States of America.