Add Cangjie lexer #3108

Open
cy6erGn0m wants to merge 1 commit into pygments:master from cy6erGn0m:cangjie
@cy6erGn0m commented Apr 27, 2026

Summary

This PR adds a lexer for the Cangjie programming language, a modern general-purpose programming language developed by Huawei.

About Cangjie

Cangjie is a high-level, statically typed, multi-paradigm programming language featuring:

  • Strong type inference
  • Pattern matching with guards
  • First-class lambdas and closures
  • Advanced macro system with quote and imperative macros
  • FFI support for C, Java and ObjC
  • Modern syntax inspired by Swift, Rust, Kotlin, and Python

The language was first released in June 2024 and was open-sourced in July 2025. It supports multiple platforms including HarmonyOS, Linux, Windows, macOS, Android, and iOS.

Official resources:

Lexer Implementation

The lexer uses ExtendedRegexLexer to handle complex syntax features like multi-hash raw strings. Key features covered:

| Category | Features |
| --- | --- |
| Core syntax | Packages, imports, variables (let/var/const), functions |
| Type definitions | struct, class, interface, enum, extend, prop, type |
| Control flow | if/else, while, for-in, do-while, match/case, try/catch/finally |
| Literals | Numeric types with suffixes (i32, f64, etc.), hex/octal/binary, runes |
| Strings | Regular strings, triple-quoted strings, raw strings (#"..."#, ##"..."##, etc.), interpolation (${...}) |
| Operators | Arithmetic, comparison, logical, bitwise, pipeline (` |
| FFI | foreign func, unsafe blocks, CPointer<T> |
| Metaprogramming | macro package, quote(...), $ interpolation, @expand, @Tuple, @Tokens |
| Annotations | @Test, @Benchmark, @Override, @Deprecated, @Available, custom annotations with complex arguments |

Technical highlights

  • Multi-hash raw strings: Uses a callback-based approach with LexerContext to efficiently match closing delimiters (#"..."#, ##"..."##, ###"..."###, etc.) with O(n) complexity instead of O(n²) backtracking
  • Annotation arguments: Handles arbitrary tokens inside @Annotation[...] including operators, lambdas { => }, match expressions, and nested brackets with proper balancing
  • String interpolation: Supports nested expressions including nested interpolations and raw string literals
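To illustrate the multi-hash raw string point, here is a minimal standard-library sketch of the delimiter-matching idea: count the opening `#` run once, build the exact closing delimiter, and locate it with a single forward search rather than letting a regex backtrack over every candidate. The function name `scan_raw_string` is hypothetical and this is not the PR's code; in the actual lexer this logic would presumably live inside an ExtendedRegexLexer callback that advances the LexerContext position. It also assumes raw strings are always delimited by double quotes, as in the examples above.

```python
def scan_raw_string(text: str, start: int = 0) -> int:
    """Return the index just past the raw string that begins at `start`.

    Returns -1 if `start` does not begin a raw string or the string
    is unterminated. (Hypothetical helper, for illustration only.)
    """
    n = len(text)
    i = start
    # Count the opening run of '#' characters.
    while i < n and text[i] == "#":
        i += 1
    hashes = i - start
    # A raw string needs at least one '#' followed by an opening quote.
    if hashes == 0 or i >= n or text[i] != '"':
        return -1
    # The closing delimiter is a quote followed by the same number of '#'.
    closer = '"' + "#" * hashes
    end = text.find(closer, i + 1)
    if end == -1:
        return -1  # unterminated raw string
    return end + len(closer)
```

For example, given `##"a"#b"## rest`, the inner `"#` does not end the string because the closer must carry two hashes, so the scan stops only after `"##`.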

Testing

The lexer includes 12 example files that cover all major language features.

Manual testing:

  • Validated against 500 stdlib files with zero parsing errors
  • Validated against 94,121 test suite files (LLT + HLT) from the official Cangjie compiler tests with zero errors

@cy6erGn0m cy6erGn0m marked this pull request as draft April 28, 2026 10:09
@cy6erGn0m cy6erGn0m marked this pull request as ready for review May 25, 2026 11:20
@cy6erGn0m (Author)

Tested manually with various examples using LaTeX and minted; it seems to work well.

Implements an ExtendedRegexLexer for the Cangjie programming language
with 20+ states covering annotations, quotes, generics, strings, and
expressions.

Key features:
- Contextual keywords (abstract/open/sealed/override/redef/internal)
  highlighted as keywords before declarations, as identifiers elsewhere
- Triple-quoted strings, string interpolation (${expr}), raw strings
- Macro quote expressions with $() interpolation
- Generic parameters <T>, inheritance with <: operator
- VArray with $ prefix on size, number literals with underscores

12 test files with golden output covering all major syntax categories.
0 Error tokens across 500 stdlib files.
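
The contextual-keyword rule from the commit message (modifiers highlighted as keywords only before declarations) can be sketched with a plain regex check on what follows the word. This is an illustrative approximation, not the lexer's implementation: the helper name `classify_modifiers` and the `DECL_KEYWORDS` set are assumptions, and the PR's actual rules may differ.

```python
import re

# Contextual modifiers listed in the commit message.
MODIFIERS = ("abstract", "open", "sealed", "override", "redef", "internal")
# Assumed set of declaration keywords (hypothetical, for illustration).
DECL_KEYWORDS = ("class", "interface", "struct", "enum", "func", "prop")

_MODIFIER = re.compile(r"\b(?:%s)\b" % "|".join(MODIFIERS))
# After the modifier: optional further modifiers, then a declaration keyword.
_FOLLOWED_BY_DECL = re.compile(
    r"(?:\s+(?:%s))*\s+(?:%s)\b" % ("|".join(MODIFIERS), "|".join(DECL_KEYWORDS))
)


def classify_modifiers(source: str):
    """Return (word, kind) pairs for each modifier word found in `source`.

    `kind` is "keyword" when the word precedes a declaration and "name"
    when it is used as an ordinary identifier.
    """
    result = []
    for m in _MODIFIER.finditer(source):
        is_keyword = _FOLLOWED_BY_DECL.match(source, m.end()) is not None
        result.append((m.group(), "keyword" if is_keyword else "name"))
    return result
```

With these assumptions, `open` in `open class A {}` is classified as a keyword, while `open` in `let open = 1` is classified as a name.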