dahlia/gukhanmun

Gukhanmun


Also available: 韓國語 (Korean).

Gukhanmun converts Korean text written in mixed script (國漢文混用體, hanja characters interleaved with hangul) into hangul-only text. It is the successor to Seonbi, narrowed to the hanja-to-hangul conversion pipeline and extended along several axes: streaming I/O, pluggable dictionaries, lattice-based segmentation, and a wider range of output formats. The library is implemented in Rust and exposed as a Rust library, a command-line tool, a WebAssembly package, and a native Node-API addon.

Features

  • Lattice segmentation finds the best split rather than greedily taking the longest match. 行事場所 segments as 行事 + 場所, not 行事場 + 所.
  • Pluggable dictionaries: in-memory map, mmap-friendly FST files, or CDB files, composable via ChainDictionary.
  • The bundled Standard Korean Language Dictionary (標準國語大辭典) and Open Korean Dictionary (우리말샘) ship as compiled FST/CDB, so there is nothing extra to download.
  • Format adapters for plain text, HTML fragments, and Markdown. The engine is format-neutral; adapters handle parsing and serialization.
  • Five rendering modes: hangul-only, hangul(hanja) parentheses, hanja(hangul) parentheses, ruby markup, and original mixed script with selective glossing.
  • Streaming-first: the engine buffers only within a single hanja conversion span, not the whole document.
  • Initial sound law (頭音法則) for South Korean orthography, applied to fallback readings. Dictionary entries encode the correct reading already.
  • The core crate (gukhanmun-core) is no_std + alloc, suitable for embedded targets.
  • JavaScript and TypeScript bindings ship in two flavours: a WebAssembly package that runs in browsers, Deno, Node.js, Bun, and edge runtimes, and a native Node-API addon for higher server-side throughput.
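The difference between greedy longest-match and lattice segmentation in the 行事場所 example can be pictured with a small standalone sketch. It does not use gukhanmun's API and is not its actual algorithm: the toy dictionary (including 行事場 as an entry), the function names, and the cost model (1 per dictionary word, 3 per unknown single character) are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Greedy longest match: at each position take the longest dictionary word,
/// falling back to a single character when nothing matches.
fn greedy(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut end = i + 1; // single-character fallback
        for j in ((i + 1)..=chars.len()).rev() {
            let cand: String = chars[i..j].iter().collect();
            if dict.contains(cand.as_str()) {
                end = j;
                break;
            }
        }
        out.push(chars[i..end].iter().collect());
        i = end;
    }
    out
}

/// Lattice segmentation: consider every split and keep the cheapest path.
/// Assumed costs: dictionary word = 1, unknown single character = 3.
fn lattice(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[j] = (lowest cost covering chars[..j], start index of last token)
    let mut best: Vec<Option<(u32, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));
    for j in 1..=n {
        for i in 0..j {
            let prev = match best[i] {
                Some((cost, _)) => cost,
                None => continue,
            };
            let cand: String = chars[i..j].iter().collect();
            let cost = if dict.contains(cand.as_str()) {
                1
            } else if j - i == 1 {
                3
            } else {
                continue; // multi-character spans must be dictionary words
            };
            let total = prev + cost;
            if best[j].map_or(true, |(c, _)| total < c) {
                best[j] = Some((total, i));
            }
        }
    }
    // Follow the back pointers from the end of the text to the start.
    let mut out = Vec::new();
    let mut j = n;
    while j > 0 {
        let (_, i) = best[j].expect("single characters make every prefix reachable");
        out.push(chars[i..j].iter().collect::<String>());
        j = i;
    }
    out.reverse();
    out
}
```

With a toy dictionary containing 行事, 場所, and 行事場, `greedy` yields 行事場 + 所 because it commits to the longest prefix first, while `lattice` yields 行事 + 場所 because two dictionary words cost less than one word plus an unknown character.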

Installation

Command-line tool

Via mise

If you use mise, install a prebuilt binary with a single command:

mise use -g aqua:dahlia/gukhanmun

The -g flag installs it globally. Omit it to activate the tool only in the current project directory.

Via Windows Package Manager (winget)

On Windows, install via winget:

winget install HongMinhee.Gukhanmun

From crates.io

If you have a Rust toolchain installed, install from crates.io:

cargo install gukhanmun-cli gukhanmun-mkdict

This compiles the binaries and places them in ~/.cargo/bin/. Make sure that directory is on your PATH.

Prebuilt binaries

Prebuilt binaries for Linux (x86_64, aarch64), macOS (x86_64, aarch64), and Windows (x86_64) are attached to each release on GitHub:

https://github.com/dahlia/gukhanmun/releases

Download the archive for your platform, extract it, and place the gukhanmun and gukhanmun-mkdict executables somewhere on your PATH.

Rust library

Add the core crate to your project's Cargo.toml:

cargo add gukhanmun-core

Optionally add format adapters and dictionary backends:

cargo add gukhanmun-html gukhanmun-markdown
cargo add gukhanmun-stdict gukhanmun-opendict
cargo add gukhanmun-fst gukhanmun-cdb

JavaScript/TypeScript library

Install the WebAssembly package for most JavaScript environments:

npm  add       @gukhanmun/wasm @gukhanmun/stdict-fst
pnpm add       @gukhanmun/wasm @gukhanmun/stdict-fst
yarn add       @gukhanmun/wasm @gukhanmun/stdict-fst
bun  add       @gukhanmun/wasm @gukhanmun/stdict-fst
deno add --jsr @gukhanmun/wasm @gukhanmun/stdict-fst

If you need better server-side performance and don't mind a native dependency, install the Node-API package instead:

npm  add     @gukhanmun/napi     @gukhanmun/stdict-fst
pnpm add     @gukhanmun/napi     @gukhanmun/stdict-fst
yarn add     @gukhanmun/napi     @gukhanmun/stdict-fst
bun  add     @gukhanmun/napi     @gukhanmun/stdict-fst
deno add npm:@gukhanmun/napi jsr:@gukhanmun/stdict-fst

Quick start

echo "漢字 北京 標識" | gukhanmun
# → 한자 베이징 표지

echo "漢字" | gukhanmun --rendering hangul-hanja-parens
# → 한자(漢字)

echo "<p>漢字</p>" | gukhanmun --format text/html
# → <p>한자</p>

The Rust library exposes the same conversion:

use gukhanmun_core::{MapDictionary, RenderMode, convert_plain_text};

let mut dict = MapDictionary::new();
dict.insert("漢字", "한자");

let output = convert_plain_text("漢字", &dict, RenderMode::HangulOnly);
assert_eq!(output, "한자");

For the full guide, including HTML/Markdown adapters, rendering modes, presets, and the JavaScript API, visit https://gukhanmun.org/.
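
To make the rendering modes from the feature list concrete, here is a rough standalone sketch of how one converted span might be serialized in four of them. It does not call gukhanmun. The `Rendering` enum and `render` function are hypothetical names for this illustration, and the exact ruby markup is an assumption; only the 한자(漢字) shape is confirmed by the CLI example above. The fifth mode (original mixed script with selective glossing) is left out because it depends on per-word decisions this sketch does not model.

```rust
/// Hypothetical rendering modes for this sketch. These are not the variant
/// names of gukhanmun-core's `RenderMode`.
#[derive(Clone, Copy, Debug)]
enum Rendering {
    HangulOnly,
    HangulHanjaParens,
    HanjaHangulParens,
    Ruby,
}

/// Serialize one converted span, given its original hanja and its reading.
fn render(hanja: &str, hangul: &str, mode: Rendering) -> String {
    match mode {
        // Drop the hanja entirely.
        Rendering::HangulOnly => hangul.to_string(),
        // Reading first, original in parentheses.
        Rendering::HangulHanjaParens => format!("{}({})", hangul, hanja),
        // Original first, reading in parentheses.
        Rendering::HanjaHangulParens => format!("{}({})", hanja, hangul),
        // HTML ruby annotation; <rp> gives a parenthesized fallback
        // for renderers without ruby support (assumed markup).
        Rendering::Ruby => format!(
            "<ruby>{}<rp>(</rp><rt>{}</rt><rp>)</rp></ruby>",
            hanja, hangul
        ),
    }
}
```

For example, `render("漢字", "한자", Rendering::HangulHanjaParens)` produces 한자(漢字), the same shape as the `--rendering hangul-hanja-parens` output above.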

Crates

The project is a Cargo workspace. All crates share the same version.

Crate                   Description
gukhanmun-core          Format-neutral IR, engine, dictionary trait, lattice segmenter, fallback phoneticizer. no_std + alloc.
gukhanmun-html          HTML fragment reader and writer; HtmlScopeData with lang inheritance and preserved-tag handling.
gukhanmun-markdown      Markdown adapter over pulldown-cmark; inline HTML is re-scanned for lang attributes.
gukhanmun-fst           FST-backed HanjaDictionary implementation for mmap-friendly on-disk dictionaries.
gukhanmun-cdb           CDB-trie HanjaDictionary implementation; trivially auditable on-disk format.
gukhanmun-stdict        The bundled Standard Korean Language Dictionary as an embedded FST byte array.
gukhanmun-opendict      The bundled Open Korean Dictionary (우리말샘) data.
gukhanmun-dict-extract  Shared dictionary dump extraction helpers.
gukhanmun-mkdict        CLI tool to build FST or CDB dictionary files from TSV, CSV, or JSON Lines input.
gukhanmun-cli           The gukhanmun command-line binary.
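
The initial sound law (頭音法則) that the fallback phoneticizer applies can be sketched independently of the library. The function below is an illustration, not gukhanmun's implementation: the name `initial_sound_law` is hypothetical, it only rewrites the first syllable of a word, and it ignores the exceptions a real implementation has to handle. It relies on the standard Unicode arithmetic for precomposed hangul syllables.

```rust
const S_BASE: u32 = 0xAC00; // first precomposed hangul syllable (가)
const S_LAST: u32 = 0xD7A3; // last precomposed hangul syllable (힣)
const MEDIALS: u32 = 21; // number of medial vowels
const FINALS: u32 = 28; // number of finals, including "no final"
const NIEUN: u32 = 2; // index of initial ㄴ
const RIEUL: u32 = 5; // index of initial ㄹ
const IEUNG: u32 = 11; // index of initial ㅇ

/// Apply a simplified South Korean initial sound law to the first syllable.
fn initial_sound_law(word: &str) -> String {
    let mut chars = word.chars();
    let first = match chars.next() {
        Some(c) => c,
        None => return String::new(),
    };
    let code = first as u32;
    if !(S_BASE..=S_LAST).contains(&code) {
        return word.to_string(); // not a precomposed syllable: leave as-is
    }
    // Decompose the syllable into initial, medial, and final indices.
    let idx = code - S_BASE;
    let initial = idx / (MEDIALS * FINALS);
    let medial = (idx / FINALS) % MEDIALS;
    let final_ = idx % FINALS;
    let new_initial = match initial {
        // 녀, 뇨, 뉴, 니 → 여, 요, 유, 이 (medials ㅕ ㅛ ㅠ ㅣ)
        NIEUN if matches!(medial, 6 | 12 | 17 | 20) => IEUNG,
        // 랴, 려, 례, 료, 류, 리 → 야, 여, 예, 요, 유, 이
        RIEUL if matches!(medial, 2 | 6 | 7 | 12 | 17 | 20) => IEUNG,
        // 라, 래, 로, 뢰, 루, 르 → 나, 내, 노, 뇌, 누, 느
        // (simplification: every other ㄹ syllable is treated the same way)
        RIEUL => NIEUN,
        other => other,
    };
    // Recompose the first syllable and append the rest unchanged.
    let new_code = S_BASE + (new_initial * MEDIALS + medial) * FINALS + final_;
    let mut out = String::new();
    out.push(char::from_u32(new_code).unwrap_or(first));
    out.extend(chars);
    out
}
```

Under these assumptions, a per-character fallback reading such as 로동 for 勞動 becomes 노동, and 력사 for 歷史 becomes 역사. Words found in a dictionary would skip this step, since their entries already carry the correct reading.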

npm/JSR packages

The project also publishes seven JavaScript packages, all sharing the same version as the Rust crates.

Package                  Registries  Description
@gukhanmun/types         JSR, npm    TypeScript type declarations shared by the WASM and NAPI packages. No runtime code.
@gukhanmun/wasm          JSR, npm    WebAssembly build. Runs in browsers, Deno, Node.js, and Bun.
@gukhanmun/napi          npm         Native Node.js addon via napi-rs. Faster than WASM for server-side use.
@gukhanmun/stdict-fst    JSR, npm    Bundled Standard Korean Language Dictionary in FST format.
@gukhanmun/stdict-cdb    JSR, npm    Bundled Standard Korean Language Dictionary in CDB format.
@gukhanmun/opendict-fst  JSR, npm    Bundled Open Korean Dictionary categories in FST format.
@gukhanmun/opendict-cdb  JSR, npm    Bundled Open Korean Dictionary categories in CDB format.

Design documentation

DESIGN.md covers the full architecture: intermediate representation, lattice segmentation algorithm, dictionary trait design, middleware system, and format adapter internals.

License

Distributed under the GNU General Public License, version 3 only (GPL-3.0-only). See LICENSE.
