zhhz
Self-contained Simplified/Traditional Chinese converter — a pure-Rust, data-embedded reimplementation of OpenCC.
zhhz converts between Simplified and Traditional Chinese (plus Taiwan, Hong Kong, and Japanese-shinjitai variants) using the OpenCC dictionaries, and detects the script variant of Chinese text. All dictionaries are embedded in the binary at compile time — one ~1.86 MB static binary (or 588 KB xz-compressed) with no runtime download and no separate data directory.
The name is a palindrome with two readings: zh plus hanzi (hz), and the initials of zhuan huan han zi (转换汉字, "convert Chinese characters").
Why
OpenCC is the de-facto Chinese-conversion library. zhhz is a from-scratch
Rust reimplementation built around the same dictionaries, designed to be:
- One self-contained binary. Data is embedded via include_str!; nothing is fetched or installed alongside it.
- Memory-safe by construction: pure Rust in the conversion core.
- Friendly to custom conversion words at the highest priority, for terminology, branding, or domain vocabulary.
- Tracked against upstream data via a pinned, reproducible sync script (scripts/sync-opencc.sh).
Designed for AI agents
zhhz is built first and foremost for AI agents (Claude, Cursor, custom
LLM pipelines, batch jobs). The CLI is deliberately minimal:
- No TUI, no progress bars, no spinners. Output is plain text on stdout; errors go to stderr. An agent can capture both and parse deterministically.
- stdin / stdout friendly. Pipe text in, get text out. Files are positional arguments; - means stdin.
- Stable, predictable, safe. Same input → byte-identical output every time. No network, no filesystem writes unless asked (--in-place), no temp files, no background processes.
- Batch / filelist from stdin (chardet-style): <files>..., --files-from <PATH|->, -0 / --null, recursive directory walking.
- Single self-contained binary. No native deps, no data files to ship alongside. Drop it in a container and it just works.
If you want a fancy interactive experience, this is the wrong tool — use OpenCC or a web demo. If you want a thing you can shell out to from a script or hand to an agent, this is it.
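As an illustration of the "shell out from a script" workflow, here is a minimal Rust sketch of a helper that pipes text through any stdin/stdout filter. The name `pipe_through` is hypothetical and is not part of zhhz; the sketch only assumes the behaviour described above (text in on stdin, text out on stdout, errors on stderr, non-zero exit on failure).

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Hypothetical helper: feed `input` to `program` on stdin and return its stdout.
/// A non-zero exit status is turned into an error carrying the child's stderr.
pub fn pipe_through(program: &str, args: &[&str], input: &str) -> io::Result<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Write on a separate thread so a large input cannot deadlock against
    // a child that is blocked on a full stdout pipe. Dropping `stdin` at the
    // end of the closure closes the pipe and signals end of input.
    let mut stdin = child.stdin.take().expect("stdin was requested as piped");
    let payload = input.to_owned();
    let writer = thread::spawn(move || stdin.write_all(payload.as_bytes()));

    let output = child.wait_with_output()?;
    let write_result = writer.join().expect("writer thread panicked");

    // Report the child's own error message first, if it failed.
    if !output.status.success() {
        let msg = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, msg));
    }
    write_result?;

    String::from_utf8(output.stdout)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Assuming zhhz is on PATH, a call such as `pipe_through("zhhz", &["-c", "s2t"], "汉字\n")` would run the same conversion as the first example in the Usage section.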
Install
Cargo
cargo install zhhz
Direct binary
curl -L https://github.com/ljh-sh/zhhz/releases/latest/download/zhhz-x86_64-unknown-linux-musl.tar.xz | tar xJf -
sudo mv zhhz-x86_64-unknown-linux-musl/bin/zhhz /usr/local/bin/
Build from source
Requires Rust 1.74+.
git clone https://github.com/ljh-sh/zhhz
cd zhhz
cargo build --release # binary at target/release/zhhz
npm
npm install zhhz
Same conversion core, compiled to WebAssembly. Zero native deps; the
OpenCC dictionaries are baked into the .wasm. See
docs/npm.md for the full API and
examples/node-usage/ for a runnable demo.
The npm API surface is strictly richer than opencc-js (adds
detect(), introspection, Converter factory class, semantic region
flags).
Usage
echo '汉字' | zhhz # default s2t: 漢字
echo '漢字' | zhhz -c t2s # t2s: 汉字
echo '信息' | zhhz -c s2twp # s2twp: 資訊
zhhz -c s2t input.txt # convert a file
zhhz -c s2t -i input.txt # rewrite in place
zhhz --list # list all configs
Configs (mirrors OpenCC):
| config | direction |
|---|---|
| s2t / t2s | Simplified ↔ Traditional (OpenCC standard) |
| s2tw / tw2s | Simplified ↔ Traditional (Taiwan) |
| s2twp / tw2sp | …with Taiwan phrases |
| s2hk / hk2s | Simplified ↔ Traditional (Hong Kong) |
| s2hkp / hk2sp | …with Hong Kong phrases |
| t2tw / tw2t | Traditional (standard) ↔ Taiwan |
| t2hk / hk2t | Traditional (standard) ↔ Hong Kong |
| t2jp / jp2t | Japanese Kyūjitai ↔ Shinjitai |
Or use semantic region flags (--from / --to):
echo '汉字' | zhhz --from cn-s --to cn-t # 漢字
echo '信息' | zhhz --from cn-s --to cn-tw # 資訊 (Taiwan phrases)
echo '鼠标' | zhhz --from cn-s --to cn-tw # 滑鼠
echo '漢字' | zhhz --from cn-tw --to cn-s # simplified
echo '万与两' | zhhz --from jp-n --to cn-t # 萬與兩
Regions: cn-s / cn-t / cn-tw / cn-hk / jp-t / jp-n.
Detect the script variant of Chinese text
echo '汉字计算机软件' | zhhz detect # cn-s 57 -
echo '漢字計算機軟體' | zhhz detect # cn-t 66 -
echo 'こんにちは世界' | zhhz detect # jp-n 50 -
zhhz detect corpus.txt # cn-s ... corpus.txt
zhhz detect # detect content piped on stdin
Output is tab-separated: <region>\t<confidence>\t<path>. Confidence is 0–100
(share of signature characters in the input). Region codes are the same six
listed above, or unknown when there are no CJK characters / kana.
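Because the output is a fixed three-column TSV, a caller can parse it without a dedicated library. The following is a minimal sketch, not part of zhhz: the `Detection` struct and `parse_detect_line` function are hypothetical names, and the only assumptions are the `<region>\t<confidence>\t<path>` layout and the 0–100 confidence range stated above.

```rust
/// Hypothetical record for one line of `zhhz detect` output.
#[derive(Debug, PartialEq)]
pub struct Detection {
    pub region: String,
    pub confidence: u8,
    pub path: String,
}

/// Parse one `<region>\t<confidence>\t<path>` line.
/// Returns None when a field is missing or the confidence is not in 0..=100.
pub fn parse_detect_line(line: &str) -> Option<Detection> {
    // Strip the trailing newline (and a carriage return, if any).
    let line = line.trim_end_matches(|c: char| c == '\n' || c == '\r');

    // Split into at most three fields so a path containing tabs stays intact.
    let mut fields = line.splitn(3, '\t');
    let region = fields.next()?;
    let confidence: u8 = fields.next()?.trim().parse().ok()?;
    let path = fields.next()?;

    if region.is_empty() || confidence > 100 {
        return None;
    }
    Some(Detection {
        region: region.to_string(),
        confidence,
        path: path.to_string(),
    })
}
```

Splitting with `splitn(3, ...)` is a deliberate choice: the path is the last column, so any tab characters inside it are preserved rather than treated as extra fields.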
zhhz detect mirrors chardet's CLI:
<files>... to detect each path, - (or no args) to detect stdin content,
--files-from <PATH|-> to read a newline-separated list of paths, -0 /
--null for NUL-separated lists, and recursive directory walking.
Custom dictionaries
A custom dictionary is a TSV file (key<TAB>value); lines starting with # are ignored.
Entries override the built-in tables at the highest priority:
# mywords.txt
# key value
软件 軟體
独家 獨家
echo '买软件吃独家' | zhhz -c s2t --dict mywords.txt # 買軟體喫獨家
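For library users who want to load the same file format themselves, here is a minimal sketch of a TSV reader. `parse_custom_dict` is a hypothetical helper, not zhhz's own parser; it assumes that comment lines start with `#` and it silently skips lines without a tab, which may differ from how the CLI treats malformed input.

```rust
/// Hypothetical reader for the `key<TAB>value` format described above.
/// Blank lines, `#` comment lines, and lines without a tab are skipped.
pub fn parse_custom_dict(text: &str) -> Vec<(String, String)> {
    text.lines()
        .filter(|line| !line.trim().is_empty() && !line.starts_with('#'))
        .filter_map(|line| {
            // Split on the first tab only; the rest of the line is the value.
            let (key, value) = line.split_once('\t')?;
            if key.is_empty() || value.is_empty() {
                return None;
            }
            Some((key.to_string(), value.to_string()))
        })
        .collect()
}
```

The resulting pairs have the same shape as the custom-word list passed to `Converter::with_custom` in the Library section, although the exact parameter type that function accepts is not shown here.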
Library
use zhhz::{Config, Converter};
let c = Converter::new(Config::S2t);
assert_eq!(c.convert("汉字"), "漢字");
// Custom words override the built-in tables.
let c = Converter::with_custom(Config::S2t, &[("软件".into(), "軟體".into())]);
assert_eq!(c.convert("买软件"), "買軟體");
The engine is pure Rust with a tiny dependency tree (serde_json, anyhow) and no
filesystem or network access, so it is straightforward to bind from WASM and Python
(see the Roadmap for status).
Node.js / npm
npm install zhhz
import { convert, detect, Converter, listConfigs } from "zhhz";
console.log(convert("汉字", "s2t")); // 漢字
console.log(detect("他去了西維珍尼亞州")); // { region: "cn-hk", confidence: 70 }
const c = new Converter("s2twp");
console.log(c.convert("信息")); // 資訊
console.log(c.convertWithCustom("买软件", [["软件", "軟體"]])); // 買軟體
console.log(listConfigs()); // 16 OpenCC config names
The npm package ships the same engine compiled to WebAssembly; dictionaries
are embedded, so there is no data directory to ship alongside and no network
fetch at runtime. The surface is strictly richer than opencc-js (adds
detect(), introspection, factory instance, semantic region flags).
See docs/npm.md for the full reference and
examples/node-usage/ for a runnable demo.
How it works
zhhz reproduces OpenCC's pipeline exactly:
- Segment the input with forward maximum matching (FMM) against the segmentation dictionary group.
- Convert each segment through an ordered chain of dictionary groups; each stage re-walks its segment with longest-prefix matching, emitting the first candidate on a match.
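The segmentation step can be sketched as follows. This is a simplified illustration of forward maximum matching, not zhhz's actual code: it assumes the segmentation dictionary is a plain `HashSet` of words, and the function name `segment_fmm` is hypothetical.

```rust
use std::collections::HashSet;

/// Forward maximum matching: at each position take the longest dictionary
/// word that starts there, or a single character if nothing matches.
pub fn segment_fmm<'a>(text: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    // The longest entry (in characters) bounds how far we need to look ahead.
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);

    let mut segments = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Byte offsets at which a prefix of 1, 2, ... characters ends.
        let ends: Vec<usize> = rest
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .take(max_len.max(1))
            .collect();

        // Try the longest candidate first; fall back to one character.
        let end = ends
            .iter()
            .rev()
            .copied()
            .find(|&e| dict.contains(&rest[..e]))
            .unwrap_or(ends[0]);

        segments.push(&rest[..end]);
        rest = &rest[end..];
    }
    segments
}
```

With a dictionary containing 信息, 信, and 计算机, the input 信息与计算机 is split into 信息, 与, and 计算机: the two-character word wins over its one-character prefix, and the unmatched 与 becomes a segment of its own.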
The group match semantics match OpenCC's PrefixMatch: the highest-priority
dictionary with any prefix wins (priority dominates length across dictionaries;
length dominates only within one dictionary).
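The priority rule is easy to misread, so here is a small sketch of it. It models each dictionary as a `HashMap` and scans it linearly, which is far simpler (and slower) than a real implementation; `Dict`, `longest_prefix`, and `group_match` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary model: key -> candidate conversions.
pub type Dict = HashMap<String, Vec<String>>;

/// Longest key of one dictionary that is a prefix of `text`,
/// paired with that key's first candidate.
fn longest_prefix<'d>(dict: &'d Dict, text: &str) -> Option<(&'d str, &'d str)> {
    dict.iter()
        .filter(|(k, v)| !k.is_empty() && !v.is_empty() && text.starts_with(k.as_str()))
        .max_by_key(|(k, _)| k.len())
        .map(|(k, v)| (k.as_str(), v[0].as_str()))
}

/// Dictionaries are listed from highest to lowest priority.
/// The first one with any prefix match wins, even if a later
/// dictionary holds a longer match.
pub fn group_match<'d>(group: &'d [Dict], text: &str) -> Option<(&'d str, &'d str)> {
    group.iter().find_map(|dict| longest_prefix(dict, text))
}
```

If a high-priority dictionary contains only 软 and a lower-priority one contains 软件, matching 软件好 returns the one-character entry from the high-priority dictionary. If both keys live in the same dictionary, the longer 软件 is chosen.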
The OpenCC build system generates five dictionaries at build time (reversed
variant tables, a tofu-risk subset, and a regional-phrase projection). build.rs
reproduces all five deterministically from the vendored source data, so data/
stays a pure mirror of upstream.
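As a rough idea of what "reversing" a table deterministically can look like, here is a sketch that inverts a key-to-values table into a value-to-keys table. It is an assumption-laden illustration, not the logic in build.rs: `reverse_table` is a hypothetical name, the input shape is invented for the example, and the ordering rules OpenCC itself applies may differ.

```rust
use std::collections::BTreeMap;

/// Invert `key -> [values]` into `value -> [keys]`.
/// A BTreeMap keeps the output keys sorted, and source keys are appended
/// in input order, so the same input always yields the same output.
pub fn reverse_table(forward: &[(&str, Vec<&str>)]) -> BTreeMap<String, Vec<String>> {
    let mut reversed: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (key, values) in forward {
        for value in values {
            let keys = reversed.entry(value.to_string()).or_default();
            // Skip duplicates so repeated source rows do not repeat a key.
            if !keys.iter().any(|k| k.as_str() == *key) {
                keys.push(key.to_string());
            }
        }
    }
    reversed
}
```

Using an ordered map rather than a hash map is what makes the result reproducible across builds, which is the property the paragraph above relies on.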
Data and licensing
Dictionary data is vendored from BYVoid/OpenCC
(see data/UPSTREAM for the pinned commit) and is
Apache-2.0, same as the source code. Re-vendor the latest upstream data with:
scripts/sync-opencc.sh # master HEAD
scripts/sync-opencc.sh 1.3.1 # a specific tag/commit
Roadmap
- Pure-Rust engine, all 16 OpenCC configs, embedded data, custom words
- WASM build + npm package (wasm32-unknown-unknown): npm install zhhz
- Differential-fuzz harness proving output parity vs the opencc CLI
- Python native extension (PyO3 / maturin)
- Compact dictionary representation (FST / double-array) for smaller binaries
See ROADMAP.md.
Contributing
See CONTRIBUTING.md. Issues and PRs are welcome.
Security
See SECURITY.md. For vulnerabilities, email lijunhao@x-cmd.com rather than opening a public issue.
License
Apache 2.0 — see LICENSE. Dictionary data is Apache-2.0, vendored from OpenCC.