Skip to main content

Self-contained Simplified/Traditional Chinese converter — pure-Rust, data-embedded OpenCC reimplementation

Project description

zhhz

CI Parity OpenSSF Scorecard License

Self-contained Simplified/Traditional Chinese converter — a pure-Rust, data-embedded reimplementation of OpenCC.

zhhz converts between Simplified and Traditional Chinese (plus Taiwan, Hong Kong, and Japanese-shinjitai variants) using the OpenCC dictionaries, and detects the script variant of Chinese text. All dictionaries are embedded in the binary at compile time — one ~1.86 MB static binary (or 588 KB xz-compressed) with no runtime download and no separate data directory.

The name is a palindrome: zh hanzi, and zhuan huan han zi (转换汉字, "convert Chinese characters").

Why

OpenCC is the de-facto Chinese-conversion library. zhhz is a from-scratch Rust reimplementation built around the same dictionaries, designed to be:

  • One self-contained binary. Data is embedded via include_str!; nothing is fetched or installed alongside it.
  • Memory-safe by construction — pure Rust in the conversion core.
  • Friendly to custom conversion words at the highest priority, for terminology, branding, or domain vocabulary.
  • Tracked against upstream data via a pinned, reproducible sync script (scripts/sync-opencc.sh).

Designed for AI agents

zhhz is built first and foremost for AI agents (Claude, Cursor, custom LLM pipelines, batch jobs). The CLI is deliberately minimal:

  • No TUI, no progress bars, no spinners. Output is plain text on stdout; errors go to stderr. An agent can capture both and parse deterministically.
  • stdin / stdout friendly. Pipe text in, get text out. Files are positional arguments; - means stdin.
  • Stable, predictable, safe. Same input → byte-identical output every time. No network, no filesystem writes unless asked (--in-place), no temp files, no background processes.
  • Batch / filelist from stdin (chardet-style): <files>..., --files-from <PATH|->, -0 / --null, recursive directory walking.
  • Single self-contained binary. No native deps, no data files to ship alongside. Drop it in a container and it just works.

If you want a fancy interactive experience, this is the wrong tool — use OpenCC or a web demo. If you want a thing you can shell out to from a script or hand to an agent, this is it.

Install

Cargo

cargo install zhhz

Direct binary

curl -L https://github.com/ljh-sh/zhhz/releases/latest/download/zhhz-x86_64-unknown-linux-musl.tar.xz | tar xJ -
sudo mv zhhz-x86_64-unknown-linux-musl/bin/zhhz /usr/local/bin/

Build from source

Requires Rust 1.74+.

git clone https://github.com/ljh-sh/zhhz
cd zhhz
cargo build --release   # binary at target/release/zhhz

npm

npm install zhhz

Same conversion core, compiled to WebAssembly. Zero native deps; the OpenCC dictionaries are baked into the .wasm. See docs/npm.md for the full API and examples/node-usage/ for a runnable demo. The npm API surface is strictly richer than opencc-js (adds detect(), introspection, Converter factory class, semantic region flags).

Usage

echo '汉字' | zhhz                       # default s2t:  漢字
echo '漢字' | zhhz -c t2s                # t2s:          汉字
echo '信息' | zhhz -c s2twp              # s2twp:        資訊
zhhz -c s2t input.txt                   # convert a file
zhhz -c s2t -i input.txt                # rewrite in place
zhhz --list                             # list all configs

Configs (mirrors OpenCC):

config direction
s2t / t2s Simplified ↔ Traditional (OpenCC standard)
s2tw / tw2s Simplified ↔ Traditional (Taiwan)
s2twp / tw2sp …with Taiwan phrases
s2hk / hk2s Simplified ↔ Traditional (Hong Kong)
s2hkp / hk2sp …with Hong Kong phrases
t2tw / tw2t Traditional (standard) ↔ Taiwan
t2hk / hk2t Traditional (standard) ↔ Hong Kong
t2jp / jp2t Japanese Kyūjitai ↔ Shinjitai

Or use semantic region flags (--from / --to):

echo '汉字'   | zhhz --from cn-s --to cn-t      # 漢字
echo '信息'   | zhhz --from cn-s --to cn-tw     # 資訊 (Taiwan phrases)
echo '鼠标'   | zhhz --from cn-s --to cn-tw     # 滑鼠
echo '漢字'   | zhhz --from cn-tw --to cn-s     # simplified
echo '万与两' | zhhz --from jp-n --to cn-t      # 萬與兩

Regions: cn-s / cn-t / cn-tw / cn-hk / jp-t / jp-n.

Detect the script variant of Chinese text

echo '汉字计算机软件' | zhhz detect          # cn-s    57   -
echo '漢字計算機軟體' | zhhz detect          # cn-t    66   -
echo 'こんにちは世界' | zhhz detect          # jp-n    50   -
zhhz detect corpus.txt                      # cn-s    ...  corpus.txt
zhhz detect                                 # detect content piped on stdin

Output is tab-separated: <region>\t<confidence>\t<path>. Confidence is 0–100 (share of signature characters in the input). Region codes are the same six listed above, or unknown when there are no CJK characters / kana.

zhhz detect mirrors chardet's CLI: <files>... to detect each path, - (or no args) to detect stdin content, --files-from <PATH|-> to read a newline-separated list of paths, -0 / --null for NUL-separated lists, and recursive directory walking.

Custom dictionaries

A custom dictionary is a TSV file (key<TAB>value); # lines are ignored. Entries override the built-in tables at the highest priority:

# mywords.txt
# key	value
软件	軟體
独家	獨家

echo '买软件吃独家' | zhhz -c s2t --dict mywords.txt   # 買軟體喫獨家

Library

use zhhz::{Config, Converter};

let c = Converter::new(Config::S2t);
assert_eq!(c.convert("汉字"), "漢字");

// Custom words override the built-in tables.
let c = Converter::with_custom(Config::S2t, &[("软件".into(), "軟體".into())]);
assert_eq!(c.convert("买软件"), "買軟體");

The engine is pure Rust with a tiny dependency tree (serde_json, anyhow) and no filesystem or network access, so it is straightforward to bind from WASM and Python (both are on the roadmap).

Node.js / npm

npm install zhhz
import { convert, detect, Converter, listConfigs } from "zhhz";

console.log(convert("汉字", "s2t"));            // 漢字
console.log(detect("他去了西維珍尼亞州"));      // { region: "cn-hk", confidence: 70 }

const c = new Converter("s2twp");
console.log(c.convert("信息"));                 // 資訊
console.log(c.convertWithCustom("买软件", [["软件", "軟體"]])); // 買軟體

console.log(listConfigs()); // 16 OpenCC config names

The npm package ships the same engine compiled to WebAssembly; dictionaries are embedded, so there is no data directory to ship alongside and no network fetch at runtime. The surface is strictly richer than opencc-js (adds detect(), introspection, factory instance, semantic region flags). See docs/npm.md for the full reference and examples/node-usage/ for a runnable demo.

How it works

zhhz reproduces OpenCC's pipeline exactly:

  1. Segment the input with forward maximum matching (FMM) against the segmentation dictionary group.
  2. Convert each segment through an ordered chain of dictionary groups; each stage re-walks its segment with longest-prefix matching, emitting the first candidate on a match.

The group match semantics match OpenCC's PrefixMatch: the highest-priority dictionary with any prefix wins (priority dominates length across dictionaries; length dominates only within one dictionary).

The OpenCC build system generates five dictionaries at build time (reversed variant tables, a tofu-risk subset, and a regional-phrase projection). build.rs reproduces all five deterministically from the vendored source data, so data/ stays a pure mirror of upstream.

Data and licensing

Dictionary data is vendored from BYVoid/OpenCC (see data/UPSTREAM for the pinned commit) and is Apache-2.0, same as the source code. Re-vendor the latest upstream data with:

scripts/sync-opencc.sh            # master HEAD
scripts/sync-opencc.sh 1.3.1      # a specific tag/commit

Roadmap

  • Pure-Rust engine, all 16 OpenCC configs, embedded data, custom words
  • WASM build + npm package (wasm32-unknown-unknown) — npm install zhhz
  • Differential-fuzz harness proving output parity vs the opencc CLI
  • Python native extension (PyO3 / maturin)
  • Compact dictionary representation (FST / double-array) for smaller binaries

See ROADMAP.md.

Contributing

See CONTRIBUTING.md. Issues and PRs are welcome.

Security

See SECURITY.md. For vulnerabilities, email lijunhao@x-cmd.com rather than opening a public issue.

License

Apache 2.0 — see LICENSE. Dictionary data is Apache-2.0, vendored from OpenCC.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

zhhz-0.7.8.tar.gz (1.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

zhhz-0.7.8-cp312-cp312-manylinux_2_34_x86_64.whl (847.6 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

File details

Details for the file zhhz-0.7.8.tar.gz.

File metadata

  • Download URL: zhhz-0.7.8.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.14.1

File hashes

Hashes for zhhz-0.7.8.tar.gz
Algorithm Hash digest
SHA256 313d05abd5fccca2821c64b5094dc6c7b1b70ecc586ed5799b87229c45eefd8f
MD5 e584d2f53382a54562c431e06692259e
BLAKE2b-256 0c576952781d4a4f50f8dab39ffaec2d268ba2929096439e58edb85697a2774a

See more details on using hashes here.

File details

Details for the file zhhz-0.7.8-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for zhhz-0.7.8-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 d6f640a379fdd1b9d17bc24fdac21efec06da2d4628ef381871b00a182ba7567
MD5 2e9f237cc8553e869d52c4f94c1e7c8e
BLAKE2b-256 33d97a4d7f10d8ea07f067b639c2605e6cc7005e6577b107d7a7be11c84a3c85

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page