LLM-generated CJK corruption linter — catches valid-but-wrong kanji/hanzi that grep and tests miss
mojihen
LLM-generated CJK corruption linter. Catches valid-but-wrong kanji, hanzi, and hangul that language models emit silently. This is the class of bug that grep, unit tests, and existing Unicode safety tools all pass as a false green.
demo/sample.py:20:1 MH001 HIGH '闾' -> likely: 閾
'闾' is a known LLM corruption (likely intended: 閾) [rare_drift]
demo/sample.py:23:1 MH001 HIGH '耒' -> likely: 耐
'耒' is a known LLM corruption (likely intended: 耐) [decomposition]
The problem
When an LLM writes Japanese, Chinese, or Korean copy, it does not corrupt bytes — it substitutes a real, valid character that looks or sounds close to the intended one. The wrong glyph is itself a legitimate Unicode codepoint.
Six observed cases (LLM-generated Japanese)
| Intended | LLM emitted | Class | Why it hid |
|---|---|---|---|
| 閾 (threshold) | 闾 (village gate, U+95FE) | rare drift | 閾 is uncommon; LLM drifted to a similar-looking gate-radical character |
| 耐 (endure) | 耒 (plow radical, U+8012) | decomposition | 耐→耒耗 radical fragment; 耒 alone near-absent in modern JA |
| 滞 (stagnation) | 滹 (river name, U+6EF9) | radical | Radical visual confusion |
| 事 (matter, U+4E8B) | 亊 (rare variant, U+4E8A) | rare variant | Adjacent codepoints; visually near-identical |
| 愛 (love) | 感 (feeling) | visual/semantic | Both common; low-confidence in corpus (see below) |
| 敏 (nimble) | 敢 (bold) | shape | Stroke near-miss; low-confidence |
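The first row of the table can be reproduced with the standard library alone. This is a minimal sketch, not part of mojihen; the `source` string and the `describe` helper are made up for illustration. It shows that the emitted character is an ordinary, named letter, and that a search for the intended word finds nothing.

```python
import unicodedata

INTENDED = "\u95be"  # 閾 (threshold)
EMITTED = "\u95fe"   # 闾 (village gate)

# Hypothetical line of LLM-written source: meant to contain 閾値, contains 闾値.
source = 'LABEL = "' + EMITTED + '\u5024"'


def describe(ch: str) -> str:
    """Return the codepoint, Unicode name, and general category of one character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name} [{unicodedata.category(ch)}]"


print(describe(INTENDED))
print(describe(EMITTED))

# Both characters are category "Lo" (letter, other): nothing is malformed.
# A grep-style search for the intended word simply does not match.
print((INTENDED + "\u5024") in source)
```

Because nothing is invalid at the byte or codepoint level, encoding checks and Unicode-safety scanners have nothing to report here.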
Why existing tools miss it
- grep / ripgrep: searches for the intended string; the wrong glyph simply does not match. Silent.
- Unit tests: assertions were written against the already-corrupted value. They pass. This actually happened.
- Unicode safety linters (bidichk, anti-trojan-source, unicode-safety-check): target adversarial Unicode (invisible chars, bidi overrides, homoglyphs). These substitutions are visible, in-script, and non-adversarial, so they are out of scope for those tools.
- Chinese Spell Check (CSC) research: models that correct human typos; not packaged as a dev linter / CI gate / agent hook.
mojihen is first-in-category for this failure mode.
Install
pip install mojihen
Python 3.9+ required. Zero runtime dependencies beyond stdlib.
(tomllib is used on Python 3.11+; on older versions, config file parsing
gracefully degrades to defaults if tomli is not installed.)
CLI usage
# Scan a file or directory
mojihen src/
# Scan with explicit options
mojihen src/ --format tty --fail-on high
# Output machine-readable JSON
mojihen src/ --format json > findings.json
# Output SARIF (for GitHub code scanning)
mojihen src/ --format sarif > mojihen.sarif
# Scan all text (bypass type-aware extraction)
mojihen src/ --all-text
# Use a custom config
mojihen src/ --config path/to/mojihen.toml
Exit codes
| Code | Meaning |
|---|---|
| 0 | No findings at or above the fail threshold |
| 1 | One or more findings at or above the fail threshold |
| 2 | Usage error, or agent hook blocked a write |
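When calling the CLI from a script, the exit codes above can be mapped back to their meanings. The sketch below assumes `mojihen` is on `PATH` and uses only the `--fail-on` flag shown earlier; `describe_exit` and `run_mojihen` are illustrative names, not part of the package.

```python
import subprocess
from typing import Sequence

# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {
    0: "clean: no findings at or above the fail threshold",
    1: "findings at or above the fail threshold",
    2: "usage error, or agent hook blocked a write",
}


def describe_exit(code: int) -> str:
    """Translate a mojihen exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_mojihen(paths: Sequence[str], fail_on: str = "high") -> int:
    """Run the CLI on the given paths and return its exit code.

    check=False because a non-zero exit is an expected result, not an error.
    """
    proc = subprocess.run(["mojihen", *paths, "--fail-on", fail_on], check=False)
    return proc.returncode
```

A build script might call `describe_exit(run_mojihen(["src/"]))` and treat only code 1 as a lint failure, while surfacing code 2 as a setup problem.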
pre-commit
Add to .pre-commit-config.yaml:
repos:
  - repo: https://github.com/hryoma1217/mojihen
    rev: v0.1.0
    hooks:
      - id: mojihen
This uses the bundled .pre-commit-hooks.yaml which runs
mojihen --fail-on high on every staged file.
GitHub Action
# .github/workflows/mojihen.yml
name: CJK corruption check
on: [push, pull_request]
jobs:
  mojihen:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hryoma1217/mojihen@v0.1.0
        with:
          paths: src/
          fail-on: high
          format: sarif
          sarif-output: mojihen.sarif
Findings appear in the GitHub Security tab (code scanning).
Agent hook (Claude Code / Codex)
The primary use case: scan text immediately after the agent writes it, and bounce corrupt output back to the model right away.
Claude Code (PostToolUse)
In .claude/settings.json:
{
"hooks": {
"PostToolUse": [
{
"matcher": "Write|Edit",
"hooks": [
{ "type": "command", "command": "mojihen hook --stdin" }
]
}
]
}
}
Codex
In .codex/config.toml:
[hooks]
post_write = "mojihen hook --stdin"
What happens on corruption
mojihen: BLOCKED - LLM CJK corruption detected
src/strings.py:3:18 MH001 HIGH '闾' -> likely: 閾
src/strings.py:5:12 MH001 HIGH '耒' -> likely: 耐
Verify the intended CJK text and rewrite before proceeding.
The hook exits 2; the agent sees the block reason and retries with corrected text.
See hooks/claude-code.md and hooks/codex.md for full setup instructions.
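To make the block-and-retry flow concrete, here is a rough sketch of what a hook of this shape could do. It is not mojihen's implementation. The payload field names (`tool_input`, `file_path`, `content`, `new_string`) are assumptions about what the agent sends on stdin, and `KNOWN_BAD` is a two-entry stand-in for the real corpus.

```python
import json
import sys
from typing import Tuple

# Stand-in corpus: wrong character -> likely intended character.
KNOWN_BAD = {"\u95fe": "\u95be", "\u8012": "\u8010"}  # 闾 -> 閾, 耒 -> 耐


def check_payload(payload: dict) -> Tuple[int, str]:
    """Return (exit_code, message) for one hook payload (field names assumed)."""
    tool_input = payload.get("tool_input", {})
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    path = tool_input.get("file_path", "<unknown>")

    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in KNOWN_BAD:
                findings.append(
                    f"{path}:{line_no}:{col} MH001 HIGH '{ch}' -> likely: {KNOWN_BAD[ch]}"
                )

    if not findings:
        return 0, ""
    lines = ["mojihen: BLOCKED - LLM CJK corruption detected", *findings,
             "Verify the intended CJK text and rewrite before proceeding."]
    return 2, "\n".join(lines)


def main() -> int:
    """Read one JSON payload from stdin; report to stderr so the agent sees it."""
    code, message = check_payload(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code
```

A script entry point would call `sys.exit(main())`. Exit code 2 is what signals the block, matching the exit-code table above.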
Configuration
Create mojihen.toml in your project root (or [tool.mojihen] in pyproject.toml):
# mojihen.toml
fail_on = "high" # "high" | "medium"
langs = ["ja", "zh", "ko"]
extract = "auto" # "auto" (type-aware) | "all-text"
allow = [] # literal strings/chars to never flag
corpus = [] # extra corpus JSON paths
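The fallback behaviour described under Install (use `tomllib` on 3.11+, otherwise `tomli` if present, otherwise defaults) could look roughly like this. It is a sketch under assumptions: the `load_config` name is made up, the default values are taken from the example above rather than confirmed, and the real loader may differ.

```python
import copy
from pathlib import Path
from typing import Optional

try:
    import tomllib as toml_parser  # standard library on Python 3.11+
except ModuleNotFoundError:
    try:
        import tomli as toml_parser  # optional backport on 3.9 / 3.10
    except ModuleNotFoundError:
        toml_parser = None  # no parser available: fall back to defaults

# Assumed defaults, mirroring the keys in the example mojihen.toml.
DEFAULTS = {
    "fail_on": "high",
    "langs": ["ja", "zh", "ko"],
    "extract": "auto",
    "allow": [],
    "corpus": [],
}


def load_config(path: Optional[Path] = None) -> dict:
    """Return defaults overlaid with recognised keys from a TOML file, if readable."""
    config = copy.deepcopy(DEFAULTS)
    if path is None or toml_parser is None or not path.is_file():
        return config
    with path.open("rb") as fh:
        data = toml_parser.load(fh)
    # pyproject.toml nests settings under [tool.mojihen]; mojihen.toml is top-level.
    section = data.get("tool", {}).get("mojihen", data)
    config.update({k: v for k, v in section.items() if k in DEFAULTS})
    return config
```

Unknown keys are ignored here rather than rejected; a stricter loader could raise on them instead.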
Inline suppression
Suppress findings on a specific line:
# Intentional use of the archaic character (corpus fixture)
FIXTURE = "闾" # mojihen: ignore
# Suppress only a specific rule
FIXTURE = "闾" # mojihen: ignore[MH001]
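One way to recognise these comments is a single regular expression over the line. The sketch below is illustrative only: `is_suppressed` is a made-up name, and support for a comma-separated rule list inside the brackets is an assumption that the examples above do not confirm.

```python
import re

# Matches "# mojihen: ignore" with an optional "[RULE, RULE]" suffix.
_IGNORE = re.compile(r"#\s*mojihen:\s*ignore(?:\[([A-Z0-9,\s]+)\])?")


def is_suppressed(line: str, rule_id: str) -> bool:
    """Return True if the line carries a suppression comment covering rule_id."""
    match = _IGNORE.search(line)
    if match is None:
        return False
    rules = match.group(1)
    if rules is None:
        return True  # bare "ignore" suppresses every rule on the line
    return rule_id in {rule.strip() for rule in rules.split(",")}
```

With this shape, a bare `ignore` silences every rule on the line, while `ignore[MH001]` leaves other detectors such as MH002 active.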
How the corpus works
src/mojihen/data/seed.json is a versioned, schema-validated list of known-wrong chars:
{
"version": 1,
"entries": [
{
"wrong": "闾",
"intended": ["閾"],
"lang": "ja",
"class": "rare_drift",
"evidence": "observed in LLM Japanese output",
"confidence": "high"
}
]
}
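A corpus in this shape lends itself to a character-keyed lookup. The following sketch shows one possible way to index and scan with it. `build_index` and `scan` are hypothetical names, and the real scanner also does type-aware extraction and escape decoding, which this omits.

```python
import json
from typing import Dict, Iterator, Tuple

SEED = """
{
  "version": 1,
  "entries": [
    {"wrong": "闾", "intended": ["閾"], "lang": "ja",
     "class": "rare_drift", "evidence": "observed in LLM Japanese output",
     "confidence": "high"}
  ]
}
"""


def build_index(raw: str) -> Dict[str, dict]:
    """Parse the corpus JSON and key each entry by its wrong character."""
    data = json.loads(raw)
    if data.get("version") != 1:
        raise ValueError(f"unsupported corpus version: {data.get('version')!r}")
    return {entry["wrong"]: entry for entry in data["entries"]}


def scan(text: str, index: Dict[str, dict]) -> Iterator[Tuple[int, int, dict]]:
    """Yield (line, column, entry) for every corpus character found in text."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in index:
                yield line_no, col, index[ch]


for line_no, col, entry in scan('THRESHOLD = "闾値"', build_index(SEED)):
    print(f"{line_no}:{col} MH001 {entry['confidence'].upper()} "
          f"'{entry['wrong']}' -> likely: {', '.join(entry['intended'])} [{entry['class']}]")
```

Keying on single characters keeps the scan linear in the text length. Multi-character `wrong` values, if the schema allows them, would need substring matching instead.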
Confidence tiers
| Tier | Meaning | CLI behaviour |
|---|---|---|
| high | Rare char; near-zero false positives | Fails CI by default |
| medium | Somewhat common; context-dependent | Warns; optionally fails |
| low | Common char; production evidence but ambiguous | Info only |
High-confidence entries are chars like 闾 (U+95FE) that are essentially absent
from modern Japanese/Chinese text and almost certainly signal LLM drift.
Common kanji like 感 are kept at low to avoid flooding legitimate text with
false positives.
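The interaction between confidence tiers and the fail threshold can be pictured as a simple ordering comparison. This is a sketch of that reading, not the tool's code; `exit_code_for` is an illustrative name.

```python
from typing import Iterable

# Higher number = more confident that the character is a corruption.
TIER_ORDER = {"low": 0, "medium": 1, "high": 2}


def exit_code_for(confidences: Iterable[str], fail_on: str = "high") -> int:
    """Return 1 if any finding meets or exceeds the fail threshold, else 0."""
    threshold = TIER_ORDER[fail_on]
    return 1 if any(TIER_ORDER[c] >= threshold for c in confidences) else 0
```

Under this model a run with only `low` and `medium` findings passes at the default threshold, and the same run fails once `fail_on` is lowered to `"medium"`.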
Contributing a new entry
- Confirm the wrong char is a known-bad substitution with evidence (build log, diff, screenshot).
- Confirm wrong != intended and both contain valid CJK.
- Choose "confidence": "high" only if the wrong char is rare in normal text.
- Add to src/mojihen/data/seed.json and run: python -m unittest discover -s tests
- The precision gate (test_precision.py) must still pass with zero MH001 high findings on the clean fixture sentences.
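Before opening a pull request, the checks in this list can be approximated locally. The validator below is a sketch only: `validate_entry` is a made-up helper, and the codepoint ranges are a simplified assumption about what counts as "valid CJK". The project's schema validation may be stricter.

```python
from typing import List

# Simplified ranges (assumption): CJK Ext A, CJK Unified, kana, Hangul syllables, Ext B+.
CJK_RANGES = (
    (0x3400, 0x4DBF),
    (0x4E00, 0x9FFF),
    (0x3040, 0x30FF),
    (0xAC00, 0xD7A3),
    (0x20000, 0x2FFFF),
)


def contains_cjk(text: str) -> bool:
    """True if any character falls inside one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def validate_entry(entry: dict) -> List[str]:
    """Return a list of problems; an empty list means the entry looks acceptable."""
    problems = []
    wrong = entry.get("wrong", "")
    intended = entry.get("intended", [])
    if not contains_cjk(wrong):
        problems.append("'wrong' must contain a CJK character")
    if not intended:
        problems.append("'intended' must list at least one candidate")
    for candidate in intended:
        if candidate == wrong:
            problems.append("'wrong' and 'intended' must differ")
        if not contains_cjk(candidate):
            problems.append(f"intended value {candidate!r} contains no CJK")
    if entry.get("confidence") not in {"high", "medium", "low"}:
        problems.append("'confidence' must be high, medium, or low")
    if not entry.get("evidence"):
        problems.append("'evidence' is required")
    return problems
```

Whether the wrong character is rare enough for `"high"` is not something this check can decide. That judgment still rests on the precision gate and on review.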
Detectors
| ID | Name | Confidence |
|---|---|---|
| MH001 | Corpus hit | high/medium/low (per entry) |
| MH002 | Mixed-script token (Han + Latin/Cyrillic in one identifier) | medium |
| MH003 | Isolated CJK in ASCII identifier / key / URL | medium |
| MH004 | Rare/archaic codepoint (needs Unihan freq table) | deferred |
| MH005 | Decomposition garble (needs radical table) | deferred |
MH004 and MH005 are deferred in v1 — the known MH005 cases (耒耗, etc.) are already covered by individual MH001 corpus entries.
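As an illustration of the MH002 idea, a mixed-script check on a single identifier might be sketched as follows. This is not the detector's actual logic. Script membership is approximated with Han codepoint ranges and Unicode character names, and `is_mixed_script` is a hypothetical helper that expects one identifier, not free prose.

```python
import unicodedata
from typing import Optional


def _script(ch: str) -> Optional[str]:
    """Roughly classify one character as Han, Latin, Cyrillic, or None."""
    if "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf":
        return "Han"
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return None  # digits, underscores, other scripts


def is_mixed_script(identifier: str) -> bool:
    """True if the identifier mixes Han with Latin or Cyrillic letters."""
    scripts = {_script(ch) for ch in identifier} - {None}
    return "Han" in scripts and bool(scripts & {"Latin", "Cyrillic"})
```

Applied to ordinary Japanese prose, which routinely sets Latin acronyms next to kanji, a check like this would be noisy. That is one plausible reason the detector is scoped to identifiers and rated medium.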
Escape decoding
mojihen decodes all escape forms before inspecting text, because LLMs
frequently emit corrupted characters as \uXXXX escapes:
| Form | Example | Decoded |
|---|---|---|
| \uXXXX | \u95FE | 闾 |
| \u{XXXXXX} | \u{95FE} | 闾 |
| Surrogate pair | \uD83D\uDE00 | 😀 |
| \xXX | \x41 | A |
| HTML decimal | &#38398; | 闾 |
| HTML hex | &#x95FE; | 闾 |
| Named entity | &amp; | & |
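The forms in this table can be decoded with a few regular expressions plus `html.unescape` from the standard library. The sketch below is one possible approach, not mojihen's decoder. `decode_escapes` is an illustrative name, and the sequential passes mean that an escaped backslash followed by `u95FE` would be decoded twice, which a production decoder would need to guard against.

```python
import html
import re

_PAIR = re.compile(r"\\u([dD][89abAB][0-9A-Fa-f]{2})\\u([dD][c-fC-F][0-9A-Fa-f]{2})")
_BRACED = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")
_U4 = re.compile(r"\\u([0-9A-Fa-f]{4})")
_X2 = re.compile(r"\\x([0-9A-Fa-f]{2})")


def _from_pair(match: "re.Match[str]") -> str:
    """Combine a UTF-16 high/low surrogate pair into one character."""
    high = int(match.group(1), 16)
    low = int(match.group(2), 16)
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))


def _from_hex(match: "re.Match[str]") -> str:
    """Decode one hex escape; leave lone surrogates and out-of-range values as-is."""
    codepoint = int(match.group(1), 16)
    if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        return match.group(0)
    return chr(codepoint)


def decode_escapes(text: str) -> str:
    """Decode \\uXXXX, \\u{...}, surrogate pairs, \\xXX, and HTML entities."""
    text = _PAIR.sub(_from_pair, text)   # pairs first, before single \uXXXX
    text = _BRACED.sub(_from_hex, text)
    text = _U4.sub(_from_hex, text)
    text = _X2.sub(_from_hex, text)
    return html.unescape(text)           # &#NNNN;, &#xHHHH;, &amp; and friends
```

Surrogate pairs are handled before the single `\uXXXX` pass so that the two halves are combined rather than each being left behind as a lone surrogate.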
Limitations and false-positive controls
- Common kanji: Characters like 感 (feeling), 末 (end), 士 (person) appear in thousands of legitimate Japanese words. They are only added to the corpus at low confidence. Use --fail-on high (the default) to avoid noise.
- Context-free: mojihen does not understand grammar or intent; it pattern-matches against a corpus. False positives in unusual text can be suppressed with allow = [...] in config or inline # mojihen: ignore.
- MH002/MH003 are medium-confidence and require --fail-on medium to fail CI. They are informational by default.
- The clean-corpus precision gate (tests/test_precision.py) must stay green; this is the automated false-positive guard.
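Japanese section
mojihen (文字変) is a linter that detects kanji which are valid Unicode code points but differ from the intended character in LLM-generated Japanese, Chinese, and Korean text.
grep and unit tests cannot detect this kind of corruption, because the wrong character is also legitimate Unicode and the tests were written against the already-corrupted value.
mojihen combines a corpus of known substitution patterns (src/mojihen/data/seed.json) with decoding of escape forms (\uXXXX, &#NNNN;, etc.), and runs in CI, in pre-commit, and as an AI agent hook (PostToolUse).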
Development
git clone https://github.com/hryoma1217/mojihen
cd mojihen
pip install -e ".[dev]"
python -m unittest discover -s tests -v
License
MIT. Copyright 2026 hryoma1217.