LLM-generated CJK corruption linter — catches valid-but-wrong kanji/hanzi that grep and tests miss

mojihen

LLM-generated CJK corruption linter. Catches valid-but-wrong kanji, hanzi, and hangul that language models emit silently: the class of bug that grep, unit tests, and existing Unicode safety tools all pass as false-green.

demo/sample.py:20:1  MH001 HIGH  '闾'  -> likely: 閾
  '闾' is a known LLM corruption (likely intended: 閾)  [rare_drift]
demo/sample.py:23:1  MH001 HIGH  '耒'  -> likely: 耐
  '耒' is a known LLM corruption (likely intended: 耐)  [decomposition]

The problem

When an LLM writes Japanese, Chinese, or Korean copy, it does not corrupt bytes — it substitutes a real, valid character that looks or sounds close to the intended one. The wrong glyph is itself a legitimate Unicode codepoint.
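
A quick way to see this, as a minimal sketch using only the standard library: both characters from the first demo finding are assigned codepoints with ordinary Unicode names, so no encoding or validity check can tell the intended one from the emitted one.

```python
import unicodedata

intended, emitted = "閾", "闾"

for label, ch in (("intended", intended), ("emitted", emitted)):
    # unicodedata.name raises ValueError for unassigned codepoints;
    # both of these resolve to a CJK UNIFIED IDEOGRAPH name.
    print(label, ch, f"U+{ord(ch):04X}", unicodedata.name(ch))

# Both survive a UTF-8 round trip unchanged, so byte-level checks pass too.
round_trip_ok = all(
    ch.encode("utf-8").decode("utf-8") == ch for ch in (intended, emitted)
)
```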

Six observed cases (LLM-generated Japanese)

| Intended | LLM emitted | Class | Why it hid |
|---|---|---|---|
| 閾 (threshold) | 闾 (village gate, U+95FE) | rare drift | 閾 is uncommon; the LLM drifted to a nearby codepoint |
| 耐 (endure) | 耒 (plow radical, U+8012) | decomposition | 耐→耒耗 radical fragment; 耒 alone is near-absent in modern JA |
| 滞 (stagnation) | 滹 (river name, U+6EF9) | radical | Visual confusion between radicals |
| 事 (matter) | 亊 (rare variant, U+4E8A) | rare variant | U+4E8A vs U+4E8B: adjacent codepoints, visually near-identical |
| 愛 (love) | 感 (feeling) | visual/semantic | Both common; low-confidence in corpus (see below) |
| 敏 (nimble) | 敢 (bold) | shape | Stroke near-miss; low-confidence |

Why existing tools miss it

  • grep / ripgrep: searches for the intended string; the wrong glyph simply does not match. Silent.
  • Unit tests: assertions were written against the already-corrupted value. They pass. This actually happened.
  • Unicode safety linters (bidichk, anti-trojan-source, unicode-safety-check): target adversarial unicode (invisible chars, bidi overrides, homoglyphs). These substitutions are visible, in-script, non-adversarial. Out of scope for those tools.
  • Chinese Spell Check (CSC) research: models that correct human typos; not packaged as a dev linter / CI gate / agent hook.

mojihen is first-in-category for this failure mode.
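
The grep and unit-test bullets can be reproduced in a few lines. This is an illustrative sketch; the source line and the LABEL constant are made up for the example.

```python
# Hypothetical source line containing the corrupted glyph (闾 instead of 閾).
source = 'LABEL = "闾値を超えました"'

# Searching for the intended word finds nothing and raises no error either.
search_hit = "閾値" in source

# A test written after the corruption landed simply pins the wrong value.
LABEL = "闾値を超えました"

def test_label():
    assert LABEL == "闾値を超えました"  # passes, even though the text is wrong

test_label()
```

The search comes back empty and the test stays green, which is exactly the false-green described above.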


Install

pip install mojihen

Python 3.9+ required. Zero runtime dependencies beyond stdlib. (tomllib is used on Python 3.11+; on older versions, config file parsing gracefully degrades to defaults if tomli is not installed.)
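
The fallback described above can be sketched roughly as follows. This is an assumption about the general pattern, not mojihen's actual loader; DEFAULTS and load_config are hypothetical names, and the default values simply mirror the keys shown under Configuration.

```python
import os

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    try:
        import tomli as tomllib  # optional backport for Python 3.9 / 3.10
    except ModuleNotFoundError:
        tomllib = None  # no TOML parser available

# Hypothetical defaults, mirroring the documented config keys.
DEFAULTS = {"fail_on": "high", "langs": ["ja", "zh", "ko"],
            "extract": "auto", "allow": [], "corpus": []}

def load_config(path):
    """Return the defaults overlaid with values from `path`, if it can be parsed."""
    config = dict(DEFAULTS)
    if tomllib is None or not os.path.exists(path):
        return config  # degrade to defaults instead of failing
    with open(path, "rb") as fh:  # tomllib.load requires a binary file object
        config.update(tomllib.load(fh))
    return config
```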


CLI usage

# Scan a file or directory
mojihen src/

# Scan with explicit options
mojihen src/ --format tty --fail-on high

# Output machine-readable JSON
mojihen src/ --format json > findings.json

# Output SARIF (for GitHub code scanning)
mojihen src/ --format sarif > mojihen.sarif

# Scan all text (bypass type-aware extraction)
mojihen src/ --all-text

# Use a custom config
mojihen src/ --config path/to/mojihen.toml

Exit codes

| Code | Meaning |
|---|---|
| 0 | No findings at or above the fail threshold |
| 1 | One or more findings at or above the fail threshold |
| 2 | Usage error, or agent hook blocked a write |
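
A wrapper script can branch on these codes. The sketch below assumes mojihen is installed and on PATH; run_mojihen and describe_exit are hypothetical helper names, and only the --fail-on flag documented above is used.

```python
import subprocess

# Meanings taken from the exit-code table above.
EXIT_MEANINGS = {
    0: "clean: no findings at or above the fail threshold",
    1: "findings at or above the fail threshold",
    2: "usage error, or agent hook blocked a write",
}

def describe_exit(code):
    """Translate a mojihen exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def run_mojihen(paths, fail_on="high"):
    """Run the CLI on `paths` and return (exit_code, description)."""
    proc = subprocess.run(
        ["mojihen", *paths, "--fail-on", fail_on],
        capture_output=True, text=True,
    )
    return proc.returncode, describe_exit(proc.returncode)
```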

pre-commit

Add to .pre-commit-config.yaml:

repos:
  - repo: https://github.com/hryoma1217/mojihen
    rev: v0.1.0
    hooks:
      - id: mojihen

This uses the bundled .pre-commit-hooks.yaml which runs mojihen --fail-on high on every staged file.


GitHub Action

# .github/workflows/mojihen.yml
name: CJK corruption check
on: [push, pull_request]

jobs:
  mojihen:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hryoma1217/mojihen@v0.1.0
        with:
          paths: src/
          fail-on: high
          format: sarif
          sarif-output: mojihen.sarif

Findings appear in the GitHub Security tab (code scanning).


Agent hook (Claude Code / Codex)

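The killer use-case: scan text as soon as the agent has written it, and bounce corrupt output back to the model immediately. (PostToolUse hooks run after the tool call completes, so the check happens right after the write rather than before it.)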

Claude Code (PostToolUse)

In .claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "mojihen hook --stdin" }
        ]
      }
    ]
  }
}

Codex

In .codex/config.toml:

[hooks]
post_write = "mojihen hook --stdin"

What happens on corruption

mojihen: BLOCKED - LLM CJK corruption detected

  src/strings.py:3:18  MH001 HIGH  '闾'  -> likely: 閾
  src/strings.py:5:12  MH001 HIGH  '耒'  -> likely: 耐

  Verify the intended CJK text and rewrite before proceeding.

The hook exits 2; the agent sees the block reason and retries with corrected text.

See hooks/claude-code.md and hooks/codex.md for full setup instructions.
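
For intuition, a hook of this kind could be structured as below. This is a sketch under assumptions, not mojihen's implementation: it assumes the agent passes a JSON payload on stdin whose tool_input carries the written text in content (Write) or new_string (Edit), and KNOWN_WRONG is a two-entry stand-in for the real corpus.

```python
import json
import sys

# Stand-in for the real corpus: wrong char -> likely intended char.
KNOWN_WRONG = {"闾": "閾", "耒": "耐"}

def check_payload(payload):
    """Return (exit_code, message) for one hook payload dict."""
    # Field names are assumptions about the agent's hook payload.
    tool_input = payload.get("tool_input", {})
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    hits = [(ch, KNOWN_WRONG[ch]) for ch in text if ch in KNOWN_WRONG]
    if not hits:
        return 0, ""
    lines = ["BLOCKED - LLM CJK corruption detected"]
    lines += [f"  '{wrong}' -> likely: {intended}" for wrong, intended in hits]
    return 2, "\n".join(lines)

def main():
    """Read one payload from stdin, report on stderr, return the exit code."""
    code, message = check_payload(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)  # assumed to be relayed to the agent
    return code
```

A real script would end with sys.exit(main()) so that the exit code 2 reaches the agent.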


Configuration

Create mojihen.toml in your project root (or [tool.mojihen] in pyproject.toml):

# mojihen.toml
fail_on = "high"              # "high" | "medium"
langs = ["ja", "zh", "ko"]
extract = "auto"              # "auto" (type-aware) | "all-text"
allow = []                    # literal strings/chars to never flag
corpus = []                   # extra corpus JSON paths

Inline suppression

Suppress findings on a specific line:

# Intentional use of the archaic character (corpus fixture)
FIXTURE = "闾"  # mojihen: ignore

# Suppress only a specific rule
FIXTURE = "闾"  # mojihen: ignore[MH001]
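
One way such comments could be recognised is with a small regular expression. The sketch below is illustrative only; SUPPRESS_RE and is_suppressed are hypothetical names, and accepting a comma-separated rule list inside the brackets is an assumption.

```python
import re

# Matches "# mojihen: ignore" and "# mojihen: ignore[MH001]".
# A comma-separated list such as ignore[MH001,MH002] is an assumed extension.
SUPPRESS_RE = re.compile(r"#\s*mojihen:\s*ignore(?:\[([A-Z0-9,\s]+)\])?")

def is_suppressed(line, rule_id):
    """True if `line` carries a suppression comment covering `rule_id`."""
    match = SUPPRESS_RE.search(line)
    if match is None:
        return False
    if match.group(1) is None:
        return True  # bare "ignore" suppresses every rule on the line
    rules = {rule.strip() for rule in match.group(1).split(",")}
    return rule_id in rules
```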

How the corpus works

src/mojihen/data/seed.json is a versioned, schema-validated list of known-wrong chars:

{
  "version": 1,
  "entries": [
    {
      "wrong": "闾",
      "intended": ["閾"],
      "lang": "ja",
      "class": "rare_drift",
      "evidence": "observed in LLM Japanese output",
      "confidence": "high"
    }
  ]
}
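
A corpus in this shape lends itself to a simple per-character lookup. The following is a minimal sketch of that idea, assuming every wrong value is a single character; INDEX and scan are hypothetical names, and the real MH001 detector may work differently.

```python
import json

SEED = json.loads("""
{
  "version": 1,
  "entries": [
    {"wrong": "闾", "intended": ["閾"], "lang": "ja",
     "class": "rare_drift", "confidence": "high"}
  ]
}
""")

# Index entries by their wrong character for constant-time lookup.
INDEX = {entry["wrong"]: entry for entry in SEED["entries"]}

def scan(text):
    """Yield (line, column, entry) for every corpus hit, both 1-based."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in INDEX:
                yield line_no, col, INDEX[ch]

for line_no, col, entry in scan('x = 1\nmsg = "闾値"'):
    print(f"{line_no}:{col}  MH001 {entry['confidence'].upper()}  "
          f"'{entry['wrong']}'  -> likely: {entry['intended'][0]}")
```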

Confidence tiers

| Tier | Meaning | CLI behaviour |
|---|---|---|
| high | Rare char; near-zero false positives | Fails CI by default |
| medium | Somewhat common; context-dependent | Warns; optionally fails |
| low | Common char; production evidence but ambiguous | Info only |

High-confidence entries are chars like 闾 (U+95FE) that are essentially absent from modern Japanese/Chinese text and almost certainly signal LLM drift. Common kanji like 感 are kept at low to avoid flooding legitimate text with false positives.
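
The relationship between tiers and the fail threshold amounts to an ordered comparison. A minimal sketch, with SEVERITY and should_fail as hypothetical names:

```python
# Ordered from least to most severe, matching the tiers above.
SEVERITY = {"low": 0, "medium": 1, "high": 2}

def should_fail(finding_confidences, fail_on="high"):
    """True if any finding is at or above the `fail_on` tier."""
    threshold = SEVERITY[fail_on]
    return any(SEVERITY[c] >= threshold for c in finding_confidences)
```

With the default threshold, a run that produces only low and medium findings still exits cleanly; passing fail_on="medium" makes medium findings fail as well.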

Contributing a new entry

  1. Confirm the wrong char is a known-bad substitution with evidence (build log, diff, screenshot).
  2. Confirm wrong != intended and both contain valid CJK.
  3. Choose "confidence": "high" only if the wrong char is rare in normal text.
  4. Add to src/mojihen/data/seed.json and run: python -m unittest discover -s tests
  5. The precision gate (test_precision.py) must still pass with zero MH001 high findings on the clean fixture sentences.
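
Steps 2 and 3 can be checked mechanically before opening a pull request. The sketch below is not the project's schema validator; has_cjk and validate_entry are hypothetical helpers, and the CJK test relies on Unicode character names from the standard library.

```python
import unicodedata

CJK_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH",
    "HIRAGANA", "KATAKANA", "HANGUL",
)

def has_cjk(text):
    """True if any character is a CJK ideograph, kana, or Hangul."""
    return any(
        unicodedata.name(ch, "").startswith(CJK_NAME_PREFIXES) for ch in text
    )

def validate_entry(entry):
    """Return a list of problems; an empty list means the entry looks sane."""
    problems = []
    wrong, intended = entry.get("wrong", ""), entry.get("intended", [])
    if not wrong or not has_cjk(wrong):
        problems.append("wrong must contain CJK")
    if not intended or not all(has_cjk(item) for item in intended):
        problems.append("every intended value must contain CJK")
    if wrong in intended:
        problems.append("wrong must differ from intended")
    if entry.get("confidence") not in {"high", "medium", "low"}:
        problems.append("confidence must be high, medium, or low")
    return problems
```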

Detectors

| ID | Name | Confidence |
|---|---|---|
| MH001 | Corpus hit | high/medium/low (per entry) |
| MH002 | Mixed-script token (Han + Latin/Cyrillic in one identifier) | medium |
| MH003 | Isolated CJK in ASCII identifier / key / URL | medium |
| MH004 | Rare/archaic codepoint (needs Unihan freq table) | deferred |
| MH005 | Decomposition garble (needs radical table) | deferred |

MH004 and MH005 are deferred in v1 — the known MH005 cases (耒耗, etc.) are already covered by individual MH001 corpus entries.
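
To illustrate the idea behind MH002, here is a rough sketch of a mixed-script check. It is an approximation for explanation only: scripts_in and mixed_script_tokens are hypothetical names, and scripts are inferred from Unicode character names rather than a proper script property.

```python
import re
import unicodedata

def scripts_in(token):
    """Return the set of coarse script labels present in `token`."""
    found = set()
    for ch in token:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("Han")
        elif name.startswith("LATIN"):
            found.add("Latin")
        elif name.startswith("CYRILLIC"):
            found.add("Cyrillic")
    return found

def mixed_script_tokens(text):
    """Yield identifier-like tokens that mix Han with Latin or Cyrillic."""
    for token in re.findall(r"\w+", text):
        scripts = scripts_in(token)
        if "Han" in scripts and scripts & {"Latin", "Cyrillic"}:
            yield token
```

A token such as user_耒ame is reported, while a pure-Han string literal next to an ASCII variable name is not, because each is a separate token.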


Escape decoding

mojihen decodes all escape forms before inspecting text, because LLMs frequently emit corrupted characters as \uXXXX escapes:

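| Form | Example | Decoded |
|---|---|---|
| \uXXXX | \u95FE | 闾 |
| \u{XXXXXX} | \u{95FE} | 闾 |
| Surrogate pair | \uD83D\uDE00 | 😀 |
| \xXX | \x41 | A |
| HTML decimal | &#38398; | 闾 |
| HTML hex | &#x95FE; | 闾 |
| Named entity | &amp; | & |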
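
A decoder covering these forms can be sketched with the standard library alone. This is an illustration of the approach, not mojihen's decoder; ESCAPE_RE and decode_escapes are hypothetical names.

```python
import html
import re

ESCAPE_RE = re.compile(
    r"\\u\{([0-9A-Fa-f]{1,6})\}"   # \u{XXXXXX}
    r"|\\u([0-9A-Fa-f]{4})"        # \uXXXX
    r"|\\x([0-9A-Fa-f]{2})"        # \xXX
)

def _replace(match):
    digits = match.group(1) or match.group(2) or match.group(3)
    code = int(digits, 16)
    # chr() rejects values above U+10FFFF; leave such escapes untouched.
    return chr(code) if code <= 0x10FFFF else match.group(0)

def decode_escapes(text):
    """Decode backslash and HTML escapes so the real characters can be checked."""
    decoded = ESCAPE_RE.sub(_replace, text)
    # Join UTF-16 surrogate pairs produced by two consecutive \uXXXX escapes;
    # "surrogatepass" keeps any unpaired surrogate instead of raising.
    decoded = decoded.encode("utf-16", "surrogatepass").decode(
        "utf-16", "surrogatepass"
    )
    return html.unescape(decoded)  # handles &#NNNN;, &#xHHHH;, and &amp;
```

Decoding first means a corpus lookup sees the same character whether the model wrote it literally or as an escape.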

Limitations and false-positive controls

  • Common kanji: Characters such as 感 (feeling), along with the everyday kanji for "end" and "person", appear in thousands of legitimate Japanese words. They are only added to the corpus at low confidence. Use --fail-on high (the default) to avoid noise.
  • Context-free: mojihen does not understand grammar or intent; it pattern-matches against a corpus. False positives in unusual text can be suppressed with allow = [...] in config or an inline # mojihen: ignore comment.
  • MH002/MH003 are medium-confidence and require --fail-on medium to fail CI. They are informational by default.
  • The clean-corpus precision gate (tests/test_precision.py) must stay green; this is the automated false-positive guard.

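日本語について (Japanese section, in English)

mojihen (文字変) is a linter that detects kanji in LLM-generated Japanese, Chinese, and Korean text that are valid Unicode codepoints but differ from the intended character.

grep and unit tests cannot catch this kind of corruption, because the wrong character is also legitimate Unicode and the tests were written against the already-corrupted value.

mojihen combines a corpus of known substitution patterns (src/mojihen/data/seed.json) with decoding of escape forms (\uXXXX, &#NNNN;, and so on), and runs in CI, as a pre-commit hook, and as an AI agent hook (PostToolUse).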


Development

git clone https://github.com/hryoma1217/mojihen
cd mojihen
pip install -e ".[dev]"
python -m unittest discover -s tests -v

License

MIT. Copyright 2026 hryoma1217.
