LLM-generated CJK corruption linter — catches valid-but-wrong kanji/hanzi that grep and tests miss

mojihen

LLM-generated CJK corruption linter. Catches valid-but-wrong kanji, hanzi, and hangul that language models emit silently: the class of bug that grep, unit tests, and existing Unicode safety tools all pass as a false green.

demo/sample.py:20:1  MH001 HIGH  '闾'  -> likely: 閾
  '闾' is a known LLM corruption (likely intended: 閾)  [rare_drift]
demo/sample.py:23:1  MH001 HIGH  '耒'  -> likely: 耐
  '耒' is a known LLM corruption (likely intended: 耐)  [decomposition]

The problem

When an LLM writes Japanese, Chinese, or Korean copy, it does not corrupt bytes — it substitutes a real, valid character that looks or sounds close to the intended one. The wrong glyph is itself a legitimate Unicode codepoint.
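A minimal sketch of why this is invisible to encoding checks and to substring search, using the 閾/闾 pair from this README. The sample sentence is made up for illustration.

```python
import unicodedata

intended = "\u95be"  # 閾 ("threshold")
emitted = "\u95fe"   # 闾, the character the LLM substituted

def describe(ch: str) -> str:
    """Return the codepoint and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Both characters are assigned codepoints that round-trip through UTF-8,
# so no encoding check can tell the wrong one from legitimate text.
for ch in (intended, emitted):
    assert ch.encode("utf-8").decode("utf-8") == ch
    print(ch, describe(ch))

# Illustrative corrupted sentence: the first character should have been 閾.
corrupted_text = emitted + "値を超えました"

# Searching for the intended word finds nothing, which is why grep is silent.
found = (intended + "値") in corrupted_text
print(found)
```

The search result is `False` even though the text looks almost right to a reader, so the only reliable signal is knowing that the emitted character is suspicious in itself.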

Six observed cases (LLM-generated Japanese)

Intended	LLM emitted	Class	Why it hid
閾 (threshold)	闾 (village gate, U+95FE)	rare drift	閾 (U+95BE) is uncommon; LLM drifted to a similar-looking codepoint
耐 (endure)	耒 (plow radical, U+8012)	decomposition	耐→耒耗 radical fragment; 耒 alone near-absent in modern JA
滞 (stagnation)	滹 (river name, U+6EF9)	radical	Radical visual confusion
事 (matter)	亊 (rare variant, U+4E8A)	rare variant	U+4E8A vs U+4E8B: adjacent codepoints, near-identical glyphs
愛 (love)	感 (feeling)	visual/semantic	Both common; low-confidence in corpus (see below)
敏 (nimble)	敢 (bold)	shape	Stroke near-miss; low-confidence

Why existing tools miss it

grep / ripgrep: searches for the intended string; the wrong glyph simply does not match. Silent.
Unit tests: assertions were written against the already-corrupted value. They pass. This actually happened.
Unicode safety linters (bidichk, anti-trojan-source, unicode-safety-check): target adversarial Unicode (invisible chars, bidi overrides, homoglyphs). The substitutions described here are visible, in-script, and non-adversarial, so they are out of scope for those tools.
Chinese Spell Check (CSC) research: models that correct human typos; not packaged as a dev linter / CI gate / agent hook.

mojihen is first-in-category for this failure mode.

Install

pip install mojihen

Python 3.9+ required. Zero runtime dependencies beyond stdlib. (tomllib is used on Python 3.11+; on older versions, config file parsing gracefully degrades to defaults if tomli is not installed.)
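The fallback described above can be sketched roughly as follows. This is not mojihen's actual loader: `load_config` and `DEFAULTS` are hypothetical names, and the default values are copied from the keys shown in the Configuration section.

```python
from pathlib import Path

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    try:
        import tomli as tomllib  # optional backport for Python 3.9/3.10
    except ModuleNotFoundError:
        tomllib = None  # no TOML parser available

# Hypothetical defaults mirroring the documented config keys.
DEFAULTS = {
    "fail_on": "high",
    "langs": ["ja", "zh", "ko"],
    "extract": "auto",
    "allow": [],
    "corpus": [],
}

def load_config(path):
    """Return the defaults, overlaid with the TOML file when it can be parsed."""
    config = dict(DEFAULTS)
    path = Path(path)
    if tomllib is None or not path.is_file():
        return config  # degrade to defaults instead of failing
    with path.open("rb") as fh:  # tomllib requires a binary file handle
        config.update(tomllib.load(fh))
    return config
```

With this shape, a missing parser and a missing file behave the same way: the caller always gets a complete config dictionary.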

CLI usage

# Scan a file or directory
mojihen src/

# Scan with explicit options
mojihen src/ --format tty --fail-on high

# Output machine-readable JSON
mojihen src/ --format json > findings.json

# Output SARIF (for GitHub code scanning)
mojihen src/ --format sarif > mojihen.sarif

# Scan all text (bypass type-aware extraction)
mojihen src/ --all-text

# Use a custom config
mojihen src/ --config path/to/mojihen.toml

Exit codes

Code	Meaning
0	No findings at or above the fail threshold
1	One or more findings at or above the fail threshold
2	Usage error, or agent hook blocked a write
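As a rough illustration of how codes 0 and 1 relate to the fail threshold, the sketch below ranks severities and compares them against `fail_on`. The function name and the ranking table are assumptions for illustration, not mojihen internals; code 2 is reserved for usage errors and hook blocks and is not modelled here.

```python
# Assumed ordering of the three confidence tiers, lowest to highest.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

def exit_code(finding_severities, fail_on="high"):
    """Return 1 if any finding is at or above the threshold, else 0."""
    threshold = SEVERITY_RANK[fail_on]
    failed = any(SEVERITY_RANK[s] >= threshold for s in finding_severities)
    return 1 if failed else 0
```

Under this model a run with only medium findings exits 0 with `--fail-on high` and exits 1 with `--fail-on medium`.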

pre-commit

Add to .pre-commit-config.yaml:

repos:
  - repo: https://github.com/hryoma1217/mojihen
    rev: v0.1.0
    hooks:
      - id: mojihen

This uses the bundled .pre-commit-hooks.yaml, which runs mojihen --fail-on high on every staged file.

GitHub Action

# .github/workflows/mojihen.yml
name: CJK corruption check
on: [push, pull_request]

jobs:
  mojihen:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hryoma1217/mojihen@v0.1.0
        with:
          paths: src/
          fail-on: high
          format: sarif
          sarif-output: mojihen.sarif

Findings appear in the GitHub Security tab (code scanning).
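For orientation, a SARIF 2.1.0 log that code scanning can ingest has roughly the shape built below. This is a sketch, not mojihen's real serializer: `to_sarif`, the finding dictionary keys, and the confidence-to-level mapping are all assumptions.

```python
# Assumed mapping from mojihen confidence tiers to SARIF result levels.
LEVELS = {"high": "error", "medium": "warning", "low": "note"}

def to_sarif(findings):
    """Convert a list of finding dicts into a minimal SARIF 2.1.0 log."""
    results = []
    for f in findings:
        results.append({
            "ruleId": f["rule"],
            "level": LEVELS[f["confidence"]],
            "message": {"text": f["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["path"]},
                    "region": {
                        "startLine": f["line"],
                        "startColumn": f["column"],
                    },
                }
            }],
        })
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "mojihen"}}, "results": results}],
    }
```

Passing the result to `json.dumps` gives a file in the same format that `--format sarif` is documented to produce, though the real output may carry more fields.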

Agent hook (Claude Code / Codex)

The key use case: scan text as soon as the agent writes it, and bounce corrupt output back to the model immediately.

Claude Code (PostToolUse)

In .claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "mojihen hook --stdin" }
        ]
      }
    ]
  }
}

Codex

In .codex/config.toml:

[hooks]
post_write = "mojihen hook --stdin"

What happens on corruption

mojihen: BLOCKED - LLM CJK corruption detected

  src/strings.py:3:18  MH001 HIGH  '闾'  -> likely: 閾
  src/strings.py:5:12  MH001 HIGH  '耒'  -> likely: 耐

  Verify the intended CJK text and rewrite before proceeding.

The hook exits 2; the agent sees the block reason and retries with corrected text.
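The flow above can be approximated with the sketch below. It is not the real `mojihen hook` implementation: the two-entry `KNOWN_WRONG` table stands in for the corpus, and the payload field names (`tool_input`, `file_path`, `content`, `new_string`) are assumptions about what the agent sends on stdin.

```python
import json

# Stand-in for the corpus: wrong character -> likely intended character.
KNOWN_WRONG = {"\u95fe": "\u95be", "\u8012": "\u8010"}  # 闾 -> 閾, 耒 -> 耐

def check_payload(payload_json):
    """Return (exit_code, report) for one hook payload given as a JSON string."""
    payload = json.loads(payload_json)
    tool_input = payload.get("tool_input", {})
    # Assumed field names: "content" for a full write, "new_string" for an edit.
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    path = tool_input.get("file_path", "<unknown>")
    lines = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in KNOWN_WRONG:
                lines.append(
                    f"  {path}:{line_no}:{col}  MH001 HIGH  '{ch}'"
                    f"  -> likely: {KNOWN_WRONG[ch]}"
                )
    if not lines:
        return 0, ""
    header = ["mojihen: BLOCKED - LLM CJK corruption detected", ""]
    return 2, "\n".join(header + lines)
```

A wrapper script would read the payload from `sys.stdin`, write the report to stderr, and exit with the returned code. Note that for an edit, the line numbers here are relative to the new text, not to the file.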

See hooks/claude-code.md and hooks/codex.md for full setup instructions.

Configuration

Create mojihen.toml in your project root (or [tool.mojihen] in pyproject.toml):

# mojihen.toml
fail_on = "high"              # "high" | "medium"
langs = ["ja", "zh", "ko"]
extract = "auto"              # "auto" (type-aware) | "all-text"
allow = []                    # literal strings/chars to never flag
corpus = []                   # extra corpus JSON paths

Inline suppression

Suppress findings on a specific line:

# Intentional use of the archaic character (corpus fixture)
FIXTURE = "闾"  # mojihen: ignore

# Suppress only a specific rule
FIXTURE = "闾"  # mojihen: ignore[MH001]
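One plausible way to recognise these comments is a small regular expression, sketched below. The pattern and the `is_suppressed` helper are illustrative assumptions; in particular, the comma-separated multi-rule form is not documented above.

```python
import re

# Matches "# mojihen: ignore" and "# mojihen: ignore[MH001]".
# The comma-separated form "ignore[MH001,MH002]" is an assumption.
SUPPRESS_RE = re.compile(r"#\s*mojihen:\s*ignore(?:\[([A-Z0-9,\s]+)\])?")

def is_suppressed(line, rule_id):
    """Return True if the line carries a suppression that covers rule_id."""
    match = SUPPRESS_RE.search(line)
    if match is None:
        return False
    if match.group(1) is None:
        return True  # a bare "ignore" suppresses every rule on the line
    rules = {r.strip() for r in match.group(1).split(",")}
    return rule_id in rules
```

A rule-scoped suppression keeps other detectors active on the same line, which matters when a fixture line also trips MH002 or MH003.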

How the corpus works

src/mojihen/data/seed.json is a versioned, schema-validated list of known-wrong chars:

{
  "version": 1,
  "entries": [
    {
      "wrong": "闾",
      "intended": ["閾"],
      "lang": "ja",
      "class": "rare_drift",
      "evidence": "observed in LLM Japanese output",
      "confidence": "high"
    }
  ]
}

Confidence tiers

Tier	Meaning	CLI behaviour
`high`	Rare char; near-zero false positives	Fails CI by default
`medium`	Somewhat common; context-dependent	Warns; optionally fails
`low`	Common char; production evidence but ambiguous	Info only

High-confidence entries are chars like 闾 (U+95FE), a simplified Chinese form that is essentially absent from modern Japanese text and almost certainly signals LLM drift when it appears there. Common kanji like 感 are kept at low to avoid flooding legitimate text with false positives.

Contributing a new entry

Confirm the wrong char is a known-bad substitution with evidence (build log, diff, screenshot).
Confirm wrong != intended and both contain valid CJK.
Choose "confidence": "high" only if the wrong char is rare in normal text.
Add to src/mojihen/data/seed.json and run: python -m unittest discover -s tests
The precision gate (test_precision.py) must still pass with zero MH001 high findings on the clean fixture sentences.
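Before running the test suite, the first checks in the list above can be approximated locally with a sketch like this one. `validate_entry` is a hypothetical helper, not part of mojihen, and the character ranges are a rough approximation of "valid CJK" (main Han block, Extension A, Hangul syllables).

```python
import re

# Rough approximation of "contains CJK"; the real schema check may differ.
CJK_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7a3]")
CONFIDENCE = {"high", "medium", "low"}

def validate_entry(entry):
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    wrong = entry.get("wrong", "")
    intended = entry.get("intended", [])
    if not CJK_RE.search(wrong):
        problems.append("wrong must contain a CJK character")
    if not intended:
        problems.append("intended must list at least one candidate")
    for candidate in intended:
        if candidate == wrong:
            problems.append("wrong and intended must differ")
        if not CJK_RE.search(candidate):
            problems.append("intended must contain a CJK character")
    if entry.get("confidence") not in CONFIDENCE:
        problems.append("confidence must be high, medium, or low")
    return problems
```

This does not replace the precision gate: whether a character is rare enough for "high" still has to be judged against real text.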

Detectors

ID	Name	Confidence
MH001	Corpus hit	high/medium/low (per entry)
MH002	Mixed-script token (Han + Latin/Cyrillic in one identifier)	medium
MH003	Isolated CJK in ASCII identifier / key / URL	medium
MH004	Rare/archaic codepoint (needs Unihan freq table)	deferred
MH005	Decomposition garble (needs radical table)	deferred

MH004 and MH005 are deferred in v1 — the known MH005 cases (耒耗, etc.) are already covered by individual MH001 corpus entries.
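To give a feel for MH002, here is a sketch of a mixed-script check based on Unicode character names. It is an assumption about how such a detector could work, not mojihen's code, and it treats every `\w+` run as a token, so it would need type-aware extraction in front of it to restrict the check to identifiers.

```python
import re
import unicodedata

TOKEN_RE = re.compile(r"\w+")

def scripts_in(token):
    """Return the set of scripts (Han, Latin, Cyrillic) present in a token."""
    scripts = set()
    for ch in token:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            scripts.add("Han")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
        elif name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
    return scripts

def mixed_script_tokens(text):
    """Return tokens that mix Han with Latin or Cyrillic characters."""
    hits = []
    for token in TOKEN_RE.findall(text):
        found = scripts_in(token)
        if "Han" in found and len(found) > 1:
            hits.append(token)
    return hits
```

Ordinary Japanese prose often places Latin words directly next to kanji, which is one reason a check like this is better limited to identifiers and kept at medium confidence.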

Escape decoding

mojihen decodes all escape forms before inspecting text, because LLMs frequently emit corrupted characters as \uXXXX escapes:

Form	Example	Decoded
`\uXXXX`	`\u95FE`	闾
`\u{XXXXXX}`	`\u{95FE}`	闾
Surrogate pair	`\uD83D\uDE00`	😀
`\xXX`	`\x41`	A
HTML decimal	`&#38398;`	闾
HTML hex	`&#x95FE;`	闾
Named entity	`&amp;`	&
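The forms in the table can be decoded with the standard library alone. The sketch below is one way to do it under stated assumptions: `decode_escapes` is a hypothetical name, and the real decoder may order the steps or handle malformed input differently.

```python
import html
import re

ESCAPE_RE = re.compile(
    r"\\u\{([0-9A-Fa-f]{1,6})\}"  # \u{XXXXXX}
    r"|\\u([0-9A-Fa-f]{4})"       # \uXXXX
    r"|\\x([0-9A-Fa-f]{2})"       # \xXX
)

def _replace(match):
    """Turn one matched escape into its character, leaving invalid ones alone."""
    digits = next(g for g in match.groups() if g is not None)
    value = int(digits, 16)
    return chr(value) if value <= 0x10FFFF else match.group(0)

def decode_escapes(text):
    """Decode backslash escapes, join surrogate pairs, then HTML entities."""
    text = ESCAPE_RE.sub(_replace, text)
    # Paired \uD83D\uDE00 escapes leave two surrogate halves in the string;
    # a UTF-16 round trip joins them into one character. An unpaired half
    # becomes U+FFFD instead of raising.
    text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
    return html.unescape(text)
```

Decoding before inspection means the corpus lookup only ever has to deal with real characters, whichever form the model used to write them.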

Limitations and false-positive controls

Common kanji: Characters like 感 (feeling), 末 (end), 士 (person) appear in thousands of legitimate Japanese words. They are only added to the corpus at low confidence. Use --fail-on high (the default) to avoid noise.
Context-free: mojihen does not understand grammar or intent; it pattern-matches against a corpus. False positives in unusual text can be suppressed with allow = [...] in config or inline # mojihen: ignore.
MH002/MH003 are medium-confidence and require --fail-on medium to fail CI. They are informational by default.
The clean-corpus precision gate (tests/test_precision.py) must stay green; this is the automated false-positive guard.

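About Japanese (Japanese section)

mojihen (文字変) is a linter that detects kanji that are valid Unicode codepoints but differ from what was intended, in Japanese, Chinese, and Korean text generated by LLMs.

grep and unit tests cannot detect this kind of corruption, because the wrong character is also legitimate Unicode and the tests were written against the already-corrupted value.

mojihen combines a corpus of known substitution patterns (src/mojihen/data/seed.json) with decoding of escape forms (\uXXXX, &#NNNN;, etc.), and runs as a CI check, a pre-commit hook, or an AI agent hook (PostToolUse).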

Development

git clone https://github.com/hryoma1217/mojihen
cd mojihen
pip install -e ".[dev]"
python -m unittest discover -s tests -v

License

This version

0.1.0

Jun 27, 2026
