Japanese Morphological Analyzer and Romanization tool - Python remake of ichiran
Himotoki (紐解き)
Himotoki (紐解き, "unraveling") is a Python remake of ichiran, the comprehensive Japanese morphological analyzer. It segments Japanese text into words, provides dictionary definitions, and traces conjugation chains back to their root forms -- all powered by a portable SQLite backend.
Features
- Portable SQLite Backend -- No PostgreSQL setup required. Dictionary data lives in a single file (~3 GB) that is generated on first use.
- Dynamic-Programming Segmentation -- Uses a Viterbi-style algorithm to find the most linguistically plausible word boundaries.
- Deep Dictionary Integration -- Built on JMdict, providing glosses, part-of-speech tags, usage notes, and cross-references.
- Recursive Deconjugation -- Walks the conjugation database to trace inflected forms (passive, causative, te-form, negation, etc.) back to dictionary entries.
- Conjugation Breakdown Tree -- Displays each transformation step in a visual tree with the suffix, grammatical label, and English gloss.
- Compound Word Detection -- Recognizes suffix compounds (te-iru progressive, te-shimau completion, tai desiderative, sou appearance, etc.) and shows their internal structure.
- Scoring Engine -- Implements synergy and penalty heuristics from ichiran to resolve segmentation ambiguities.
Installation
pip install himotoki
First-Time Setup
On first run, Himotoki offers to download JMdict and build the SQLite database. The process takes approximately 10-20 minutes and requires about 3 GB of free disk space.
himotoki "日本語テキスト"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Welcome to Himotoki!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
First-time setup required. This will:
- Download JMdict dictionary data (~15MB compressed)
- Generate optimized SQLite database (~3GB)
- Store data in ~/.himotoki/
Proceed with setup? [Y/n]:
Non-interactive setup for CI environments:
himotoki setup --yes
Usage
Command Line
# Default: dictionary info with conjugation breakdown
himotoki "学校で勉強しています"
* 学校 【がっこう】
1. [n] school
* で
1. [prt] at; in
2. [prt] at; when
3. [prt] by; with
* 勉強しています 【べんきょう しています】
1. [n,vs,vt] study
2. [n,vs,vi] diligence; working hard
└─ する (makes a verb from a noun)
└─ Conjunctive (~te) (て)
└─ いる (indicates continuing action (to be ...ing))
└─ Polite (ます)
# Full output: romanization + dictionary info + conjugation tree
himotoki -f "食べられなかった"
taberarenakatta
* taberarenakatta 食べられなかった 【たべられなかった】
1. [v1,vt] to eat
2. [v1,vt] to live on (e.g. a salary); to live off; to subsist on
← 食べる 【たべる】
└─ Potential/Passive (れる): can do / is done (to)
└─ Negative (ない): not
└─ Past (~ta) (かった): did/was
# Simple romanization
himotoki -r "学校で勉強しています"
# Output: gakkou de benkyou shiteimasu
# Kana reading with spaces
himotoki -k "学校で勉強しています"
# Output: がっこう で べんきょう しています
# JSON output for programmatic use
himotoki -j "学校で勉強しています"
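For scripts that prefer the CLI over the Python API, the JSON mode can be wrapped in a few lines. This is a minimal sketch that assumes `himotoki -j` writes a single JSON document to standard output; the helper names `build_command` and `analyze_json` are illustrative and not part of Himotoki, and the shape of the decoded data is deliberately left unspecified.

```python
import json
import subprocess
from typing import Any, Callable, List


def build_command(text: str) -> List[str]:
    """Build the argument list for the JSON mode shown above."""
    return ["himotoki", "-j", text]


def analyze_json(text: str, run: Callable[..., Any] = subprocess.run) -> Any:
    """Run the CLI and decode its standard output as JSON.

    Assumption: the command prints exactly one JSON document to stdout.
    `run` is injectable so the wrapper can be tested without the CLI.
    """
    completed = run(
        build_command(text),
        capture_output=True,   # collect stdout instead of printing it
        text=True,             # decode bytes to str
        encoding="utf-8",      # avoid locale-dependent decoding of Japanese text
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return json.loads(completed.stdout)
```

The `run` parameter exists only so the wrapper can be exercised without the dictionary database; in normal use, call `analyze_json("学校で勉強しています")` and inspect the returned structure.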
Python API
import himotoki
# Optional: pre-warm caches for faster first request
himotoki.warm_up()
# Analyze Japanese text
results = himotoki.analyze("日本語を勉強しています")
for words, score in results:
    for w in words:
        print(f"{w.text} 【{w.kana}】 - {w.gloss[:50]}...")
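When only the top candidate matters, the result list can be reduced to a single segmentation. The sketch below assumes, based on the loop above, that `himotoki.analyze()` yields `(words, score)` pairs, that a higher score means a better segmentation, and that each word exposes a `kana` attribute. The helpers `best_segmentation` and `kana_line` are hypothetical names, not part of the library.

```python
from typing import Any, Iterable, List, Sequence, Tuple

# Assumed shape of one analysis result: (sequence of word objects, score).
Result = Tuple[Sequence[Any], float]


def best_segmentation(results: Iterable[Result]) -> List[Any]:
    """Return the words of the highest-scoring (words, score) pair."""
    candidates = list(results)
    if not candidates:
        return []
    words, _score = max(candidates, key=lambda pair: pair[1])
    return list(words)


def kana_line(words: Iterable[Any]) -> str:
    """Join the kana readings of the given words with spaces."""
    return " ".join(w.kana for w in words)
```

With Himotoki installed, `kana_line(best_segmentation(himotoki.analyze("日本語を勉強しています")))` would produce a spaced kana reading similar to the `-k` flag, provided the assumptions above hold.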
Conjugation Breakdown
Himotoki traces conjugated words through every transformation step, showing the root form and each inflection applied:
$ himotoki -f "書かせられていた"
kakaserareteita
* kakaserareteita 書かせられていた 【かかせられていた】
1. [v5k,vt] to write; to compose; to pen
2. [v5k,vt] to draw; to paint
← 書く 【かく】
└─ Causative (かせ): makes do
└─ Passive (られる): is done (to)
└─ Conjunctive (~te, progressive) (て)
└─ Past (~ta) (た): did/was
A deeply nested chain parsed into its constituent parts:
$ himotoki -f "飲んでしまいたかった"
nondeshimaitakatta
* nondeshimaitakatta 飲んでしまいたかった 【のんでしまいたかった】
1. [v5m,vt] to drink; to swallow; to take (medicine)
← 飲む 【のむ】
└─ Conjunctive (~te) (んで)
└─ しまう (indicates completion / to do something by accident or regret)
└─ Continuative (~i) (い): and (stem)
└─ たい (want to... / would like to...)
└─ Past (~ta) (かった): did/was
Full sentence analysis with per-word dictionary entries and conjugation trees:
$ himotoki "学校で勉強しています"
* 学校 【がっこう】
1. [n] school
* で
1. [prt] at; in
2. [prt] at; when
3. [prt] by; with
* 勉強しています 【べんきょう しています】
1. [n,vs,vt] study
2. [n,vs,vi] diligence; working hard
← 勉強する 【べんきょうする】
└─ Conjunctive (~te, progressive) (て)
└─ Polite (ます)
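The idea behind these trees can be illustrated with a toy lookup table. This is only a sketch: the dictionary `TOY_VIA_LINKS` stands in for the conjugation database, the intermediate forms were written by hand for this example, and `trace_to_root` and `format_tree` are hypothetical helpers whose indentation does not claim to match Himotoki's actual output.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the conjugation database: each conjugated form
# points at the form it was derived from, plus the label of that step.
TOY_VIA_LINKS: Dict[str, Tuple[str, str]] = {
    "食べられなかった": ("食べられない", "Past (~ta)"),
    "食べられない": ("食べられる", "Negative"),
    "食べられる": ("食べる", "Potential/Passive"),
}


def trace_to_root(
    surface: str, links: Dict[str, Tuple[str, str]] = TOY_VIA_LINKS
) -> Tuple[str, List[str]]:
    """Follow parent links from a surface form to its dictionary form.

    Returns the root and the step labels ordered from the root outward.
    """
    steps: List[str] = []
    current = surface
    seen = {surface}
    while current in links:
        parent, label = links[current]
        if parent in seen:  # guard against cyclic data
            raise ValueError(f"cycle detected at {parent!r}")
        seen.add(parent)
        steps.append(label)
        current = parent
    steps.reverse()  # collected surface-first; report root-first
    return current, steps


def format_tree(root: str, steps: List[str]) -> str:
    """Render the chain as an indented tree, one level per step."""
    lines = [f"← {root}"]
    for depth, label in enumerate(steps):
        lines.append("  " * depth + f"└─ {label}")
    return "\n".join(lines)
```

Calling `trace_to_root("食べられなかった")` walks three links back to 食べる and returns the labels in the same root-first order used by the trees above.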
Suffix Compounds
Himotoki recognizes productive suffix patterns and merges them into compound words with grammatical labels:
| Category | Suffixes | Example |
|---|---|---|
| Progressive / completive | ている, てある, てしまう, ておく | 食べている → "is eating" |
| Giving / receiving | てくれる, てもらう, てあげる, てやる | 読んであげる → "read for someone" |
| Desire / attempt | たい, てほしい, てみる | 食べたい → "want to eat" |
| Appearance / degree | そう, すぎる, っぽい, らしい | 高すぎる → "too expensive" |
| Compound verbs | 出す, 切る, 合う, 込む, 始める, 終わる | 食べ始める → "start eating" |
| Nominalization | さ, み, 方 | 深み → "depth" |
| Contractions | ちゃう, じゃう, とく, なきゃ | 食べちゃった → "ended up eating" |
| Polite / formal | です, でしょう, ください, いたす | 食べてください → "please eat" |
| Na-adjective suffixes | すぎる, っぽい, み, そう | 静かすぎる → "too quiet" |
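As a rough illustration of the merging step, the following sketch joins a te-form segment with a following auxiliary and keeps the original pieces as components. It covers only two auxiliaries from the table, and the `Segment` class, the `TE_FORM_AUXILIARIES` mapping, and `merge_te_compounds` are assumptions made for this example rather than Himotoki's own data structures.

```python
from dataclasses import dataclass, field
from typing import List

# Two te-form auxiliaries from the table above, with the labels used in the
# feature list; the real pattern set is much larger.
TE_FORM_AUXILIARIES = {"いる": "progressive", "しまう": "completion"}


@dataclass
class Segment:
    text: str
    label: str = ""
    components: List["Segment"] = field(default_factory=list)


def merge_te_compounds(segments: List[Segment]) -> List[Segment]:
    """Merge '<te-form> + <auxiliary>' pairs into one compound segment."""
    merged: List[Segment] = []
    for seg in segments:
        prev = merged[-1] if merged else None
        is_te_form = (
            prev is not None
            and len(prev.text) > 1  # skip the bare particles て / で
            and prev.text.endswith(("て", "で"))
        )
        if is_te_form and seg.text in TE_FORM_AUXILIARIES:
            parts = prev.components or [prev]  # keep earlier components flat
            merged[-1] = Segment(
                text=prev.text + seg.text,
                label=TE_FORM_AUXILIARIES[seg.text],
                components=parts + [seg],
            )
        else:
            merged.append(seg)
    return merged
```

For example, the segments 食べて and いる become a single 食べている segment labeled "progressive" with two components, while 学校 followed by the particle で is left untouched.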
How It Works
Himotoki processes Japanese text through three stages:
- Segmentation -- A dynamic-programming algorithm considers all possible word boundaries and selects the highest-scoring path. Scoring uses dictionary frequency data, part-of-speech synergies, and penalty heuristics ported from ichiran.
- Suffix Compound Assembly -- Adjacent segments are checked against known suffix patterns (te-iru, te-shimau, tai, sou, etc.). Matching segments are merged into compound WordInfo objects with preserved component structure.
- Conjugation Chain Resolution -- For each conjugated word, the system queries the conjugation database to walk the via chain from the surface form back to the dictionary entry. Each step records the conjugation type, suffix text, and English gloss, then formats the result as an indented tree.
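The segmentation stage can be pictured with a small dynamic-programming sketch. It is not Himotoki's implementation: the lexicon, its scores, `UNKNOWN_PENALTY`, and `MAX_WORD_LEN` are invented for illustration, and the real scorer also applies synergies and penalties between neighboring words, which this toy version omits.

```python
from typing import Dict, List, Tuple

# Toy lexicon with made-up scores; the real scoring draws on dictionary data.
TOY_LEXICON: Dict[str, float] = {
    "学校": 10.0, "学": 2.0, "校": 2.0,
    "で": 3.0,
    "勉強": 10.0, "勉": 1.0, "強": 2.0,
}
UNKNOWN_PENALTY = -5.0  # hypothetical cost of a character missing from the lexicon
MAX_WORD_LEN = 4        # hypothetical cap on candidate word length


def segment(
    text: str, lexicon: Dict[str, float] = TOY_LEXICON
) -> Tuple[List[str], float]:
    """Return the highest-scoring segmentation of `text` and its total score."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last word on that path)
    best: List[Tuple[float, int]] = [(float("-inf"), -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = text[start:end]
            if piece in lexicon:
                word_score = lexicon[piece]
            elif end - start == 1:
                word_score = UNKNOWN_PENALTY  # fall back to a single unknown character
            else:
                continue  # multi-character pieces must be known words
            candidate = best[start][0] + word_score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    words.reverse()
    return words, best[n][0]
```

Because every single character is always a valid fallback, each position is reachable and the back-pointer walk terminates. With the toy lexicon, `segment("学校で勉強")` prefers the three-word split 学校 / で / 勉強 over any character-by-character path.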
Project Structure
himotoki/
    segment.py            # Viterbi-style segmentation engine
    lookup.py             # Dictionary lookup, scoring, conjugation data
    output.py             # WordInfo, conjugation tree, JSON/text formatting
    suffixes.py           # Suffix compound detection (te-iru, tai, etc.)
    synergies.py          # Part-of-speech synergy and penalty rules
    conjugation_hints.py  # Supplementary conjugation patterns
    constants.py          # Conjugation type IDs, POS tags, glosses
    characters.py         # Kana/kanji conversion, romanization
    counters.py           # Japanese counter expression handling
    cli.py                # Command-line interface
    db/                   # SQLAlchemy models and connection management
    loading/              # JMDict XML parsing and database generation
scripts/
    llm_eval.py           # LLM-based accuracy evaluation (510 sentences)
    check_segments.py     # Quick segmentation change checker
    llm_report.py         # HTML report generator
tests/                    # 433 tests (pytest + hypothesis)
data/                     # Dictionary data, evaluation datasets
Development
Setup
git clone https://github.com/msr2903/himotoki.git
cd himotoki
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -x --tb=short
Testing
# Run all tests
pytest tests/ -x --tb=short
# Run conjugation tree tests only
pytest tests/test_conjugation_tree.py -v
# Run with coverage
pytest tests/ --cov=himotoki --cov-report=term-missing
LLM Accuracy Evaluation
The project includes an LLM-based evaluation system that scores segmentation accuracy against 510 curated Japanese sentences:
python scripts/llm_eval.py --quick # 50-sentence subset
python scripts/llm_eval.py # Full evaluation
python scripts/llm_eval.py --rescore 5 # Re-evaluate entry #5
python scripts/llm_report.py # Generate HTML report
Current accuracy: 510/510 (100%) on the v3 evaluation prompt.
License
Distributed under the MIT License. See LICENSE for details.
Acknowledgments