Japanese Morphological Analyzer and Romanization tool - Python remake of ichiran

Himotoki (紐解き)

Himotoki (紐解き, "unraveling") is a Python remake of ichiran, the comprehensive Japanese morphological analyzer. It segments Japanese text into words, provides dictionary definitions, and traces conjugation chains back to their root forms -- all powered by a portable SQLite backend.


Features

  • Portable SQLite Backend -- No PostgreSQL setup required. Dictionary data lives in a single file (~3 GB) that is generated on first use.
  • Dynamic-Programming Segmentation -- Uses a Viterbi-style algorithm to find the most linguistically plausible word boundaries.
  • Deep Dictionary Integration -- Built on JMdict, providing glosses, part-of-speech tags, usage notes, and cross-references.
  • Recursive Deconjugation -- Walks the conjugation database to trace inflected forms (passive, causative, te-form, negation, etc.) back to dictionary entries.
  • Conjugation Breakdown Tree -- Displays each transformation step in a visual tree with the suffix, grammatical label, and English gloss.
  • Compound Word Detection -- Recognizes suffix compounds (te-iru progressive, te-shimau completion, tai desiderative, sou appearance, etc.) and shows their internal structure.
  • Scoring Engine -- Implements synergy and penalty heuristics from ichiran to resolve segmentation ambiguities.

Installation

pip install himotoki

First-Time Setup

On first run, Himotoki offers to download JMdict and build the SQLite database. The build takes approximately 10-20 minutes and requires about 3 GB of free disk space.

himotoki "日本語テキスト"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Welcome to Himotoki!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

First-time setup required. This will:
  - Download JMdict dictionary data (~15MB compressed)
  - Generate optimized SQLite database (~3GB)
  - Store data in ~/.himotoki/

Proceed with setup? [Y/n]:

Non-interactive setup for CI environments:

himotoki setup --yes
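
Because the generated database needs about 3 GB, a CI job may want to check free disk space before running the setup command. The following is a minimal sketch using only the standard library; the helper name `has_enough_space` and the 3 GiB threshold are illustrative assumptions, not part of Himotoki.

```python
import shutil
from pathlib import Path

# Assumption: ~3 GB, taken from the setup prompt shown above.
REQUIRED_BYTES = 3 * 1024**3


def has_enough_space(path: Path, required: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `required` bytes free."""
    # ~/.himotoki may not exist yet, so probe the nearest existing ancestor.
    probe = path
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free >= required


data_dir = Path.home() / ".himotoki"
if has_enough_space(data_dir):
    print("Enough free space; run: himotoki setup --yes")
else:
    print("Less than ~3 GB free; setup would likely fail")
```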

Usage

Command Line

# Default: dictionary info with conjugation breakdown
himotoki "学校で勉強しています"
* 学校 【がっこう】
1. [n] school

* で
1. [prt] at; in
2. [prt] at; when
3. [prt] by; with

* 勉強しています 【べんきょう しています】
1. [n,vs,vt] study
2. [n,vs,vi] diligence; working hard
  └─ する (makes a verb from a noun)
       └─ Conjunctive (~te) (て)
            └─ いる (indicates continuing action (to be ...ing))
                 └─ Polite (ます)

# Full output: romanization + dictionary info + conjugation tree
himotoki -f "食べられなかった"
taberarenakatta

* taberarenakatta  食べられなかった 【たべられなかった】
1. [v1,vt] to eat
2. [v1,vt] to live on (e.g. a salary); to live off; to subsist on
  ← 食べる 【たべる】
  └─ Potential/Passive (れる): can do / is done (to)
       └─ Negative (ない): not
            └─ Past (~ta) (かった): did/was

# Simple romanization
himotoki -r "学校で勉強しています"
# Output: gakkou de benkyou shiteimasu

# Kana reading with spaces
himotoki -k "学校で勉強しています"
# Output: がっこう で べんきょう しています

# JSON output for programmatic use
himotoki -j "学校で勉強しています"
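
The JSON mode can also be driven from a script. The sketch below shells out to the `himotoki -j` command shown above and parses whatever JSON it prints. The wrapper name `analyze_json` and its injectable `run` parameter are hypothetical, and the structure of the returned JSON is not described in this document, so the result is returned as-is.

```python
import json
import subprocess
from typing import Callable, Optional, Sequence


def _run_cli(cmd: Sequence[str]) -> str:
    """Run a command and return its standard output decoded as UTF-8."""
    done = subprocess.run(cmd, capture_output=True, encoding="utf-8", check=True)
    return done.stdout


def analyze_json(text: str, run: Optional[Callable[[Sequence[str]], str]] = None):
    """Call `himotoki -j TEXT` and return the parsed JSON document.

    `run` exists only so the wrapper can be exercised without the CLI installed.
    """
    runner = run if run is not None else _run_cli
    return json.loads(runner(["himotoki", "-j", text]))


# With himotoki installed and set up:
# data = analyze_json("学校で勉強しています")
```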

Python API

import himotoki

# Optional: pre-warm caches for faster first request
himotoki.warm_up()

# Analyze Japanese text
results = himotoki.analyze("日本語を勉強しています")

for words, score in results:
    for w in words:
        print(f"{w.text} 【{w.kana}】 - {w.gloss[:50]}...")
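
When only the top candidate matters, the pairs can be reduced to a single word list. This sketch assumes, as the loop above suggests, that `analyze` yields `(words, score)` pairs, that a higher score is better, and that `w.kana` is a string. The helper names and the stand-in data (including the scores) are made up so the example runs without the dictionary database.

```python
from types import SimpleNamespace


def best_segmentation(results):
    """Return the word list of the highest-scoring (words, score) pair."""
    words, _score = max(results, key=lambda pair: pair[1])
    return words


def kana_line(words):
    """Join the kana readings of the words with spaces."""
    return " ".join(w.kana for w in words)


# Stand-in data shaped like the (words, score) pairs in the loop above.
demo = [
    ([SimpleNamespace(text="日本語", kana="にほんご")], 120),
    ([SimpleNamespace(text="日本", kana="にほん"),
      SimpleNamespace(text="語", kana="ご")], 80),
]
print(kana_line(best_segmentation(demo)))  # prints: にほんご
```

With the real library, `best_segmentation(himotoki.analyze(text))` would be the equivalent call under the same assumptions.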

Conjugation Breakdown

Himotoki traces conjugated words through every transformation step, showing the root form and each inflection applied:

$ himotoki -f "書かせられていた"

kakaserareteita

* kakaserareteita  書かせられていた 【かかせられていた】
1. [v5k,vt] to write; to compose; to pen
2. [v5k,vt] to draw; to paint
  ← 書く 【かく】
  └─ Causative (かせ): makes do
       └─ Passive (られる): is done (to)
            └─ Conjunctive (~te, progressive) (て)
                 └─ Past (~ta) (た): did/was

A deeply nested chain parsed into its constituent parts:

$ himotoki -f "飲んでしまいたかった"

nondeshimaitakatta

* nondeshimaitakatta  飲んでしまいたかった 【のんでしまいたかった】
1. [v5m,vt] to drink; to swallow; to take (medicine)
  ← 飲む 【のむ】
  └─ Conjunctive (~te) (んで)
       └─ しまう (indicates completion / to do something by accident or regret)
            └─ Continuative (~i) (い): and (stem)
                 └─ たい (want to... / would like to...)
                      └─ Past (~ta) (かった): did/was
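
The idea of walking from a surface form back to its dictionary entry can be illustrated with a toy SQLite table. Everything in this sketch is hypothetical: the `conj` table, its columns, and the three rows are invented for the example and do not reflect Himotoki's actual schema. It only shows the general technique of following a parent pointer until no further row is found.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table mapping each form to the form it derives from."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE conj (form TEXT PRIMARY KEY, via TEXT NOT NULL, label TEXT NOT NULL)"
    )
    con.executemany(
        "INSERT INTO conj VALUES (?, ?, ?)",
        [
            ("食べられる", "食べる", "Potential/Passive"),
            ("食べられない", "食べられる", "Negative"),
            ("食べられなかった", "食べられない", "Past (~ta)"),
        ],
    )
    return con


def trace(con: sqlite3.Connection, surface: str):
    """Follow `via` links from `surface`; return (root, labels ordered root-first)."""
    labels = []
    form = surface
    seen = set()  # guards against accidental cycles in the table
    while form not in seen:
        seen.add(form)
        row = con.execute(
            "SELECT via, label FROM conj WHERE form = ?", (form,)
        ).fetchone()
        if row is None:
            break  # no parent: treat `form` as the dictionary entry
        form = row[0]
        labels.append(row[1])
    labels.reverse()
    return form, labels


root, steps = trace(build_demo_db(), "食べられなかった")
print(root, steps)
```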

Full sentence analysis with per-word dictionary entries and conjugation trees:

$ himotoki "学校で勉強しています"

* 学校 【がっこう】
1. [n] school

* で
1. [prt] at; in
2. [prt] at; when
3. [prt] by; with

* 勉強しています 【べんきょう しています】
1. [n,vs,vt] study
2. [n,vs,vi] diligence; working hard
  ← 勉強する 【べんきょうする】
  └─ Conjunctive (~te, progressive) (て)
    └─ Polite (ます)
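
For readers who want to reproduce the tree layout in their own tooling, here is a small sketch that renders a root form and a list of steps as an indented tree. The function `format_chain` is not Himotoki's formatter; the five-space step per level is simply read off the sample output in this section.

```python
def format_chain(root, reading, steps):
    """Render a root form and (label, suffix, gloss) steps as an indented tree.

    `gloss` may be an empty string, in which case the ': gloss' part is omitted.
    """
    lines = [f"  ← {root} 【{reading}】"]
    for depth, (label, suffix, gloss) in enumerate(steps):
        indent = " " * (2 + 5 * depth)  # each level shifts right by five columns
        line = f"{indent}└─ {label} ({suffix})"
        if gloss:
            line += f": {gloss}"
        lines.append(line)
    return "\n".join(lines)


print(format_chain(
    "食べる", "たべる",
    [
        ("Potential/Passive", "れる", "can do / is done (to)"),
        ("Negative", "ない", "not"),
        ("Past (~ta)", "かった", "did/was"),
    ],
))
```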

Suffix Compounds

Himotoki recognizes productive suffix patterns and merges them into compound words with grammatical labels:

| Category | Suffixes | Example |
| --- | --- | --- |
| Progressive / completive | ている, てある, てしまう, ておく | 食べている → "is eating" |
| Giving / receiving | てくれる, てもらう, てあげる, てやる | 読んであげる → "read for someone" |
| Desire / attempt | たい, てほしい, てみる | 食べたい → "want to eat" |
| Appearance / degree | そう, すぎる, っぽい, らしい | 高すぎる → "too expensive" |
| Compound verbs | 出す, 切る, 合う, 込む, 始める, 終わる | 食べ始める → "start eating" |
| Nominalization | さ, み, 方 | 深み → "depth" |
| Contractions | ちゃう, じゃう, とく, なきゃ | 食べちゃった → "ended up eating" |
| Polite / formal | です, でしょう, ください, いたす | 食べてください → "please eat" |
| Na-adjective suffixes | すぎる, っぽい, み, そう | 静かすぎる → "too quiet" |
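
As a rough illustration of the merging idea, the sketch below joins a te-form segment with a following いる or しまう and keeps the parts. It is a deliberately crude stand-in: the real tool presumably relies on dictionary and conjugation data rather than the string test used here, and the names `TE_SUFFIXES` and `merge_te_compounds` are invented for this example.

```python
# Hypothetical lookup: suffix text -> grammatical label (two rows only).
TE_SUFFIXES = {"いる": "progressive", "しまう": "completion"}


def merge_te_compounds(segments):
    """Merge 'X て/で' + known suffix into one compound, keeping the parts."""
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        looks_like_te_form = (
            prev is not None
            and prev["label"] is None
            and len(prev["text"]) > 1          # skip the bare particles て / で
            and prev["text"].endswith(("て", "で"))
        )
        if looks_like_te_form and seg in TE_SUFFIXES:
            merged[-1] = {
                "text": prev["text"] + seg,
                "parts": prev["parts"] + [seg],
                "label": TE_SUFFIXES[seg],
            }
        else:
            merged.append({"text": seg, "parts": [seg], "label": None})
    return merged


print(merge_te_compounds(["食べて", "いる"]))
```

The single-character guard keeps the particle で in a sequence such as 学校 / で / いる from being mistaken for a te-form ending.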

How It Works

Himotoki processes Japanese text through three stages:

  1. Segmentation -- A dynamic-programming algorithm considers all possible word boundaries and selects the highest-scoring path. Scoring uses dictionary frequency data, part-of-speech synergies, and penalty heuristics ported from ichiran.

  2. Suffix Compound Assembly -- Adjacent segments are checked against known suffix patterns (te-iru, te-shimau, tai, sou, etc.). Matching segments are merged into compound WordInfo objects with preserved component structure.

  3. Conjugation Chain Resolution -- For each conjugated word, the system queries the conjugation database to walk the via chain from the surface form back to the dictionary entry. Each step records the conjugation type, suffix text, and English gloss, then formats the result as an indented tree.
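
The first stage can be sketched as a small dynamic program over character positions. This is a simplified illustration, not the scoring engine itself: the lexicon and its scores are invented, and each word contributes a fixed score, whereas the description above also mentions synergies and penalties between neighbouring words.

```python
# Toy lexicon with made-up scores; longer dictionary words score higher.
LEXICON = {"学校": 10, "学": 2, "校": 2, "で": 3, "勉強": 10, "勉": 1, "強": 1}


def segment(text, lexicon, max_len=8):
    """Return (score, words) for the best-scoring split of `text`, or None."""
    # best[i] holds the best (score, words) covering text[:i], if any.
    best = [None] * (len(text) + 1)
    best[0] = (0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue  # text[:start] cannot be segmented at all
            word = text[start:end]
            if word not in lexicon:
                continue
            candidate = (best[start][0] + lexicon[word], best[start][1] + [word])
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    return best[len(text)]


print(segment("学校で勉強", LEXICON))  # prints: (23, ['学校', 'で', '勉強'])
```

Keeping only the best partial result per position is what makes this linear in the text length (times `max_len`) instead of enumerating every possible split.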


Project Structure

himotoki/
    segment.py             # Viterbi-style segmentation engine
    lookup.py              # Dictionary lookup, scoring, conjugation data
    output.py              # WordInfo, conjugation tree, JSON/text formatting
    suffixes.py            # Suffix compound detection (te-iru, tai, etc.)
    synergies.py           # Part-of-speech synergy and penalty rules
    conjugation_hints.py   # Supplementary conjugation patterns
    constants.py           # Conjugation type IDs, POS tags, glosses
    characters.py          # Kana/kanji conversion, romanization
    counters.py            # Japanese counter expression handling
    cli.py                 # Command-line interface
    db/                    # SQLAlchemy models and connection management
    loading/               # JMDict XML parsing and database generation
scripts/
    llm_eval.py            # LLM-based accuracy evaluation (510 sentences)
    check_segments.py      # Quick segmentation change checker
    llm_report.py          # HTML report generator
tests/                     # 433 tests (pytest + hypothesis)
data/                      # Dictionary data, evaluation datasets

Development

Setup

git clone https://github.com/msr2903/himotoki.git
cd himotoki
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -x --tb=short

Testing

# Run all tests
pytest tests/ -x --tb=short

# Run conjugation tree tests only
pytest tests/test_conjugation_tree.py -v

# Run with coverage
pytest tests/ --cov=himotoki --cov-report=term-missing

LLM Accuracy Evaluation

The project includes an LLM-based evaluation system that scores segmentation accuracy against 510 curated Japanese sentences:

python scripts/llm_eval.py --quick          # 50-sentence subset
python scripts/llm_eval.py                  # Full evaluation
python scripts/llm_eval.py --rescore 5      # Re-evaluate entry #5
python scripts/llm_report.py                # Generate HTML report

Current accuracy: 510/510 (100%) on the v3 evaluation prompt.


License

Distributed under the MIT License. See LICENSE for details.

Acknowledgments
