Japanese Morphological Analyzer and Romanization tool - Python remake of ichiran

Project description

Himotoki (紐解き)

Himotoki (紐解き, "unraveling") is a Python remake of ichiran, the comprehensive Japanese morphological analyzer. It segments Japanese text into words, provides dictionary definitions, and traces conjugation chains back to their root forms -- all powered by a portable SQLite backend.

Features

Portable SQLite Backend -- No PostgreSQL setup required. Dictionary data lives in a single file (~3 GB) that is generated on first use.
Dynamic-Programming Segmentation -- Uses a Viterbi-style algorithm to find the most linguistically plausible word boundaries.
Deep Dictionary Integration -- Built on JMdict, providing glosses, part-of-speech tags, usage notes, and cross-references.
Recursive Deconjugation -- Walks the conjugation database to trace inflected forms (passive, causative, te-form, negation, etc.) back to dictionary entries.
Conjugation Breakdown Tree -- Displays each transformation step in a visual tree with the suffix, grammatical label, and English gloss.
Compound Word Detection -- Recognizes suffix compounds (te-iru progressive, te-shimau completion, tai desiderative, sou appearance, etc.) and shows their internal structure.
Scoring Engine -- Implements synergy and penalty heuristics from ichiran to resolve segmentation ambiguities.
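
To make the recursive deconjugation idea concrete, here is a minimal sketch. It is not Himotoki's implementation: the rule table, the one-word dictionary, and the function name are all hypothetical, and the real tool reads its rules from the conjugation database rather than a hard-coded list.

```python
# Toy suffix-rewrite rules: (surface suffix, replacement, label).
# Hypothetical and far from complete; shown only to illustrate the recursion.
RULES = [
    ("なかった", "ない", "Past (~ta)"),
    ("られない", "られる", "Negative"),
    ("られる", "る", "Potential/Passive"),
]

DICTIONARY = {"食べる"}  # stand-in for a real dictionary lookup


def deconjugate(surface, chain=()):
    """Return (root, steps) with steps ordered from the root outward, or None."""
    if surface in DICTIONARY:
        return surface, list(chain)
    for suffix, replacement, label in RULES:
        if surface.endswith(suffix):
            # Undo one inflection, then keep unwinding the shorter form.
            candidate = surface[: -len(suffix)] + replacement
            result = deconjugate(candidate, (label,) + tuple(chain))
            if result is not None:
                return result
    return None
```

Calling deconjugate("食べられなかった") unwinds the past, negative, and potential/passive layers in turn until it reaches 食べる. An input that no rule can reduce to a dictionary entry yields None.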

Installation

pip install himotoki

First-Time Setup

On first run, Himotoki offers to download JMdict and build the SQLite database. The process takes approximately 10-20 minutes and requires about 3 GB of free disk space.

himotoki "日本語テキスト"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Welcome to Himotoki!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

First-time setup required. This will:
  - Download JMdict dictionary data (~15MB compressed)
  - Generate optimized SQLite database (~3GB)
  - Store data in ~/.himotoki/

Proceed with setup? [Y/n]:

Non-interactive setup for CI environments:

himotoki setup --yes

Usage

Command Line

# Default: dictionary info with conjugation breakdown
himotoki "学校で勉強しています"

* 学校 【がっこう】
1. [n] school

* で
1. [prt] at; in
2. [prt] at; when
3. [prt] by; with

* 勉強しています 【べんきょう しています】
1. [n,vs,vt] study
2. [n,vs,vi] diligence; working hard
  └─ する (makes a verb from a noun)
       └─ Conjunctive (~te) (て)
            └─ いる (indicates continuing action (to be ...ing))
                 └─ Polite (ます)

# Full output: romanization + dictionary info + conjugation tree
himotoki -f "食べられなかった"

taberarenakatta

* taberarenakatta  食べられなかった 【たべられなかった】
1. [v1,vt] to eat
2. [v1,vt] to live on (e.g. a salary); to live off; to subsist on
  ← 食べる 【たべる】
  └─ Potential/Passive (れる): can do / is done (to)
       └─ Negative (ない): not
            └─ Past (~ta) (かった): did/was

# Simple romanization
himotoki -r "学校で勉強しています"
# Output: gakkou de benkyou shiteimasu

# Kana reading with spaces
himotoki -k "学校で勉強しています"
# Output: がっこう で べんきょう しています

# JSON output for programmatic use
himotoki -j "学校で勉強しています"
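
One way to consume the JSON output from another Python process is sketched below. The helper name and its command parameter are assumptions made for illustration. The structure of the JSON document is not described here, so the sketch only decodes it and leaves interpretation to the caller.

```python
import json
import subprocess


def run_json_cli(text, command=("himotoki", "-j")):
    """Run the CLI with `text` appended and decode its stdout as JSON."""
    completed = subprocess.run(
        [*command, text],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)
```

With Himotoki installed and its database already built, run_json_cli("学校で勉強しています") would return the decoded document. Running the first-time setup beforehand is advisable, since an interactive prompt on stdout would not parse as JSON.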

Python API

import himotoki

# Optional: pre-warm caches for faster first request
himotoki.warm_up()

# Analyze Japanese text
results = himotoki.analyze("日本語を勉強しています")

for words, score in results:
    for w in words:
        print(f"{w.text} 【{w.kana}】 - {w.gloss[:50]}...")
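
The loop above visits every (words, score) pair. If only the top-scoring segmentation is wanted, a small helper along these lines may be enough. This is a sketch under assumptions: the Word tuple is a stand-in that models only the text, kana, and gloss attributes used above, the sample data and scores are hand-written rather than real output, and it relies on higher scores being better.

```python
from collections import namedtuple

# Stand-in for Himotoki's word objects; only the attributes used in the
# example above are modelled.
Word = namedtuple("Word", "text kana gloss")


def best_segmentation(results):
    """Return the (words, score) pair with the highest score."""
    return max(results, key=lambda pair: pair[1])


def to_rows(words):
    """Flatten one segmentation into plain (text, kana, gloss) tuples."""
    return [(w.text, w.kana, w.gloss) for w in words]


# Hand-written sample shaped like the pairs iterated above, not real output.
sample = [
    ([Word("学校", "がっこう", "school"), Word("で", "で", "at; in")], 120),
    ([Word("学", "がく", "learning"), Word("校", "こう", "school"), Word("で", "で", "at; in")], 15),
]

words, score = best_segmentation(sample)
rows = to_rows(words)
```

The same two helpers should work on the value returned by himotoki.analyze, provided it yields (words, score) pairs as in the loop above.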

Conjugation Breakdown

Himotoki traces conjugated words through every transformation step, showing the root form and each inflection applied:

$ himotoki -f "書かせられていた"

kakaserareteita

* kakaserareteita  書かせられていた 【かかせられていた】
1. [v5k,vt] to write; to compose; to pen
2. [v5k,vt] to draw; to paint
  ← 書く 【かく】
  └─ Causative (かせ): makes do
       └─ Passive (られる): is done (to)
      └─ Conjunctive (~te, progressive) (て)
        └─ Past (~ta) (た): did/was

A deeply nested chain parsed into its constituent parts:

$ himotoki -f "飲んでしまいたかった"

nondeshimaitakatta

* nondeshimaitakatta  飲んでしまいたかった 【のんでしまいたかった】
1. [v5m,vt] to drink; to swallow; to take (medicine)
  ← 飲む 【のむ】
  └─ Conjunctive (~te) (んで)
       └─ しまう (indicates completion / to do something by accident or regret)
            └─ Continuative (~i) (い): and (stem)
                 └─ たい (want to... / would like to...)
                      └─ Past (~ta) (かった): did/was

Full sentence analysis with per-word dictionary entries and conjugation trees:

$ himotoki "学校で勉強しています"

* 学校 【がっこう】
1. [n] school

* で
1. [prt] at; in
2. [prt] at; when
3. [prt] by; with

* 勉強しています 【べんきょう しています】
1. [n,vs,vt] study
2. [n,vs,vi] diligence; working hard
  ← 勉強する 【べんきょうする】
  └─ Conjunctive (~te, progressive) (て)
    └─ Polite (ます)
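
The indented layout used in these trees can be reproduced with a few lines of string formatting. The sketch below is illustrative only: the function name and parameters are hypothetical, and it covers just the simple case where each step is nested five columns deeper than the previous one.

```python
def render_chain(root, reading, steps, indent=2, step_width=5):
    """Format a root form and its ordered conjugation steps as an indented tree."""
    lines = [f"{' ' * indent}← {root} 【{reading}】"]
    for depth, step in enumerate(steps):
        # Each step sits `step_width` columns deeper than the one before it.
        lines.append(f"{' ' * (indent + depth * step_width)}└─ {step}")
    return "\n".join(lines)


tree = render_chain(
    "食べる",
    "たべる",
    [
        "Potential/Passive (れる): can do / is done (to)",
        "Negative (ない): not",
        "Past (~ta) (かった): did/was",
    ],
)
print(tree)
```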

Suffix Compounds

Himotoki recognizes productive suffix patterns and merges them into compound words with grammatical labels:

Category	Suffixes	Example
Progressive / completive	ている, てある, てしまう, ておく	食べている → "is eating"
Giving / receiving	てくれる, てもらう, てあげる, てやる	読んであげる → "read for someone"
Desire / attempt	たい, てほしい, てみる	食べたい → "want to eat"
Appearance / degree	そう, すぎる, っぽい, らしい	高すぎる → "too expensive"
Compound verbs	出す, 切る, 合う, 込む, 始める, 終わる	食べ始める → "start eating"
Nominalization	さ, み, 方	深み → "depth"
Contractions	ちゃう, じゃう, とく, なきゃ	食べちゃった → "ended up eating"
Polite / formal	です, でしょう, ください, いたす	食べてください → "please eat"
Na-adjective suffixes	すぎる, っぽい, み, そう	静かすぎる → "too quiet"
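
As a rough illustration of how suffix patterns like those in the table could be matched, the sketch below merges a te-form segment with a following auxiliary. The pattern table, labels, and function name are hypothetical and cover only three of the listed suffixes. Himotoki's actual detection lives in suffixes.py and may work quite differently.

```python
# Hypothetical pattern table: (allowed endings of the left segment,
# suffix segment, label).
PATTERNS = [
    (("て", "で"), "いる", "progressive"),
    (("て", "で"), "しまう", "completive"),
    (("て", "で"), "ください", "polite request"),
]


def merge_suffix_compounds(segments):
    """Merge adjacent segments that match a pattern into (text, parts, label)."""
    merged = []
    i = 0
    while i < len(segments):
        current = segments[i]
        following = segments[i + 1] if i + 1 < len(segments) else None
        for endings, suffix, label in PATTERNS:
            if following == suffix and current.endswith(endings):
                # Keep the parts so the compound's structure stays visible.
                merged.append((current + following, [current, following], label))
                i += 2
                break
        else:
            merged.append((current, [current], None))
            i += 1
    return merged
```

For example, merge_suffix_compounds(["食べて", "いる"]) produces one compound labelled "progressive" that still lists both components, while segments that match no pattern pass through unchanged.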

How It Works

Himotoki processes Japanese text through three stages:

Segmentation -- A dynamic-programming algorithm considers all possible word boundaries and selects the highest-scoring path. Scoring uses dictionary frequency data, part-of-speech synergies, and penalty heuristics ported from ichiran.
Suffix Compound Assembly -- Adjacent segments are checked against known suffix patterns (te-iru, te-shimau, tai, sou, etc.). Matching segments are merged into compound WordInfo objects with preserved component structure.
Conjugation Chain Resolution -- For each conjugated word, the system queries the conjugation database to walk the via chain from the surface form back to the dictionary entry. Each step records the conjugation type, suffix text, and English gloss, then formats the result as an indented tree.
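
The segmentation stage can be sketched as a small dynamic program over character positions. This is a simplified illustration, not the engine in segment.py. The toy score table is invented, scores are simply added per word, and the synergy and penalty heuristics are left out entirely.

```python
# Invented word scores for illustration; real scoring draws on dictionary
# frequency data and part-of-speech rules.
TOY_SCORES = {"学校": 10, "学": 2, "校": 1, "で": 3, "勉強": 10, "勉": 1, "強": 1}


def segment(text, scores, max_len=8):
    """Return (best_score, words) for the highest-scoring split, or None."""
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [None] * (len(text) + 1)
    best[0] = (0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is None or word not in scores:
                continue
            candidate = (best[start][0] + scores[word], best[start][1] + [word])
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    return best[len(text)]


result = segment("学校で勉強", TOY_SCORES)
```

Because every prefix keeps only its best-scoring split, the work grows linearly with the text length, bounded by max_len candidate words per position, instead of enumerating every possible combination of boundaries.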

Project Structure

himotoki/
    segment.py             # Viterbi-style segmentation engine
    lookup.py              # Dictionary lookup, scoring, conjugation data
    output.py              # WordInfo, conjugation tree, JSON/text formatting
    suffixes.py            # Suffix compound detection (te-iru, tai, etc.)
    synergies.py           # Part-of-speech synergy and penalty rules
    conjugation_hints.py   # Supplementary conjugation patterns
    constants.py           # Conjugation type IDs, POS tags, glosses
    characters.py          # Kana/kanji conversion, romanization
    counters.py            # Japanese counter expression handling
    cli.py                 # Command-line interface
    db/                    # SQLAlchemy models and connection management
    loading/               # JMdict XML parsing and database generation
scripts/
    llm_eval.py            # LLM-based accuracy evaluation (510 sentences)
    check_segments.py      # Quick segmentation change checker
    llm_report.py          # HTML report generator
tests/                     # 433 tests (pytest + hypothesis)
data/                      # Dictionary data, evaluation datasets

Development

Setup

git clone https://github.com/msr2903/himotoki.git
cd himotoki
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -x --tb=short

Testing

# Run all tests
pytest tests/ -x --tb=short

# Run conjugation tree tests only
pytest tests/test_conjugation_tree.py -v

# Run with coverage
pytest tests/ --cov=himotoki --cov-report=term-missing

LLM Accuracy Evaluation

The project includes an LLM-based evaluation system that scores segmentation accuracy against 510 curated Japanese sentences:

python scripts/llm_eval.py --quick          # 50-sentence subset
python scripts/llm_eval.py                  # Full evaluation
python scripts/llm_eval.py --rescore 5      # Re-evaluate entry #5
python scripts/llm_report.py                # Generate HTML report

Current accuracy: 510/510 (100%) on the v3 evaluation prompt.

License

Distributed under the MIT License. See LICENSE for details.

Acknowledgments

tshatrov for the original ichiran implementation.
EDRDG for the JMdict dictionary resource.

Project details

These details have been verified by PyPI

Maintainers

msr2903

Release history

0.3.1 (this version): Feb 13, 2026
0.3.0: Feb 7, 2026
0.2.3: Jan 14, 2026
0.2.2: Jan 12, 2026
0.2.1: Jan 11, 2026
0.2.0: Jan 11, 2026
0.1.1: Jan 10, 2026
0.1.0: Jan 10, 2026

Download files

Source Distribution

himotoki-0.3.1.tar.gz (218.9 kB)

Uploaded Feb 13, 2026 Source

Built Distribution

himotoki-0.3.1-py3-none-any.whl (174.8 kB)

Uploaded Feb 13, 2026 Python 3

File details

Details for the file himotoki-0.3.1.tar.gz.

File metadata

Download URL: himotoki-0.3.1.tar.gz
Upload date: Feb 13, 2026
Size: 218.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for himotoki-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`066fa963d433b6028750bc41f9d16ce5c4006ea74813d20397e369eabb947bad`
MD5	`fe7249848bd58e326592df325123b04d`
BLAKE2b-256	`6300e0ecd2c3cec810965d033151a93db0e09f8c312420c36d9cbdfe90d4a59d`
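
A downloaded file can be checked against the digests listed above with Python's standard hashlib module. The helper names below are illustrative, and the "blake2b-256" label is a convention of this sketch that maps to hashlib.blake2b with a 32-byte digest.

```python
import hashlib


def digest_of_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published value, ignoring case."""
    return digest_of_file(path, algorithm) == expected_hex.strip().lower()
```

For the source distribution, matches("himotoki-0.3.1.tar.gz", "066fa963d433b6028750bc41f9d16ce5c4006ea74813d20397e369eabb947bad") should return True if the download is intact.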

Provenance

The following attestation bundles were made for himotoki-0.3.1.tar.gz:

Publisher: publish.yml on msr2903/himotoki

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: himotoki-0.3.1.tar.gz
- Subject digest: 066fa963d433b6028750bc41f9d16ce5c4006ea74813d20397e369eabb947bad
- Sigstore transparency entry: 947347201
- Sigstore integration time: Feb 13, 2026
Source repository:
- Permalink: msr2903/himotoki@4c2a07d1224f5863bbdbc2740f129637e274942b
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/msr2903
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4c2a07d1224f5863bbdbc2740f129637e274942b
- Trigger Event: release

File details

Details for the file himotoki-0.3.1-py3-none-any.whl.

File metadata

Download URL: himotoki-0.3.1-py3-none-any.whl
Upload date: Feb 13, 2026
Size: 174.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for himotoki-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c36dcd2950403f58b2a3ee64cefda1a819cc9d97738bc93221f513b3e759cbb8`
MD5	`a70d9b3844c38b57f54ba4461461eee6`
BLAKE2b-256	`f80097fe5cf654655fff1a1c3ad1cb58677eac700281d542a5d691c7bc24061c`

Provenance

The following attestation bundles were made for himotoki-0.3.1-py3-none-any.whl:

Publisher: publish.yml on msr2903/himotoki

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: himotoki-0.3.1-py3-none-any.whl
- Subject digest: c36dcd2950403f58b2a3ee64cefda1a819cc9d97738bc93221f513b3e759cbb8
- Sigstore transparency entry: 947347208
- Sigstore integration time: Feb 13, 2026
Source repository:
- Permalink: msr2903/himotoki@4c2a07d1224f5863bbdbc2740f129637e274942b
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/msr2903
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4c2a07d1224f5863bbdbc2740f129637e274942b
- Trigger Event: release
