MeCrab - High-Performance Morphological Analyzer

A pure Rust implementation of a morphological analyzer compatible with MeCab dictionaries (IPADIC format).

Workspace Structure

mecrab/
├── mecrab/          # Core library - runtime morphological analyzer
├── mecrab-builder/  # Data pipeline - Wikidata/Wikipedia processing
├── mecrab-word2vec/ # Word2Vec training with Hogwild! parallelization
├── kizame/          # CLI tool - KizaMe (刻め!)
├── fuzz/            # Fuzz testing
└── docs/            # Engineering directives and design docs
Crate            Description                          Dependencies
-----            -----------                          ------------
mecrab           Core runtime library                 Minimal (lightweight)
mecrab-builder   Semantic dictionary builder          Heavy (tokio, reqwest)
mecrab-word2vec  High-performance Word2Vec training   rayon, rand
kizame           CLI with optional builder            Lightweight by default; heavy with --features builder

Features

  • High Performance: SIMD-accelerated Viterbi (AVX2), parallel batch processing
  • Zero-copy Parsing: Memory-mapped dictionary loading (see the sketch after this list)
  • Thread-safe: Safe concurrent access using Rust's ownership model
  • Live Dictionary Updates: Add/remove words at runtime without restart
  • Semantic Enrichment: Wikidata URI linking, JSON-LD/RDF export (Turtle, N-Triples, N-Quads)
  • N-best Paths: A* algorithm for multiple path analysis
  • Streaming Processing: Sentence boundary detection for large text
  • Text Normalization: NFKC, width conversion, case folding
  • Phonetic Transduction: Kana ↔ Romaji, X-SAMPA, IPA conversion
  • Word Embeddings: Pure Rust Word2Vec with Hogwild! parallelization (83% efficiency)
  • Interactive TUI Debugger: Lattice explorer with cost visualization
  • Cross-platform: Native binaries, WASM, Python bindings
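
As a rough illustration of memory-mapped, zero-copy loading (the general technique, not MeCrab's actual loader), here is a minimal sketch using the memmap2 crate, assuming a dictionary file named sys.dic:

use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("sys.dic")?;
    // SAFETY: the underlying file must not be truncated while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // Dictionary entries are read directly out of the mapped bytes;
    // nothing is copied into intermediate buffers.
    let n = mmap.len().min(16);
    println!("{} bytes mapped, header: {:?}", mmap.len(), &mmap[..n]);
    Ok(())
}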

Note: When using --with-ipa with default text output, pipe through cat for proper terminal display of Unicode IPA characters. This is not required for JSON-LD output.

Installation

CLI (KizaMe)

# Default (lightweight)
cargo install kizame

# With Wikidata builder
cargo install kizame --features builder

Rust Library

[dependencies]
mecrab = "0.1"

# Optional features
mecrab = { version = "0.1", features = ["json", "parallel"] }

Quick Start

CLI Usage

# Initialize dictionary (finds system IPADIC)
kizame dict init

# Interactive parsing
echo "すもももももももものうち" | kizame

# With dictionary path
kizame -d /var/lib/mecab/dic/ipadic-utf8 parse

# Wakati (space-separated)
echo "日本語の形態素解析" | kizame -w

# JSON output
echo "東京都" | kizame -O json

# With IPA pronunciation (default format - requires | cat for terminal display)
echo "こんにちは" | kizame parse --with-ipa | cat

# With word embeddings (default format)
echo "私は学生です" | kizame parse --with-vector -v /path/to/vectors.bin | cat

# With both IPA and vectors (default format)
echo "東京に行く" | kizame parse --with-ipa --with-vector -v /path/to/vectors.bin | cat

# JSON-LD with semantic URIs
echo "東京に行く" | kizame -O jsonld --with-semantic

# JSON-LD with IPA pronunciation
echo "こんにちは" | kizame -O jsonld --with-ipa

# JSON-LD with word embeddings
echo "私は学生です" | kizame -O jsonld --with-vector -v /path/to/vectors.bin

# RDF formats (Turtle, N-Triples, N-Quads)
echo "東京に行く" | kizame -O turtle
echo "東京に行く" | kizame -O ntriples
echo "東京に行く" | kizame -O nquads

# Dictionary info
kizame dict info
kizame dict dump -d /path/to/dic

# Interactive TUI debugger
kizame explore "東京に行く"

Rust API

use mecrab::MeCrab;

let mecrab = MeCrab::new()?;
let result = mecrab.parse("すもももももももものうち")?;
println!("{}", result);

// Add custom words at runtime
mecrab.add_word("ChatGPT", "チャットジーピーティー", "チャットジーピーティー", 5000);

// N-best paths
use mecrab::viterbi::NbestSearch;
let nbest = NbestSearch::new(&mecrab);
for path in nbest.search("東京", 5)? {
    println!("Cost: {}, Path: {:?}", path.total_cost, path.nodes);
}

// Streaming processing
use mecrab::stream::SentenceReader;
let reader = SentenceReader::new(input);
for sentence in reader {
    let result = mecrab.parse(&sentence)?;
}

// Phonetic conversion
use mecrab::phonetic::PhoneticTransducer;
let transducer = PhoneticTransducer::new();
let romaji = transducer.to_romaji("こんにちは"); // "konnichiha"

Training Word Embeddings

KizaMe includes a high-performance pure Rust Word2Vec implementation with Hogwild! lock-free parallelization.

Performance

  • 83% parallel efficiency on 6 cores (499.7% CPU usage out of a 600% ceiling)
  • ~500K words/sec/core training throughput
  • 2.27x speedup vs mutex-based baseline
  • Memory-mapped MCV1 format for instant vector loading
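
Hogwild! means worker threads apply SGD updates to shared weights without any locking, accepting the occasional lost update because sparse gradient updates rarely touch the same slot. A minimal sketch of the idea in safe Rust (weights stored as atomic f32 bit patterns; illustrative only, not the mecrab-word2vec internals):

use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let dim: usize = 8;
    // f32 weights stored as atomic bit patterns so threads can write
    // concurrently without a lock.
    let weights: Arc<Vec<AtomicU32>> =
        Arc::new((0..dim).map(|_| AtomicU32::new(0f32.to_bits())).collect());

    let handles: Vec<_> = (0..4usize)
        .map(|t| {
            let w = Arc::clone(&weights);
            thread::spawn(move || {
                for step in 0..1000 {
                    let i = (t + step) % dim;
                    // Unsynchronized read-modify-write: a concurrent update
                    // may be lost, which Hogwild! tolerates.
                    let old = f32::from_bits(w[i].load(Ordering::Relaxed));
                    w[i].store((old + 0.01).to_bits(), Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("w[0] = {}", f32::from_bits(weights[0].load(Ordering::Relaxed)));
}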

Training Pipeline

# Extract vocabulary
kizame dict dump -d /var/lib/mecab/dic/ipadic-utf8 --vocab > vocab.txt
MAX_WORD_ID=$(tail -1 vocab.txt | cut -f1)

# Parse corpus to word_id sequences
cat corpus.txt | kizame parse --wakati-word-id > corpus_ids.txt

# Train Word2Vec (MCV1 binary format - recommended)
kizame vectors train \
  -i corpus_ids.txt \
  -o vectors.bin \
  -f mcv1 \
  --max-word-id $MAX_WORD_ID \
  --size 100 \
  --window 5 \
  --negative 5 \
  --epochs 3 \
  --threads 6  # Automatically uses Hogwild!

# Use trained vectors
echo "東京に行く" | kizame parse --with-vector -v vectors.bin --with-ipa | cat

Technical Details: See mecrab-word2vec/IMPLEMENTATION.md for Hogwild! algorithm details, safety analysis, and performance benchmarks.

Training Guide: See docs/WORD2VEC_TRAINING_GUIDE.md for full training pipeline with Japanese Wikipedia corpus.

Building Semantic Dictionaries

With --features builder, KizaMe can build semantically enriched dictionaries from Wikidata:

# Download Wikidata dump
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz

# Build extended dictionary
kizame build \
  --source ipadic.csv \
  --wikidata latest-all.json.gz \
  --output ./semantic-dic \
  --max-candidates 5

# Use semantic enrichment
echo "東京に行く" | kizame -d ./semantic-dic -O jsonld --with-semantic

Output includes Wikidata URIs:

{
  "@context": { "wd": "http://www.wikidata.org/entity/", ... },
  "tokens": [
    {
      "surface": "東京",
      "pos": "名詞",
      "wcost": 3003,
      "entities": [
        {"@id": "wd:Q1490", "confidence": 0.95}
      ]
    }
  ]
}
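
For downstream use, the JSON-LD can be consumed with any JSON library. A sketch with serde_json (the hard-coded input below stands in for captured kizame output):

use serde_json::Value;

fn main() -> serde_json::Result<()> {
    // Stand-in for output from `kizame -O jsonld --with-semantic`.
    let raw = r#"{
        "tokens": [
            {"surface": "東京", "pos": "名詞",
             "entities": [{"@id": "wd:Q1490", "confidence": 0.95}]}
        ]
    }"#;
    let doc: Value = serde_json::from_str(raw)?;
    for tok in doc["tokens"].as_array().into_iter().flatten() {
        for ent in tok["entities"].as_array().into_iter().flatten() {
            println!("{} -> {} (confidence {})",
                     tok["surface"], ent["@id"], ent["confidence"]);
        }
    }
    Ok(())
}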

Development

# Run tests
cargo nextest run

# Clippy (no warnings policy)
cargo clippy --workspace

# Benchmarks
cargo bench -p mecrab

# Fuzz testing
cd fuzz && cargo +nightly fuzz run viterbi

Statistics

  • ~11,000 lines of Rust
  • 174 tests (all passing)
  • 4 fuzz targets
  • 0 clippy warnings
  • 83% parallel efficiency (Word2Vec training)

License

MIT OR Apache-2.0

Copyright

Copyright 2026 COOLJAPAN OU (Team KitaSan)
