MeCrab - High-Performance Morphological Analyzer
A pure Rust implementation of a morphological analyzer compatible with MeCab dictionaries (IPADIC format).
Workspace Structure
mecrab/
├── mecrab/ # Core library - runtime morphological analyzer
├── mecrab-builder/ # Data pipeline - Wikidata/Wikipedia processing
├── mecrab-word2vec/ # Word2Vec training with Hogwild! parallelization
├── kizame/ # CLI tool - KizaMe (刻め!)
├── fuzz/ # Fuzz testing
└── docs/ # Engineering directives and design docs
| Crate | Description | Dependencies |
|---|---|---|
| mecrab | Core runtime library | Minimal (lightweight) |
| mecrab-builder | Semantic dictionary builder | Heavy (tokio, reqwest) |
| mecrab-word2vec | High-performance Word2Vec training | rayon, rand |
| kizame | CLI with optional builder | Lightweight default, heavy with --features builder |
Features
- High Performance: SIMD-accelerated Viterbi (AVX2), parallel batch processing
- Zero-copy Parsing: Memory-mapped dictionary loading
- Thread-safe: Safe concurrent access using Rust's ownership model
- Live Dictionary Updates: Add/remove words at runtime without restart
- Semantic Enrichment: Wikidata URI linking, JSON-LD/RDF export (Turtle, N-Triples, N-Quads)
- N-best Paths: A* algorithm for multiple path analysis
- Streaming Processing: Sentence boundary detection for large text
- Text Normalization: NFKC, width conversion, case folding
- Phonetic Transduction: Kana ↔ Romaji, X-SAMPA, IPA conversion
- Word Embeddings: Pure Rust Word2Vec with Hogwild! parallelization (83% efficiency)
- Interactive TUI Debugger: Lattice explorer with cost visualization
- Cross-platform: Native binaries, WASM, Python bindings
Note: When using --with-ipa with default text output, pipe through cat for proper terminal display of Unicode IPA characters. This is not required for JSON-LD output.
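MeCab-style analysis searches a lattice of dictionary candidates for the minimum-cost path, summing per-word costs and connection costs between adjacent part-of-speech ids. SIMD acceleration aside, the core dynamic program can be sketched as follows (illustrative only; the Node and cost structures here are hypothetical simplifications, not MeCrab's actual API):

```rust
use std::collections::HashMap;

// Each node is a candidate token spanning [start, end) with a word cost;
// `conn` gives the connection cost between part-of-speech ids of adjacent
// nodes, as in MeCab's matrix.def.
struct Node {
    start: usize,
    end: usize,
    word_cost: i64,
    pos_id: usize,
}

fn viterbi(len: usize, nodes: &[Node], conn: &dyn Fn(usize, usize) -> i64) -> i64 {
    // best[(i, pos)] = min cost of a path covering [0, i) whose last node has `pos`
    let mut best: HashMap<(usize, usize), i64> = HashMap::new();
    best.insert((0, 0), 0); // BOS state with pos_id 0
    for i in 0..len {
        let states: Vec<(usize, i64)> = best
            .iter()
            .filter(|((p, _), _)| *p == i)
            .map(|((_, pos), c)| (*pos, *c))
            .collect();
        for (prev_pos, cost_here) in states {
            for n in nodes.iter().filter(|n| n.start == i) {
                let cand = cost_here + conn(prev_pos, n.pos_id) + n.word_cost;
                let slot = best.entry((n.end, n.pos_id)).or_insert(i64::MAX);
                if cand < *slot {
                    *slot = cand;
                }
            }
        }
    }
    best.iter()
        .filter(|((p, _), _)| *p == len)
        .map(|(_, c)| *c)
        .min()
        .expect("lattice has no complete path")
}

fn main() {
    // Two-position input: one node covering the whole span competes with
    // two single-position nodes whose connection cost makes them lose.
    let nodes = [
        Node { start: 0, end: 1, word_cost: 10, pos_id: 1 },
        Node { start: 1, end: 2, word_cost: 10, pos_id: 1 },
        Node { start: 0, end: 2, word_cost: 25, pos_id: 2 },
    ];
    let conn = |a: usize, b: usize| if a == 1 && b == 1 { 10 } else { 0 };
    println!("min cost = {}", viterbi(2, &nodes, &conn)); // prints: min cost = 25
}
```

The single-span node wins here (25 vs 10 + 10 + 10) because the connection penalty between the two short nodes outweighs their cheaper word costs; this is exactly the trade-off the real connection matrix encodes.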
Installation
CLI (KizaMe)
# Default (lightweight)
cargo install kizame
# With Wikidata builder
cargo install kizame --features builder
Rust Library
[dependencies]
mecrab = "0.1"
# Optional features
mecrab = { version = "0.1", features = ["json", "parallel"] }
Quick Start
CLI Usage
# Initialize dictionary (finds system IPADIC)
kizame dict init
# Interactive parsing
echo "すもももももももものうち" | kizame
# With dictionary path
kizame -d /var/lib/mecab/dic/ipadic-utf8 parse
# Wakati (space-separated)
echo "日本語の形態素解析" | kizame -w
# JSON output
echo "東京都" | kizame -O json
# With IPA pronunciation (default format - requires | cat for terminal display)
echo "こんにちは" | kizame parse --with-ipa | cat
# With word embeddings (default format)
echo "私は学生です" | kizame parse --with-vector -v /path/to/vectors.bin | cat
# With both IPA and vectors (default format)
echo "東京に行く" | kizame parse --with-ipa --with-vector -v /path/to/vectors.bin | cat
# JSON-LD with semantic URIs
echo "東京に行く" | kizame -O jsonld --with-semantic
# JSON-LD with IPA pronunciation
echo "こんにちは" | kizame -O jsonld --with-ipa
# JSON-LD with word embeddings
echo "私は学生です" | kizame -O jsonld --with-vector -v /path/to/vectors.bin
# RDF formats (Turtle, N-Triples, N-Quads)
echo "東京に行く" | kizame -O turtle
echo "東京に行く" | kizame -O ntriples
echo "東京に行く" | kizame -O nquads
# Dictionary info
kizame dict info
kizame dict dump -d /path/to/dic
# Interactive TUI debugger
kizame explore "東京に行く"
Rust API
use mecrab::MeCrab;
let mecrab = MeCrab::new()?;
let result = mecrab.parse("すもももももももものうち")?;
println!("{}", result);
// Add custom words at runtime
mecrab.add_word("ChatGPT", "チャットジーピーティー", "チャットジーピーティー", 5000);
// N-best paths
use mecrab::viterbi::NbestSearch;
let nbest = NbestSearch::new(&mecrab);
for path in nbest.search("東京", 5)? {
println!("Cost: {}, Path: {:?}", path.total_cost, path.nodes);
}
// Streaming processing
use mecrab::stream::SentenceReader;
let reader = SentenceReader::new(input);
for sentence in reader {
let result = mecrab.parse(&sentence)?;
}
// Phonetic conversion
use mecrab::phonetic::PhoneticTransducer;
let transducer = PhoneticTransducer::new();
let romaji = transducer.to_romaji("こんにちは"); // "konnichiha"
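The exact boundary rules of SentenceReader are not documented here; as a rough illustration of the idea, a minimal splitter on Japanese sentence terminators might look like this (assumed behavior, not MeCrab's implementation):

```rust
// Illustrative sentence splitting on Japanese terminators (。！？).
// A real SentenceReader would also handle quotes, newlines, and ellipses.
fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '。' | '！' | '？') {
            sentences.push(std::mem::take(&mut current));
        }
    }
    if !current.trim().is_empty() {
        sentences.push(current); // trailing text without a terminator
    }
    sentences
}
```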
Training Word Embeddings
KizaMe includes a high-performance pure Rust Word2Vec implementation with Hogwild! lock-free parallelization:
Performance
- 83% parallel efficiency on 6 cores (499.7% CPU usage)
- ~500K words/sec/core training throughput
- 2.27x speedup vs mutex-based baseline
- Memory-mapped MCV1 format for instant vector loading
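Hogwild! works by letting worker threads apply SGD updates to shared weights with no locks at all, accepting occasional lost updates as a benign race that SGD tolerates. A stdlib-only sketch of the pattern, storing f32 weights as bits in AtomicU32 with relaxed ordering (MeCrab's actual implementation lives in mecrab-word2vec and is documented in its IMPLEMENTATION.md):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Hogwild!-style update: read-modify-write without any lock or CAS loop.
// Concurrent threads can overwrite each other's updates, which SGD tolerates.
fn add_relaxed(cell: &AtomicU32, delta: f32) {
    let old = f32::from_bits(cell.load(Ordering::Relaxed));
    cell.store((old + delta).to_bits(), Ordering::Relaxed);
}

fn main() {
    let weights: Arc<Vec<AtomicU32>> =
        Arc::new((0..4).map(|_| AtomicU32::new(0f32.to_bits())).collect());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let w = Arc::clone(&weights);
            thread::spawn(move || {
                for i in 0..1000 {
                    add_relaxed(&w[i % 4], 0.001); // unsynchronized gradient step
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let w0 = f32::from_bits(weights[0].load(Ordering::Relaxed));
    // Close to 1.0 (4 threads x 250 increments), minus any lost updates.
    println!("w[0] ≈ {w0}");
}
```

The absence of any lock or compare-and-swap loop is what buys the near-linear scaling; the cited 2.27x speedup over a mutex-based baseline comes from eliminating exactly this contention.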
Training Pipeline
# Extract vocabulary
kizame dict dump -d /var/lib/mecab/dic/ipadic-utf8 --vocab > vocab.txt
MAX_WORD_ID=$(tail -1 vocab.txt | cut -f1)
# Parse corpus to word_id sequences
cat corpus.txt | kizame parse --wakati-word-id > corpus_ids.txt
# Train Word2Vec (MCV1 binary format - recommended)
kizame vectors train \
-i corpus_ids.txt \
-o vectors.bin \
-f mcv1 \
--max-word-id $MAX_WORD_ID \
--size 100 \
--window 5 \
--negative 5 \
--epochs 3 \
--threads 6 # Automatically uses Hogwild!
# Use trained vectors
echo "東京に行く" | kizame parse --with-vector -v vectors.bin --with-ipa | cat
Technical Details: See mecrab-word2vec/IMPLEMENTATION.md for Hogwild! algorithm details, safety analysis, and performance benchmarks.
Training Guide: See docs/WORD2VEC_TRAINING_GUIDE.md for full training pipeline with Japanese Wikipedia corpus.
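For context, --window bounds how far apart a (center, context) training pair may sit in the word-id sequences produced above. A simplified sketch of skip-gram pair generation (fixed window; real trainers, likely including this one, typically also shrink the window randomly and subsample frequent words):

```rust
// Generate skip-gram (center, context) pairs from a word-id sequence.
// Simplified: fixed window, no dynamic window shrinking or subsampling.
fn skipgram_pairs(ids: &[u32], window: usize) -> Vec<(u32, u32)> {
    let mut pairs = Vec::new();
    if ids.is_empty() {
        return pairs;
    }
    for (i, &center) in ids.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window).min(ids.len() - 1);
        for j in lo..=hi {
            if j != i {
                pairs.push((center, ids[j]));
            }
        }
    }
    pairs
}
```

Each pair becomes one positive example; with --negative 5, five randomly drawn word ids are added as negative examples per pair.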
Building Semantic Dictionaries
With --features builder, KizaMe can build semantic-enriched dictionaries from Wikidata:
# Download Wikidata dump
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
# Build extended dictionary
kizame build \
--source ipadic.csv \
--wikidata latest-all.json.gz \
--output ./semantic-dic \
--max-candidates 5
# Use semantic enrichment
echo "東京に行く" | kizame -d ./semantic-dic -O jsonld --with-semantic
Output includes Wikidata URIs:
{
"@context": { "wd": "http://www.wikidata.org/entity/", ... },
"tokens": [
{
"surface": "東京",
"pos": "名詞",
"wcost": 3003,
"entities": [
{"@id": "wd:Q1490", "confidence": 0.95}
]
}
]
}
Development
# Run tests
cargo nextest run
# Clippy (no warnings policy)
cargo clippy --workspace
# Benchmarks
cargo bench -p mecrab
# Fuzz testing
cd fuzz && cargo +nightly fuzz run viterbi
Statistics
- ~11,000 lines of Rust
- 174 tests (all passing)
- 4 fuzz targets
- 0 clippy warnings
- 83% parallel efficiency (Word2Vec training)
License
MIT OR Apache-2.0
Copyright
Copyright 2026 COOLJAPAN OU (Team KitaSan)