Python binding for Lindera with UniDic dictionary
lindera-python
Python binding for Lindera, a Japanese morphological analysis engine.
Overview
lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. This implementation includes all major features:
- Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
- Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
- Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
- Flexible Configuration: Configurable tokenization modes and penalty settings
- Metadata Support: Complete dictionary schema and metadata management
Features
Core Components
- TokenizerBuilder: Fluent API for building customized tokenizers
- Tokenizer: High-performance text tokenization with integrated filtering
- CharacterFilter: Pre-processing filters for text normalization
- TokenFilter: Post-processing filters for token refinement
- Metadata & Schema: Dictionary structure and configuration management
- Training & Export (optional): Train custom morphological analysis models from corpus data
Supported Dictionaries
- Japanese: IPADIC (embedded), UniDic (embedded)
- Korean: ko-dic (embedded)
- Chinese: CC-CEDICT (embedded)
- Custom: User dictionary support
Filter Types
Character Filters:
- Mapping filter (character replacement)
- Regex filter (pattern-based replacement)
- Unicode normalization (NFKC, etc.)
- Japanese iteration mark normalization
Token Filters:
- Text case transformation (lowercase, uppercase)
- Length filtering (min/max character length)
- Stop words filtering
- Japanese-specific filters (base form, reading form, etc.)
- Korean-specific filters
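As a rough mental model (not the actual implementation, which runs inside Lindera's Rust core), the token-filter stage behaves like a chain of functions applied to a token list in order; the filter names below are illustrative stand-ins for the built-in lowercase, length, and stop-words filters:

```python
# Simplified, pure-Python sketch of a token-filter chain.
# Real Lindera filters operate on Token objects in Rust; here tokens
# are plain strings to keep the pipeline idea visible.

def lowercase(tokens):
    return [t.lower() for t in tokens]

def length(tokens, min_len=2, max_len=10):
    # Keep tokens whose character length is within [min_len, max_len]
    return [t for t in tokens if min_len <= len(t) <= max_len]

def stop_words(tokens, stops):
    return [t for t in tokens if t not in stops]

def apply_filters(tokens, filters):
    # Each filter consumes the previous filter's output, in order
    for f in filters:
        tokens = f(tokens)
    return tokens

result = apply_filters(
    ["Tokyo", "の", "Sky", "Tree"],
    [lowercase, lambda ts: length(ts, 2, 10), lambda ts: stop_words(ts, {"の"})],
)
print(result)  # ['tokyo', 'sky', 'tree']
```

The ordering matters: a lowercase filter placed before a stop-words filter lets the stop list be maintained in a single case, which is the usual reason filters are registered in a deliberate sequence.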
Install project dependencies
- pyenv : https://github.com/pyenv/pyenv?tab=readme-ov-file#installation
- Poetry : https://python-poetry.org/docs/#installation
- Rust : https://www.rust-lang.org/tools/install
Install Python
```shell
# Install Python
% pyenv install 3.13.5
```
Set up the repository and activate a virtual environment

```shell
# Clone the lindera-python repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python

# Set the Python version for this project
% pyenv local 3.13.5

# Create a Python virtual environment
% python -m venv .venv

# Activate the virtual environment
% source .venv/bin/activate

# Initialize the lindera-python project
(.venv) % make init
```
Install lindera-python as a library in the virtual environment
This command takes a long time because it builds a library that includes all the dictionaries.
```shell
(.venv) % make develop
```
Quick Start
Basic Tokenization
```python
from lindera import TokenizerBuilder

# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)
for token in tokens:
    print(f"Text: {token.text}, Position: {token.position}")
```
Using Character Filters
```python
from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()

text = "テストー123"
tokens = tokenizer.tokenize(text)  # Filters are applied automatically
```
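The unicode_normalize filter's "nfkc" mode corresponds to standard Unicode NFKC normalization, so you can preview its effect on a string with Python's standard library before wiring it into the tokenizer:

```python
import unicodedata

# NFKC folds full-width ASCII and half-width katakana into their
# canonical forms, matching what the "nfkc" character filter applies
# to input text before tokenization.
print(unicodedata.normalize("NFKC", "テストー１２３"))  # テストー123
print(unicodedata.normalize("NFKC", "ﾃｽﾄ"))            # テスト
```

Note that NFKC leaves the katakana long-vowel mark ー unchanged, which is why the example above pairs the normalization filter with a separate mapping filter.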
Using Token Filters
```python
from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()

tokens = tokenizer.tokenize("テキストの解析")
```
Integrated Pipeline
```python
from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")
```
Working with Metadata
```python
from lindera import Metadata

# Load metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields
```
Advanced Usage
Filter Configuration Examples
Character filters and token filters accept configuration as dictionary arguments:
```python
from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
```
See the examples/ directory for comprehensive examples, including:
- tokenize.py: Basic tokenization
- tokenize_with_filters.py: Using character and token filters
- tokenize_with_userdict.py: Custom user dictionary
- train_and_export.py: Train and export custom dictionaries (requires the train feature)
- Multi-language tokenization
- Advanced configuration options
Dictionary Support
Japanese
- IPADIC: Default Japanese dictionary, good for general text
- UniDic: Academic dictionary with detailed morphological information
Korean
- ko-dic: Standard Korean dictionary for morphological analysis
Chinese
- CC-CEDICT: Community-maintained Chinese-English dictionary
Custom Dictionaries
- User dictionary support for domain-specific terms
- CSV format for easy customization
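As a sketch of that CSV customization, a minimal simple-format user dictionary entry carries a surface form, a part-of-speech label, and a reading; the exact columns expected can vary by dictionary type, so treat the layout below as an assumption to check against the Lindera documentation:

```python
import csv
import io

# Hypothetical domain-specific terms: (surface form, part of speech, reading).
# The three-column layout is the simple user-dictionary format; a detailed
# format with full feature columns also exists.
terms = [
    ("東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"),
    ("とうきょうスカイツリー駅", "カスタム名詞", "トウキョウスカイツリーエキ"),
]

buf = io.StringIO()
csv.writer(buf).writerows(terms)
print(buf.getvalue())
```

The resulting file can then be supplied to the tokenizer builder as a user dictionary; see examples/tokenize_with_userdict.py for the end-to-end wiring.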
Dictionary Training (Experimental)
lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.
Building with Training Support
```shell
# Install with training support
(.venv) % maturin develop --features train
```
Training a Model
```python
import lindera

# Train a model from corpus
lindera.train(
    seed="path/to/seed.csv",            # Seed lexicon
    corpus="path/to/corpus.txt",        # Training corpus
    char_def="path/to/char.def",        # Character definitions
    unk_def="path/to/unk.def",          # Unknown word definitions
    feature_def="path/to/feature.def",  # Feature templates
    rewrite_def="path/to/rewrite.def",  # Rewrite rules
    output="model.dat",                 # Output model file
    lambda_=0.01,                       # L1 regularization
    max_iter=100,                       # Max iterations
    max_threads=None,                   # Auto-detect CPU cores
)
```
Exporting Dictionary Files
```python
# Export trained model to dictionary files
lindera.export(
    model="model.dat",        # Trained model
    output="exported_dict/",  # Output directory
    metadata="metadata.json"  # Optional metadata file
)
```
This will create:
- lex.csv: Lexicon file
- matrix.def: Connection cost matrix
- unk.def: Unknown word definitions
- char.def: Character definitions
- metadata.json: Dictionary metadata (if provided)
See examples/train_and_export.py for a complete example.
API Reference
Core Classes
- TokenizerBuilder: Fluent builder for tokenizer configuration
- Tokenizer: Main tokenization engine
- Token: Individual token with text, position, and linguistic features
- CharacterFilter: Text preprocessing filters
- TokenFilter: Token post-processing filters
- Metadata: Dictionary metadata and configuration
- Schema: Dictionary schema definition
Training Functions (requires train feature)
- train(): Train a morphological analysis model from a corpus
- export(): Export a trained model to dictionary files
See the test_basic.py file for comprehensive API usage examples.