lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding covers all of the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT); see the sketch after this list
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management
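
For example, switching languages is just a matter of loading a different embedded dictionary. A minimal sketch, assuming the embedded:// URIs for ko-dic and CC-CEDICT follow the same pattern as the IPADIC example in Quick Start:

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Korean: assumed URI "embedded://ko-dic"
korean = Tokenizer(load_dictionary("embedded://ko-dic"), mode="normal")
print([token.surface for token in korean.tokenize("안녕하세요")])

# Chinese: assumed URI "embedded://cc-cedict"
chinese = Tokenizer(load_dictionary("embedded://cc-cedict"), mode="normal")
print([token.surface for token in chinese.tokenize("中华人民共和国")])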

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Set up the repository and activate the virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop
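
Alternatively, prebuilt wheels are published on PyPI under the package name lindera-python-unidic and can be installed directly:

(.venv) % pip install lindera-python-unidic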

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization (see the sketch below)
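
A minimal sketch of wiring in a user dictionary. Both the CSV row format and the set_user_dictionary setter name are assumptions for illustration; see examples/tokenize_with_userdict.py for the actual API:

from lindera import TokenizerBuilder

# userdict.csv (hypothetical contents), one custom term per row:
#   東京スカイツリー,カスタム名詞,トウキョウスカイツリー

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")
builder.set_user_dictionary("userdict.csv")  # assumed setter name
tokenizer = builder.build()

# The custom term should now surface as a single token instead of being split
tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅")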

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features (see the sketch after this list)
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition
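
A sketch of inspecting a token's attributes. surface, byte_start, and byte_end appear in Quick Start above; the details attribute (the part-of-speech feature list) is an assumed name mirroring Lindera's Rust API, so check test_basic.py to confirm it:

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

tokenizer = Tokenizer(load_dictionary("embedded://ipadic"), mode="normal")

for token in tokenizer.tokenize("関西国際空港"):
    # details is assumed to hold the dictionary's feature columns for this token
    print(token.surface, token.byte_start, token.byte_end, token.details)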

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
