Python binding for Lindera (no embedded dictionaries)

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a Python interface to the Lindera 3.0.0 morphological analysis engine, which supports Japanese, Korean, and Chinese text analysis. Its main features are:

  • Multi-language Support: Japanese (IPADIC, IPADIC-NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC, IPADIC-NEologd, UniDic
  • Korean: ko-dic
  • Chinese: CC-CEDICT, Jieba
  • Custom: User dictionary support

Pre-built dictionaries are available from GitHub Releases. Download a dictionary archive (e.g. lindera-ipadic-*.zip) and specify the extracted path when loading.
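The extraction step can be scripted with the standard library. The following is a minimal sketch, not part of lindera-python: the helper name extract_dictionary is hypothetical, and it assumes the archive either holds the dictionary files directly or wraps them in a single top-level folder, and that the destination directory is empty.

```python
import zipfile
from pathlib import Path


def extract_dictionary(archive_path: str, dest_dir: str) -> Path:
    """Extract a dictionary zip archive and return the directory to load.

    Hypothetical helper: assumes dest_dir is empty before extraction.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)

    # If the archive wraps everything in one top-level folder, return it;
    # otherwise the files were extracted directly into dest.
    entries = list(dest.iterdir())
    if len(entries) == 1 and entries[0].is_dir():
        return entries[0]
    return dest


# Example usage (paths are placeholders):
# dictionary_path = extract_dictionary("/path/to/archive.zip", "/path/to/dictionaries")
# dictionary = load_dictionary(str(dictionary_path))
```

The returned path is what you would pass to load_dictionary or set_dictionary in the examples further down.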

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization
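To illustrate what NFKC normalization does to text, the sketch below uses Python's standard unicodedata module rather than Lindera itself. It only shows the Unicode transformation; whether Lindera's unicode_normalize filter produces byte-for-byte identical output is an assumption, not something this example verifies.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization using the standard library."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits are folded to their ASCII forms.
print(nfkc("ＡＢＣ１２３"))

# Half-width katakana with a separate voicing mark is composed
# into a single full-width character.
print(nfkc("ｶﾞ"))
```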

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

The following command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary from a local path (download from GitHub Releases)
dictionary = load_dictionary("/path/to/ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")
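The positions printed above are byte offsets, so they cannot be used directly to slice a Python str, which is indexed by character. A minimal sketch of the conversion follows. The helper name byte_span_to_char_span is hypothetical, and it assumes the offsets are UTF-8 byte offsets that fall on character boundaries.

```python
def byte_span_to_char_span(text: str, byte_start: int, byte_end: int) -> tuple[int, int]:
    """Convert a UTF-8 byte span into a character span for slicing a str.

    Assumes both offsets fall on UTF-8 character boundaries.
    """
    encoded = text.encode("utf-8")
    # Decoding the prefix up to each offset and counting its characters
    # gives the matching character index.
    char_start = len(encoded[:byte_start].decode("utf-8"))
    char_end = len(encoded[:byte_end].decode("utf-8"))
    return char_start, char_end


text = "すもももももももものうち"
# Each hiragana character occupies 3 bytes in UTF-8.
start, end = byte_span_to_char_span(text, 9, 12)
print(text[start:end])
```

For long texts with many tokens, building a byte-to-character index once would avoid re-encoding the text for every token.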

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("/path/to/ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("/path/to/ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
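When the filter chain comes from application settings, it can help to keep it as data and apply it in one place. The sketch below is one possible arrangement, not part of lindera-python: apply_filters is a hypothetical helper that relies only on the append_character_filter and append_token_filter calls shown above. It assumes character filters always receive a configuration dict, while token filters may omit it.

```python
from typing import Any, Optional


def apply_filters(
    builder: Any,
    character_filters: list[tuple[str, dict[str, Any]]],
    token_filters: list[tuple[str, Optional[dict[str, Any]]]],
) -> Any:
    """Append filters to a TokenizerBuilder-like object in the given order."""
    for name, config in character_filters:
        builder.append_character_filter(name, config)
    for name, config in token_filters:
        if config is None:
            # Filters without configuration are appended by name only.
            builder.append_token_filter(name)
        else:
            builder.append_token_filter(name, config)
    return builder


# Example filter chain, kept as plain data:
CHARACTER_FILTERS = [("unicode_normalize", {"kind": "nfkc"})]
TOKEN_FILTERS = [("lowercase", None), ("length", {"min": 2, "max": 10})]

# Example usage (requires lindera-python and a dictionary path):
# builder = TokenizerBuilder()
# builder.set_dictionary("/path/to/ipadic")
# tokenizer = apply_filters(builder, CHARACTER_FILTERS, TOKEN_FILTERS).build()
```

Filters run in the order they are appended, so the order of the lists matters.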

See the examples/ directory for further examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires the train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
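A user dictionary file can be generated with Python's csv module. This is a minimal sketch with a hypothetical helper name, and it assumes a three-column layout (surface form, part of speech, reading). That layout is an assumption, so check the Lindera documentation and examples/tokenize_with_userdict.py for the columns your dictionary expects.

```python
import csv


def write_user_dictionary(path: str, entries: list[tuple[str, str, str]]) -> None:
    """Write (surface, part_of_speech, reading) rows as a UTF-8 CSV file.

    The three-column layout is an assumption for illustration.
    """
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerows(entries)


# Example entry for a domain-specific term:
ENTRIES = [("東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー")]

# Example usage (path is a placeholder):
# write_user_dictionary("/path/to/userdic.csv", ENTRIES)
```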

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
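After an export, a quick check that the expected files exist can catch a failed or partial run. The sketch below uses only the file names listed above. The helper name missing_export_files is hypothetical and not part of lindera-python.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("lex.csv", "matrix.def", "unk.def", "char.def")


def missing_export_files(output_dir: str, expect_metadata: bool = False) -> list[str]:
    """Return the names of expected export files that are absent from output_dir."""
    expected = list(EXPECTED_FILES)
    if expect_metadata:
        # metadata.json is only written when a metadata file was provided.
        expected.append("metadata.json")
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


# Example usage (directory from the export call above):
# missing = missing_export_files("exported_dict/", expect_metadata=True)
# if missing:
#     raise FileNotFoundError(f"Export incomplete, missing: {missing}")
```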

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
