Python binding for Lindera (no embedded dictionaries)

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a Python interface to the Lindera 3.0.0 morphological analysis engine and supports Japanese, Korean, and Chinese text analysis. Its main features are:

  • Multi-language Support: Japanese (IPADIC, IPADIC-NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC, IPADIC-NEologd, UniDic
  • Korean: ko-dic
  • Chinese: CC-CEDICT, Jieba
  • Custom: User dictionary support

Pre-built dictionaries are available from GitHub Releases. Download a dictionary archive (e.g. lindera-ipadic-*.zip) and specify the extracted path when loading.
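
The extraction step can be sketched with the standard library alone. The function name `extract_dictionary` and the example paths are hypothetical, and the layout inside a real release archive is not described here, so inspect the extracted directory before passing a path to the loader.

```python
import zipfile
from pathlib import Path


def extract_dictionary(archive_path, target_dir):
    """Extract a downloaded dictionary archive and return the target directory.

    Hypothetical helper: it only unpacks the zip file and does not
    validate that the contents are a usable Lindera dictionary.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return target


# Example call (paths are placeholders, not real release file names):
# dictionary_dir = extract_dictionary("downloads/dictionary.zip", "dictionaries")
```

The returned path, or a subdirectory of it depending on how the archive is organized, is what you would then pass to the dictionary loader.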

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
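
To make the token filter idea concrete, here is a pure-Python sketch of length filtering and stop word filtering applied to plain strings. It does not call Lindera; the function names are hypothetical, and the real filters operate on token objects and may treat edge cases differently.

```python
def length_filter(tokens, min_len=None, max_len=None):
    """Keep tokens whose character length lies within [min_len, max_len]."""
    kept = []
    for token in tokens:
        n = len(token)
        if min_len is not None and n < min_len:
            continue
        if max_len is not None and n > max_len:
            continue
        kept.append(token)
    return kept


def stop_word_filter(tokens, stop_words):
    """Drop tokens that appear in the stop word collection."""
    stop = set(stop_words)
    return [token for token in tokens if token not in stop]


# Filters run one after another, each seeing the previous filter's output.
tokens = ["東京", "の", "スカイツリー", "に", "行く"]
tokens = stop_word_filter(tokens, ["の", "に"])
tokens = length_filter(tokens, min_len=2, max_len=10)
print(tokens)
```

Chaining the two functions mirrors how post-processing filters are stacked on a tokenizer: each one narrows or rewrites the token list produced by the step before it.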

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop
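
After the build finishes, a quick way to confirm that the package is visible to the active interpreter is to ask the import system for it. This is a small sketch using only the standard library; the helper name `is_importable` is hypothetical, and the module name `lindera` is taken from the import statements used in the examples in this document.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Reports False if the build did not install into this environment.
print("lindera importable:", is_importable("lindera"))
```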

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary from a local path (download from GitHub Releases)
dictionary = load_dictionary("/path/to/ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")
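
The names `byte_start` and `byte_end` suggest offsets into the encoded text rather than character indices. Assuming they index the UTF-8 encoding of the input (an assumption worth checking against your installed version), the following sketch shows how to slice by byte offsets and how to convert a byte offset to a character offset. Both helper names are hypothetical.

```python
def slice_by_bytes(text, byte_start, byte_end):
    """Return the substring covered by a UTF-8 byte range (assumed encoding)."""
    return text.encode("utf-8")[byte_start:byte_end].decode("utf-8")


def byte_to_char_offset(text, byte_offset):
    """Convert a UTF-8 byte offset into a character (code point) offset."""
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "すもももももももものうち"
# Each hiragana character occupies 3 bytes in UTF-8,
# so bytes 0-9 cover the first three characters.
print(slice_by_bytes(text, 0, 9))
print(byte_to_char_offset(text, 9))
```

This matters when highlighting tokens in the original string: Python's `str` slicing uses character indices, so byte offsets must be converted first.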

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically
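
To see roughly what the two character filters above do to the text before tokenization, the sketch below reproduces them with the standard library. It does not call Lindera. It assumes filters run in the order they were appended and approximates the mapping filter with plain string replacement; the real filter's matching strategy may differ. The helper names are hypothetical.

```python
import unicodedata


def apply_mapping(text, mapping):
    """Replace each key with its value, trying longer keys first."""
    for source in sorted(mapping, key=len, reverse=True):
        text = text.replace(source, mapping[source])
    return text


def preprocess(text):
    """Approximate the mapping filter followed by NFKC normalization."""
    text = apply_mapping(text, {"ー": "-"})
    return unicodedata.normalize("NFKC", text)


# Full-width digits are used here so the effect of NFKC is visible.
print(preprocess("テストー１２３"))
```

NFKC folds compatibility variants such as full-width digits and half-width katakana into their canonical forms, which keeps dictionary lookups consistent across input sources.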

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("/path/to/ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("/path/to/ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
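
Because each filter is just a name plus an optional dictionary, a pipeline can be kept as plain data and applied in a loop. The sketch below assumes only the two `append_*` methods shown above. `apply_filters` and `RecordingBuilder` are hypothetical names, and `RecordingBuilder` is a stand-in that records calls so the example runs without Lindera installed; in real use you would pass a `TokenizerBuilder` instead.

```python
CHARACTER_FILTERS = [
    ("unicode_normalize", {"kind": "nfkc"}),
]
TOKEN_FILTERS = [
    ("lowercase", None),  # None means the filter takes no configuration
    ("length", {"min": 2, "max": 10}),
]


def apply_filters(builder, character_filters, token_filters):
    """Append filters to a builder from (name, config) pairs, in order."""
    for name, config in character_filters:
        builder.append_character_filter(name, config)
    for name, config in token_filters:
        if config is None:
            builder.append_token_filter(name)
        else:
            builder.append_token_filter(name, config)
    return builder


class RecordingBuilder:
    """Stand-in for illustration only; not part of lindera."""

    def __init__(self):
        self.calls = []

    def append_character_filter(self, name, config=None):
        self.calls.append(("character", name, config))

    def append_token_filter(self, name, config=None):
        self.calls.append(("token", name, config))


recorder = apply_filters(RecordingBuilder(), CHARACTER_FILTERS, TOKEN_FILTERS)
for call in recorder.calls:
    print(call)
```

Keeping the pipeline as data makes it easy to load the same filter list from a configuration file or to share it between tokenizers.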

See the examples/ directory for further examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • IPADIC-NEologd: IPADIC extended with neologisms and named entities
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary
  • Jieba: Dictionary from the Jieba Chinese word segmentation project

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
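
A user dictionary file can be generated with the standard `csv` module. The three-column layout used below (surface form, part of speech, reading) is an assumption, since the exact column layout is not specified here; check examples/tokenize_with_userdict.py for the format your version expects. The helper name `write_user_dictionary` and the sample entry are illustrative.

```python
import csv

# Assumed layout: surface form, part-of-speech label, reading.
ENTRIES = [
    ("東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"),
]


def write_user_dictionary(path, rows):
    """Write user dictionary rows as UTF-8 CSV, one entry per line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)


# Example call (the file name is a placeholder):
# write_user_dictionary("userdic.csv", ENTRIES)
```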

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
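
A small check after exporting can confirm that the files listed above were written. This sketch uses only the standard library; the helper name `missing_export_files` is hypothetical, and metadata.json is left out of the required set because it is only produced when a metadata file is supplied.

```python
from pathlib import Path

# File names taken from the list of export outputs.
EXPECTED_FILES = ("lex.csv", "matrix.def", "unk.def", "char.def")


def missing_export_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


# Example call (the directory name matches the export example):
# missing = missing_export_files("exported_dict/")
# if missing:
#     raise SystemExit(f"Export incomplete, missing: {missing}")
```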

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
