
lindera-python

Python binding for Lindera, a Japanese morphological analysis engine. This package is the build that ships with the UniDic dictionary.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. It covers all of the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support
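
Each embedded dictionary is addressed by a URI. The examples below use embedded://ipadic; the other URIs here are assumptions inferred from the dictionary names above, so verify them against the Lindera documentation:

from lindera.dictionary import load_dictionary

# embedded://ipadic is used throughout this README; the remaining
# URIs are assumed to follow the same naming scheme.
ja_ipadic = load_dictionary("embedded://ipadic")     # Japanese (IPADIC)
ja_unidic = load_dictionary("embedded://unidic")     # Japanese (UniDic) - assumed URI
ko_dic = load_dictionary("embedded://ko-dic")        # Korean (ko-dic) - assumed URI
cc_cedict = load_dictionary("embedded://cc-cedict")  # Chinese (CC-CEDICT) - assumed URI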

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Character filters are applied automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
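
As a sketch of what this might look like, assuming Lindera's simple three-column CSV format (surface, part of speech, reading); the set_user_dictionary call below is hypothetical, so see examples/tokenize_with_userdict.py for the actual API:

# userdict.csv (simple format, assumed):
# リンデラ,カスタム名詞,リンデラ
# 東京スカイツリー,カスタム名詞,トウキョウスカイツリー

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
builder.set_user_dictionary("userdict.csv")  # hypothetical setter, for illustration only
tokenizer = builder.build()
tokens = tokenizer.tokenize("リンデラで形態素解析")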

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)
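
The trailing underscore in lambda_ avoids a clash with Python's lambda keyword. Since it controls L1 regularization, larger values push more feature weights to zero and yield smaller, sparser models.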

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
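
Note that char.def and unk.def appear both as training inputs and as export outputs, so an exported dictionary can presumably seed a further round of training; the example script referenced below is the authority on how the exported files are compiled and loaded for tokenization.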

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
