Python binding for Lindera with IPADIC dictionary

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding covers the following major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Set up the repository and activate the virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop
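
If you only want to use the library rather than develop it, the prebuilt wheels published on PyPI can be installed with pip instead of building from source. The distribution name below is assumed from the published wheel files for this package:

# Install the prebuilt wheel from PyPI (distribution name assumed)
% pip install lindera-python-ipadic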

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
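
A minimal multi-language sketch follows. Only "embedded://ipadic" is demonstrated elsewhere in this README, so the other embedded dictionary URIs below (unidic, ko-dic, cc-cedict) are assumptions based on the same naming pattern, not verified identifiers:

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Assumed embedded dictionary URIs; only "embedded://ipadic" appears
# elsewhere in this README, the others follow the same pattern.
dictionaries = {
    "ja": "embedded://unidic",     # Japanese (UniDic)
    "ko": "embedded://ko-dic",     # Korean
    "zh": "embedded://cc-cedict",  # Chinese
}

samples = {
    "ja": "日本語の形態素解析",
    "ko": "한국어 형태소 분석",
    "zh": "中文分词测试",
}

for lang, uri in dictionaries.items():
    tokenizer = Tokenizer(load_dictionary(uri), mode="normal")
    tokens = tokenizer.tokenize(samples[lang])
    print(lang, [token.surface for token in tokens])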

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
