Python binding for Lindera with IPADIC dictionary

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. Major features include:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera-python project repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python

# Set Python version for this project
% pyenv local 3.13.5

# Make Python virtual environment
% python -m venv .venv

# Activate Python virtual environment
% source .venv/bin/activate

# Initialize lindera-python project
(.venv) % make init

Install lindera-python as a library in the virtual environment

This command takes a long time because it builds a library that includes all the dictionaries.

(.venv) % make develop

Quick Start

Basic Tokenization

from lindera import TokenizerBuilder

# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.text}, Position: {token.position}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
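A user dictionary is a plain CSV file. The sketch below assumes Lindera's simple three-column format (surface form, part of speech, reading); check examples/tokenize_with_userdict.py for the exact layout your build expects. Writing one needs only the standard library:

```python
import csv

# Assumed simple user-dictionary rows: surface, part of speech, reading.
# (Three-column format; see examples/tokenize_with_userdict.py to confirm.)
entries = [
    ("東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"),
    ("リンデラ", "カスタム名詞", "リンデラ"),
]

with open("userdic.csv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(entries)

# Read it back to confirm the layout round-trips.
with open("userdic.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows[0])  # ['東京スカイツリー', 'カスタム名詞', 'トウキョウスカイツリー']
```

The resulting file is then registered with the tokenizer builder; the exact method name is not shown in this README, so refer to the user-dictionary example script.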

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera

# Train a model from corpus
lindera.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
