Python binding for Lindera (no embedded dictionaries)

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a Python interface to the Lindera 3.0.0 morphological analysis engine, which supports Japanese, Korean, and Chinese text analysis. The binding covers the engine's major features:

  • Multi-language Support: Japanese (IPADIC, IPADIC-NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC, IPADIC-NEologd, UniDic
  • Korean: ko-dic
  • Chinese: CC-CEDICT, Jieba
  • Custom: User dictionary support

Pre-built dictionaries are available from GitHub Releases. Download a dictionary archive (e.g. lindera-ipadic-*.zip) and specify the extracted path when loading.
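
The download-and-extract step can be scripted with the standard library. The following is a minimal sketch, not part of lindera-python: the helper name extract_dictionary and the archive name lindera-ipadic-demo.zip are hypothetical, and the demo builds a stand-in archive so it runs without a real download. Depending on how a real archive is laid out, the directory to pass when loading may be a subdirectory of the extraction target.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_dictionary(archive: Path, dest_root: Path) -> Path:
    """Unpack a dictionary zip under dest_root and return the extraction directory."""
    target = dest_root / archive.stem
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


# Demo with a stand-in archive (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "lindera-ipadic-demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("metadata.json", "{}")

    dict_dir = extract_dictionary(archive, tmp_path / "dictionaries")
    extracted_names = sorted(p.name for p in dict_dir.iterdir())
    print(dict_dir.name, extracted_names)
```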

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary from a local path (download from GitHub Releases)
dictionary = load_dictionary("/path/to/ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")
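
The byte_start and byte_end attributes above are byte positions rather than character positions, so they cannot be used directly to slice a Python str. The sketch below assumes they are UTF-8 byte offsets into the input text; byte_span_to_char_span is a hypothetical helper, and the span (0, 9) is an illustrative value, not actual tokenizer output.

```python
def byte_span_to_char_span(text: str, byte_start: int, byte_end: int) -> tuple[int, int]:
    """Convert a UTF-8 byte span into the matching character span of text."""
    encoded = text.encode("utf-8")
    # Decoding the prefix up to each offset yields the number of characters before it.
    char_start = len(encoded[:byte_start].decode("utf-8"))
    char_end = len(encoded[:byte_end].decode("utf-8"))
    return char_start, char_end


text = "すもももももももものうち"
# Hypothetical span: each hiragana character occupies 3 bytes in UTF-8,
# so bytes 0-9 cover the first three characters.
start, end = byte_span_to_char_span(text, 0, 9)
print(text[start:end])
```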

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically
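
To see what these two filters are meant to do to the text before tokenization, the following standard-library sketch applies the same steps in the same order. It only illustrates the idea and is not Lindera's implementation; apply_mapping and preprocess are hypothetical names, and the sample input uses full-width digits so the effect of NFKC is visible.

```python
import unicodedata


def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Replace each mapping key with its value, longest keys first."""
    for src in sorted(mapping, key=len, reverse=True):
        text = text.replace(src, mapping[src])
    return text


def preprocess(text: str) -> str:
    """Mapping replacement followed by NFKC normalization."""
    text = apply_mapping(text, {"ー": "-"})
    return unicodedata.normalize("NFKC", text)


# Full-width digits are folded to ASCII by NFKC; the long-vowel mark is mapped to "-".
print(preprocess("テストー１２３"))
```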

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("/path/to/ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("/path/to/ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
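
Because each filter is just a name plus an optional dict, a pipeline can be kept as data and applied in a loop. The sketch below assumes only the two append_* methods shown above; configure is a hypothetical helper, and RecordingBuilder is a stand-in used so the example runs without a dictionary. In real use, a TokenizerBuilder would be passed instead.

```python
CHARACTER_FILTERS = [
    ("unicode_normalize", {"kind": "nfkc"}),
    ("mapping", {"mapping": {"リンデラ": "lindera"}}),
]
TOKEN_FILTERS = [
    ("lowercase", None),  # no configuration: the dict is omitted
    ("length", {"min": 2, "max": 10}),
]


def configure(builder, character_filters, token_filters):
    """Append the listed filters to builder in order and return it."""
    for name, args in character_filters:
        if args is None:
            builder.append_character_filter(name)
        else:
            builder.append_character_filter(name, args)
    for name, args in token_filters:
        if args is None:
            builder.append_token_filter(name)
        else:
            builder.append_token_filter(name, args)
    return builder


class RecordingBuilder:
    """Stand-in that records calls; not part of lindera-python."""

    def __init__(self):
        self.calls = []

    def append_character_filter(self, name, args=None):
        self.calls.append(("character", name, args))

    def append_token_filter(self, name, args=None):
        self.calls.append(("token", name, args))


recorded = configure(RecordingBuilder(), CHARACTER_FILTERS, TOKEN_FILTERS).calls
for call in recorded:
    print(call)
```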

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
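
A user dictionary CSV can be generated with the standard csv module. The three-column layout used below (surface form, part of speech, reading) is an assumption made for illustration, as are the sample entry and the helper name render_user_dictionary; check the Lindera documentation or examples/tokenize_with_userdict.py for the layout your dictionary expects.

```python
import csv
import io

# Hypothetical entry: (surface, part of speech, reading). The column layout is assumed.
ENTRIES = [
    ("東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"),
]


def render_user_dictionary(entries) -> str:
    """Render entries as CSV text, one entry per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(entries)
    return buf.getvalue()


csv_text = render_user_dictionary(ENTRIES)
print(csv_text, end="")
```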

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)
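
Training takes several input files, so it can help to confirm they all exist before starting a long run. The following is a minimal sketch using only the standard library; missing_inputs is a hypothetical helper, and the demo creates a single placeholder file in a temporary directory.

```python
import tempfile
from pathlib import Path


def missing_inputs(paths: dict[str, str]) -> list[str]:
    """Return the argument names whose path is not an existing file."""
    return [name for name, p in paths.items() if not Path(p).is_file()]


# Demo: only the seed lexicon exists, so the other inputs are reported.
with tempfile.TemporaryDirectory() as tmp:
    seed = Path(tmp) / "seed.csv"
    seed.write_text("", encoding="utf-8")
    inputs = {
        "seed": str(seed),
        "corpus": str(Path(tmp) / "corpus.txt"),
        "char_def": str(Path(tmp) / "char.def"),
    }
    missing = missing_inputs(inputs)
    print(missing)
```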

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
