Python binding for Lindera (no embedded dictionaries)

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a Python interface to the Lindera 3.0.0 morphological analysis engine for Japanese, Korean, and Chinese text analysis. It covers the following features:

  • Multi-language Support: Japanese (IPADIC, IPADIC-NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC, IPADIC-NEologd, UniDic
  • Korean: ko-dic
  • Chinese: CC-CEDICT, Jieba
  • Custom: User dictionary support

Pre-built dictionaries are available from GitHub Releases. Download a dictionary archive (e.g. lindera-ipadic-*.zip) and specify the extracted path when loading.
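
The download-and-extract step can be scripted with the standard library. The following is a minimal sketch, not part of lindera-python: the helper name extract_dictionary is hypothetical, and the archive layout (a single top-level directory versus files at the root) is an assumption, so the function handles both.

```python
import zipfile
from pathlib import Path


def extract_dictionary(archive_path, dest_dir):
    """Extract a dictionary archive and return the directory to load from.

    Hypothetical helper: the archive layout is an assumption, so both a
    single wrapping directory and a flat layout are handled.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
    # If everything was wrapped in one top-level directory, return that
    # directory; otherwise the files sit directly in dest.
    entries = list(dest.iterdir())
    if len(entries) == 1 and entries[0].is_dir():
        return entries[0]
    return dest


# Usage (paths are placeholders):
#   path = extract_dictionary("/path/to/downloaded-archive.zip", "/path/to/dictionaries")
#   dictionary = load_dictionary(str(path))
```

Extract into an empty destination directory so the single-directory check is not confused by unrelated files.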

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization
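
To see what NFKC normalization does to typical Japanese input, the standard library's unicodedata module can be used on its own. This sketch only illustrates the Unicode normalization form; it does not call Lindera, and the helper name nfkc is made up for the example.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC normalization to text."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana and full-width digits are folded to their
# standard forms.
print(nfkc("ｱｲｳ１２３"))  # prints: アイウ123

# The full-width long vowel mark is left unchanged by NFKC, which is
# why a separate mapping filter is useful when it should be replaced.
print(nfkc("ー") == "ー")  # prints: True
```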

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary from a local path (download from GitHub Releases)
dictionary = load_dictionary("/path/to/ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")
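
The byte_start and byte_end attributes are named as byte positions, so they generally do not match Python string indices for non-ASCII text. Assuming they are offsets into the UTF-8 encoding of the input (an assumption based on the attribute names, not confirmed here), a small hypothetical helper can convert them to character indices:

```python
def byte_span_to_char_span(text, byte_start, byte_end):
    """Convert a UTF-8 byte span into a (start, end) character span.

    Assumes both offsets fall on character boundaries of the UTF-8
    encoding of text.
    """
    encoded = text.encode("utf-8")
    char_start = len(encoded[:byte_start].decode("utf-8"))
    char_end = char_start + len(encoded[byte_start:byte_end].decode("utf-8"))
    return char_start, char_end


text = "すもももももももものうち"
# Each hiragana character occupies 3 bytes in UTF-8, so bytes 9-12
# correspond to the fourth character.
start, end = byte_span_to_char_span(text, 9, 12)
print(start, end, text[start:end])  # prints: 3 4 も
```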

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")
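
To make the intent of the length and japanese_stop_tags filters concrete, here is a pure-Python sketch of the same idea applied to hand-written (surface, tag) pairs. It is a simplified illustration, not Lindera's implementation: the token list is made up rather than produced by the tokenizer, and tags are compared by exact match, which may differ from how the real filter matches part-of-speech tags.

```python
def length_filter(tokens, min_len=None, max_len=None):
    """Keep tokens whose surface length (in characters) is within bounds."""
    kept = []
    for surface, tag in tokens:
        n = len(surface)
        if min_len is not None and n < min_len:
            continue
        if max_len is not None and n > max_len:
            continue
        kept.append((surface, tag))
    return kept


def stop_tags_filter(tokens, tags):
    """Drop tokens whose tag is in the stop list (exact match only)."""
    stop = set(tags)
    return [(surface, tag) for surface, tag in tokens if tag not in stop]


# Hand-written tokens for illustration only.
tokens = [("テキスト", "名詞"), ("の", "助詞"), ("解析", "名詞"), ("です", "助動詞")]

tokens = length_filter(tokens, min_len=2, max_len=10)   # drops "の"
tokens = stop_tags_filter(tokens, ["助詞", "助動詞"])    # drops "です"
print([surface for surface, _ in tokens])  # prints: ['テキスト', '解析']
```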

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("/path/to/ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("/path/to/ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
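
When the filter chain comes from configuration rather than hard-coded calls, it can be kept as plain data and applied in a loop. The sketch below relies only on the append_character_filter and append_token_filter methods shown above; apply_filters and RecordingBuilder are hypothetical names, and RecordingBuilder is a stand-in that records calls so the example runs without Lindera or a dictionary.

```python
class RecordingBuilder:
    """Stand-in for TokenizerBuilder that only records filter calls."""

    def __init__(self):
        self.calls = []

    def append_character_filter(self, name, args=None):
        self.calls.append(("character", name, args))

    def append_token_filter(self, name, args=None):
        self.calls.append(("token", name, args))


def apply_filters(builder, character_filters=(), token_filters=()):
    """Append (name, args) pairs to a builder in the given order."""
    for name, args in character_filters:
        builder.append_character_filter(name, args)
    for name, args in token_filters:
        if args is None:
            # Filters without configuration omit the dict.
            builder.append_token_filter(name)
        else:
            builder.append_token_filter(name, args)
    return builder


builder = apply_filters(
    RecordingBuilder(),
    character_filters=[("unicode_normalize", {"kind": "nfkc"})],
    token_filters=[("lowercase", None), ("length", {"min": 2, "max": 10})],
)
for call in builder.calls:
    print(call)
```

With a real TokenizerBuilder in place of the stand-in, the same call would be followed by builder.build().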

See the examples/ directory for more examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • IPADIC-NEologd: IPADIC extended with neologism entries from the NEologd project
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary
  • Jieba: Dictionary from the Jieba Chinese word segmentation project

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
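
A user dictionary file can be generated with the csv module. The column layout below (surface form, part-of-speech label, reading) is an assumption for illustration and is not specified in this document; check examples/tokenize_with_userdict.py for the layout the library expects. The helper name write_user_dictionary and the sample entry are likewise hypothetical.

```python
import csv


def write_user_dictionary(path, entries):
    """Write (surface, part_of_speech, reading) rows as a UTF-8 CSV file.

    The three-column layout is an assumption made for this sketch.
    """
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        for surface, part_of_speech, reading in entries:
            writer.writerow([surface, part_of_speech, reading])


# Usage (path is a placeholder):
#   write_user_dictionary("userdic.csv", [
#       ("東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"),
#   ])
```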

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)
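
Because training takes several input files, it can help to check that they all exist before starting a run. This is a small pre-flight sketch using only the standard library; find_missing_inputs is a hypothetical helper, and the paths are the placeholders from the example above.

```python
from pathlib import Path


def find_missing_inputs(inputs):
    """Return the argument names whose path is not an existing file."""
    return [name for name, path in inputs.items() if not Path(path).is_file()]


inputs = {
    "seed": "path/to/seed.csv",
    "corpus": "path/to/corpus.txt",
    "char_def": "path/to/char.def",
    "unk_def": "path/to/unk.def",
    "feature_def": "path/to/feature.def",
    "rewrite_def": "path/to/rewrite.def",
}

missing = find_missing_inputs(inputs)
if missing:
    print("Missing training inputs:", ", ".join(missing))
```

If the list is empty, the same dictionary can be passed on as keyword arguments, for example lindera.trainer.train(**inputs, output="model.dat").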

Exporting Dictionary Files

import lindera.trainer

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
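
After an export, the output directory can be checked against the file list above. A minimal sketch with the standard library; missing_export_files is a hypothetical helper name, and the file names are taken from the list.

```python
from pathlib import Path

# File names from the list above; metadata.json is only expected when a
# metadata file was passed to export().
EXPECTED_FILES = ("lex.csv", "matrix.def", "unk.def", "char.def")


def missing_export_files(output_dir, expect_metadata=False):
    """Return the expected file names that are absent from output_dir."""
    names = list(EXPECTED_FILES)
    if expect_metadata:
        names.append("metadata.json")
    root = Path(output_dir)
    return [name for name in names if not (root / name).is_file()]


# Usage (directory is the placeholder from the export example):
#   missing = missing_export_files("exported_dict/", expect_metadata=True)
#   if missing:
#       print("Export incomplete, missing:", ", ".join(missing))
```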

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
