
Python binding for Lindera with CJK dictionaries (IPADIC, ko-dic, CC-CEDICT)


lindera-python

Python binding for Lindera, a morphological analysis engine for Japanese, Korean, and Chinese text.

Overview

lindera-python provides a Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding covers the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support
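Each embedded dictionary is addressed by a URI. Only `embedded://ipadic` appears verbatim in the examples below; the other URIs in this sketch are assumptions that follow the same scheme, so check the examples/ directory for the authoritative forms:

```python
# Illustrative mapping from language/dictionary to embedded-dictionary URIs.
# "embedded://ipadic" is taken from the Quick Start example; the remaining
# URIs are assumptions extrapolated from that scheme.
EMBEDDED_DICTIONARIES = {
    "japanese-ipadic": "embedded://ipadic",
    "japanese-unidic": "embedded://unidic",   # assumed
    "korean": "embedded://ko-dic",            # assumed
    "chinese": "embedded://cc-cedict",        # assumed
}

def dictionary_uri(key: str) -> str:
    """Return the embedded-dictionary URI for a language key."""
    return EMBEDDED_DICTIONARIES[key]
```

With lindera-python installed, such a URI is what gets passed to `load_dictionary()`, as shown in the Quick Start below.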

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
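Conceptually, character filters run on the raw text before tokenization, and token filters run on the emitted tokens afterwards. The following plain-Python sketch shows only that ordering; the whitespace "tokenizer" and the filter functions are stand-ins for illustration, not the Lindera API:

```python
# A plain-Python sketch of the filter pipeline ordering. The real
# tokenizer is dictionary-driven; this stand-in just splits on spaces.
def mapping_filter(text: str, mapping: dict) -> str:
    # Character filter: replace characters before tokenization.
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text

def length_filter(tokens: list, min_len: int, max_len: int) -> list:
    # Token filter: drop tokens outside the length range.
    return [t for t in tokens if min_len <= len(t) <= max_len]

def pipeline(text: str) -> list:
    text = mapping_filter(text, {"ー": "-"})   # 1. character filters
    tokens = text.split()                      # 2. tokenize (stand-in)
    tokens = [t.lower() for t in tokens]       # 3. lowercase token filter
    return length_filter(tokens, 2, 10)        # 4. length token filter

print(pipeline("Lindera トークナイザー TEST a"))
# → ['lindera', 'ト-クナイザ-', 'test']
```

The TokenizerBuilder sections below wire up the same stages against the real engine.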

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for complete examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
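A user dictionary is a plain CSV file. The three-column rows below (surface, part of speech, reading) follow the simple format described in Lindera's own documentation; treat the exact column layout as an assumption here and see examples/tokenize_with_userdict.py for the authoritative loading code:

```python
import csv
import os
import tempfile

# Write a minimal user-dictionary CSV: surface, part of speech, reading.
# The column layout is assumed from Lindera's simple userdict format.
rows = [
    ["東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"],
    ["とうきょうスカイツリー駅", "カスタム名詞", "トウキョウスカイツリーエキ"],
]

path = os.path.join(tempfile.mkdtemp(), "userdict.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# The resulting path is what examples/tokenize_with_userdict.py feeds
# into the tokenizer setup.
print(path)
```

Keeping domain-specific terms in a standalone CSV like this lets you version and edit them without rebuilding the embedded dictionaries.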

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
