Python binding for Lindera with ko-dic Korean dictionary

lindera-python

Python binding for Lindera, a morphological analysis engine supporting Japanese, Korean, and Chinese.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding exposes the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support
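
As a sketch of selecting one of these dictionaries: each embedded dictionary is addressed by URI. Only embedded://ipadic is confirmed by the Quick Start below; the other URIs are assumed to follow the dictionary names:

from lindera.dictionary import load_dictionary

# embedded://ipadic appears in the Quick Start below; the other URIs
# are assumptions based on the dictionary names
japanese = load_dictionary("embedded://unidic")
korean = load_dictionary("embedded://ko-dic")
chinese = load_dictionary("embedded://cc-cedict")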

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Set up the repository and activate the virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop
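
To confirm the build is importable, a quick smoke test (the lindera module name is the one used in the examples below):

# Check that the extension module imports cleanly
(.venv) % python -c "import lindera; print('ok')"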

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
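
As a sketch, Lindera's simple user dictionary format is a CSV of surface form, part-of-speech, and reading. The builder method name below is hypothetical; see examples/tokenize_with_userdict.py for the actual API:

# userdict.csv (simple format): surface,part-of-speech,reading
#   東京スカイツリー,カスタム名詞,トウキョウスカイツリー

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")
builder.set_user_dictionary("userdict.csv")  # hypothetical method name
tokenizer = builder.build()
tokens = tokenizer.tokenize("東京スカイツリーに行く")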

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

import lindera.trainer

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
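
The exported directory can then be loaded like any other dictionary. A sketch, assuming load_dictionary accepts a filesystem path for exported dictionaries:

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load the exported dictionary from its output directory
# (path-based loading is an assumption here)
dictionary = load_dictionary("exported_dict/")
tokenizer = Tokenizer(dictionary, mode="normal")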

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
