Python binding for Lindera with ko-dic Korean dictionary

lindera-python

Python binding for Lindera, a morphological analysis engine written in Rust that supports Japanese, Korean, and Chinese.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. It exposes all of the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
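
The two filter stages run at different points in the pipeline: character filters rewrite the input text before segmentation, while token filters transform the token stream afterwards. The following is a pure-Python sketch of that ordering only; it is not lindera's implementation, and whitespace splitting stands in for the dictionary-based segmenter:

```python
# Illustrative pipeline: character filters -> segmentation -> token filters.
# NOT lindera's implementation; whitespace splitting stands in for
# dictionary-based morphological segmentation.

def mapping_char_filter(text, mapping):
    # Character filter: replace substrings before tokenization.
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text

def lowercase_token_filter(tokens):
    # Token filter: case transformation.
    return [t.lower() for t in tokens]

def length_token_filter(tokens, min_len, max_len):
    # Token filter: keep tokens within a character-length range.
    return [t for t in tokens if min_len <= len(t) <= max_len]

text = "Lindera ー Tokenizer"
text = mapping_char_filter(text, {"ー": "-"})   # runs first, on raw text
tokens = text.split()                           # stand-in segmentation
tokens = lowercase_token_filter(tokens)         # runs last, on tokens
tokens = length_token_filter(tokens, 2, 10)     # drops the lone "-"
print(tokens)  # ['lindera', 'tokenizer']
```

The same ordering applies when filters are registered on a TokenizerBuilder: every character filter is applied to the input string, then tokenization happens once, then every token filter is applied in registration order.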

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
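
Because the user dictionary is plain CSV, one can be generated with nothing but the standard library. A sketch follows; the three-column surface / part-of-speech / reading layout shown here follows lindera's simple user-dictionary format as commonly documented, but verify the exact columns your dictionary type expects, and note the entries themselves are hypothetical domain terms:

```python
import csv

# Hypothetical domain-specific entries in the simple user-dictionary
# layout: surface form, part-of-speech, reading. Check the lindera
# documentation for the exact columns your dictionary type expects.
entries = [
    ("東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"),
    ("とうきょうスカイツリー駅", "カスタム名詞", "トウキョウスカイツリーエキ"),
]

with open("userdict.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(entries)
```

The resulting file path is then supplied to the tokenizer at build time; examples/tokenize_with_userdict.py shows how to attach it.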

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
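
A small sanity check after export can confirm that the expected files were written. This sketch uses only the standard library; the file list mirrors the one above, and `exported_dict/` is assumed to be the output directory passed to export():

```python
from pathlib import Path

# Files export() is expected to produce (metadata.json only if provided).
EXPECTED = ["lex.csv", "matrix.def", "unk.def", "char.def"]

def check_export(output_dir):
    # Return the expected dictionary files missing from output_dir.
    out = Path(output_dir)
    return [name for name in EXPECTED if not (out / name).exists()]

missing = check_export("exported_dict")
if missing:
    print("missing files:", missing)
```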

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
