Python binding for Lindera with ko-dic Korean dictionary

lindera-python

Python binding for Lindera, a morphological analysis engine written in Rust that supports Japanese, Korean, and Chinese.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding exposes all of Lindera's major features:
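This particular PyPI distribution, lindera-python-ko-dic, appears to be the variant that bundles the ko-dic Korean dictionary; installing a prebuilt wheel with pip install lindera-python-ko-dic should suffice for most users, while the source build described below is intended for development.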

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded; see the sketch below)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support
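Because this distribution embeds ko-dic, Korean works out of the box. A minimal sketch, assuming the dictionary URI follows the same embedded:// pattern shown for IPADIC in the Quick Start below:

from lindera import TokenizerBuilder

# Point the builder at the embedded Korean dictionary
# ("embedded://ko-dic" is assumed by analogy with "embedded://ipadic")
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ko-dic")
tokenizer = builder.build()

# Tokenize Korean text
for token in tokenizer.tokenize("한국어 형태소 분석"):
    print(token.text)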

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Set up the repository and activate a virtual environment

# Clone lindera-python project repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python

# Set Python version for this project
% pyenv local 3.13.5

# Create a Python virtual environment
% python -m venv .venv

# Activate Python virtual environment
% source .venv/bin/activate

# Initialize lindera-python project
(.venv) % make init

Install lindera-python as a library in the virtual environment

This command takes a long time because it builds the native extension with all of the dictionaries embedded.

(.venv) % make develop
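Once the build finishes, a quick smoke test such as python -c "import lindera" confirms the module is importable from within the virtual environment.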

Quick Start

Basic Tokenization

from lindera import TokenizerBuilder

# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.text}, Position: {token.position}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically
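Here the mapping filter rewrites the prolonged sound mark ー to an ASCII hyphen and NFKC normalization folds full-width alphanumerics to their half-width forms before dictionary lookup runs, so the tokenizer only ever sees normalized text.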

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
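The plain stop-words filter listed under Filter Types follows the same pattern. A sketch, assuming the filter is registered as stop_words and takes a words list (the name and config key are taken from Lindera's filter set and are not verified against this binding):

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Drop tokens that exactly match a word list
# ("stop_words" and the "words" key are assumptions, by analogy with Lindera)
builder.append_token_filter("stop_words", {"words": ["の", "は", "です"]})

tokenizer = builder.build()
tokens = tokenizer.tokenize("これはテストです")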

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization (see the sketch below)
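A minimal sketch of preparing and loading a user dictionary, assuming Lindera's simple three-column CSV format (surface form, part-of-speech, reading); the set_user_dictionary method name is a guess, so check examples/tokenize_with_userdict.py for the actual call:

from lindera import TokenizerBuilder

# Write a user dictionary in Lindera's simple CSV format:
# surface,part-of-speech,reading
with open("userdic.csv", "w", encoding="utf-8") as f:
    f.write("東京スカイツリー,カスタム名詞,トウキョウスカイツリー\n")

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")
builder.set_user_dictionary("userdic.csv")  # hypothetical method name
tokenizer = builder.build()

tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅")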

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera

# Train a model from corpus
lindera.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)
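The trailing underscore in lambda_ avoids clashing with Python's lambda keyword; larger values apply stronger L1 regularization and yield sparser models.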

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
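These file names follow the MeCab-style dictionary source layout, so the exported directory should be usable wherever Lindera accepts dictionary source files, such as when building a custom dictionary for the tokenizer.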

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features (inspected in the sketch after this list)
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition
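Only text and position are exercised in the examples above; a minimal way to discover the remaining Token attributes at runtime, rather than assuming their names, is introspection:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

tokens = tokenizer.tokenize("東京")
token = next(iter(tokens))  # works whether a list or an iterator is returned
print(token.text, token.position)

# List the token's public attributes to locate the linguistic features
# (their exact names are not documented in this README)
print([name for name in dir(token) if not name.startswith("_")])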

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
