
lindera-python

Python binding for Lindera, a morphological analysis engine for Japanese, Korean, and Chinese text.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. Major features include:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
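Conceptually, character filters rewrite the raw text before segmentation, while token filters act on the tokens that come out. The following is a minimal pure-Python sketch of the preprocessing stage only (not Lindera's actual implementation), combining a mapping filter with NFKC Unicode normalization:

```python
import unicodedata

def mapping_filter(text: str, mapping: dict) -> str:
    # Replace each mapped substring; longest keys first so that
    # longer patterns win over their prefixes.
    for src in sorted(mapping, key=len, reverse=True):
        text = text.replace(src, mapping[src])
    return text

def unicode_normalize_filter(text: str, kind: str = "NFKC") -> str:
    # Unicode normalization, e.g. full-width digits become ASCII under NFKC.
    return unicodedata.normalize(kind, text)

text = "テストー１２３"
text = mapping_filter(text, {"ー": "-"})          # "テスト-１２３"
text = unicode_normalize_filter(text, "NFKC")     # "テスト-123"
print(text)
```

Lindera applies the equivalent transformations internally when filters are attached via `TokenizerBuilder`, as shown in the usage sections below.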

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Filters are applied automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
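As a sketch of what such a file might contain, the entries below use the simple three-column layout (surface form, part of speech, reading) that Lindera's simple user dictionaries follow; the entries themselves are hypothetical, and the detailed format has a different column layout, so check the Lindera documentation for your version:

```python
import csv

# Hypothetical domain-specific entries: surface form, part of speech, reading.
entries = [
    ["東京スカイツリー", "カスタム名詞", "トウキョウスカイツリー"],
    ["とうきょうスカイツリー駅", "カスタム名詞", "トウキョウスカイツリーエキ"],
]

with open("userdict.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(entries)
```

The resulting CSV would then be supplied when building the tokenizer; see examples/tokenize_with_userdict.py for the exact loading API.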

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
