
Python binding for Lindera with Jieba Chinese dictionary

Project description

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. It exposes all of Lindera's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support
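
Each embedded dictionary is addressed by an "embedded://" URI. A minimal sketch, assuming the non-IPADIC URIs follow the same naming pattern as the "embedded://ipadic" URI used in the Quick Start below:

from lindera.dictionary import load_dictionary

# "embedded://ipadic" appears in the Quick Start; the other URIs are
# assumptions based on the dictionary names.
ipadic = load_dictionary("embedded://ipadic")        # Japanese (IPADIC)
unidic = load_dictionary("embedded://unidic")        # Japanese (UniDic, assumed URI)
ko_dic = load_dictionary("embedded://ko-dic")        # Korean (assumed URI)
cc_cedict = load_dictionary("embedded://cc-cedict")  # Chinese (assumed URI)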

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically
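
Assuming the filters behave as configured, the tokenizer sees the pre-processed text rather than the raw input: the mapping filter rewrites the long vowel mark "ー" to "-", and NFKC normalization folds the full-width digits "１２３" into their ASCII equivalents, so "テストー１２３" would be tokenized as "テスト-123".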

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")
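
Token filters run after tokenization, in the order they were appended. Assuming the configuration above behaves as written, lowercase normalizes case, the length filter drops tokens shorter than 2 or longer than 10 characters, and japanese_stop_tags removes tokens whose part-of-speech tag is 助詞 (particle) or 助動詞 (auxiliary verb), so the particle "の" in "テキストの解析" would be filtered out.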

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")
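
The filtered tokens can be inspected the same way as in the basic example; a minimal sketch:

# Print the surface form of each token that survives the filter pipeline
for token in tokens:
    print(token.surface)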

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
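
A quick usage check of the pipeline above (illustrative input; the mapping filter rewrites "リンデラ" to "lindera" and "トウキョウ" to "東京" before tokenization):

tokens = tokenizer.tokenize("リンデラはトウキョウで開発されています")
for token in tokens:
    print(token.surface)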

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization (see the sketch below)
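
A minimal sketch of a user dictionary, assuming the three-column simple CSV format used by Lindera (surface, part of speech, reading); the set_user_dictionary call below is an assumption, so see examples/tokenize_with_userdict.py for the actual API:

# userdict.csv — one entry per line: surface,part-of-speech,reading
# 東京スカイツリー,カスタム名詞,トウキョウスカイツリー

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
builder.set_user_dictionary("userdict.csv")  # assumed method name
tokenizer = builder.build()
tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅")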

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

lindera_python_jieba-2.2.0-cp314-cp314t-win_arm64.whl (23.7 MB)
Uploaded: CPython 3.14t, Windows ARM64

lindera_python_jieba-2.2.0-cp313-cp313t-win_arm64.whl (23.7 MB)
Uploaded: CPython 3.13t, Windows ARM64

lindera_python_jieba-2.2.0-cp310-abi3-win_arm64.whl (23.7 MB)
Uploaded: CPython 3.10+, Windows ARM64

lindera_python_jieba-2.2.0-cp310-abi3-win_amd64.whl (23.8 MB)
Uploaded: CPython 3.10+, Windows x86-64

lindera_python_jieba-2.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
Uploaded: CPython 3.10+, manylinux: glibc 2.17+ x86-64

lindera_python_jieba-2.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (24.0 MB)
Uploaded: CPython 3.10+, manylinux: glibc 2.17+ ARM64

lindera_python_jieba-2.2.0-cp310-abi3-macosx_11_0_arm64.whl (24.0 MB)
Uploaded: CPython 3.10+, macOS 11.0+ ARM64

lindera_python_jieba-2.2.0-cp310-abi3-macosx_10_12_x86_64.whl (23.9 MB)
Uploaded: CPython 3.10+, macOS 10.12+ x86-64

File details

Hashes for each distribution file:

lindera_python_jieba-2.2.0-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 97d0f263ae11760984adb062f41386c7d3a7704dd09071f97b86b1edfea41bb1
  MD5: a2f83665356c5bac3631163ef222ed6e
  BLAKE2b-256: 0f50f1251666244ef719eb7f16b73597023f0d0a78d1c7a5a63a43e8af7a0daa

lindera_python_jieba-2.2.0-pp311-pypy311_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256: dc8c13b58ad00e8ae8184b6fd61f2652fab2f55232ab09bace5bd515e94ef5eb
  MD5: 6f71cf6fe9845d2c9e8b54b58eae8d8c
  BLAKE2b-256: 917a9e934b8da7ccf7ed388e90acf2c4608377158be7a61d9c1ff2894b2c64a6

lindera_python_jieba-2.2.0-cp314-cp314t-win_arm64.whl
  SHA256: 7de456219b7dd91b01a1022c2b09fbcc74af76e0fe57a5439d7d971ac7b9151a
  MD5: 25170aaf67f233945c2a8a85343703ea
  BLAKE2b-256: c3313606dd460ec2f7157e51b9738d97044c65f78cba225fa6d70012ceb7f5c8

lindera_python_jieba-2.2.0-cp313-cp313t-win_arm64.whl
  SHA256: c26d6495e14ea79fae1dc3d3990c6bc8dde667525f19f3c09411a953ac6cb651
  MD5: a1ac14a81a46a205dca08877916b4d97
  BLAKE2b-256: c5b5cf094b9ec03e9c587d3af0a1198910c31ff98146a8d3cf254856d8bdd10c

lindera_python_jieba-2.2.0-cp310-abi3-win_arm64.whl
  SHA256: e4c3230c1a28242b0272c0b3f0524627033e0c589b31e5591b39e4e88d7e43d7
  MD5: faee174ef64f3ecebdaff2283fcd8a90
  BLAKE2b-256: e50d759f4c2b8766bd1cfaa8e351fe0bb10c463ec6dc61a2c68b3485ad7e2e00

lindera_python_jieba-2.2.0-cp310-abi3-win_amd64.whl
  SHA256: 5f59256fe17284bb096b3517e77417c523f99e3359b22ac996deb806f1276378
  MD5: 16facd865b100612db28fdab4b4b83c7
  BLAKE2b-256: 8b745a3ae728059dca2b1d8f17962792caaf1f2bb0b6c2879b8b2f9f11c0e17b

lindera_python_jieba-2.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 1c267a90b17361c6a72c7d288fd96aa17ec4490fafd0a17caaffb1c67ccbfcff
  MD5: d7e411ddcd59c2e7a1183ac72d3db994
  BLAKE2b-256: 3d76645864f146b31c8af471ec4e352c69302ca5fba9d7b7f5266732467042e8

lindera_python_jieba-2.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256: 28cd4da2e629729d5fd3070db621bf29458c9b9c3c8cc2c697a10bf9b3db5257
  MD5: 705054eec5a8916f6faafeb61dfd690d
  BLAKE2b-256: 7adc9817ee468097c27c3954b9115ba558829a97f38ef9d3019c832bf1739d09

lindera_python_jieba-2.2.0-cp310-abi3-macosx_11_0_arm64.whl
  SHA256: 280c058af5403d3c7861279efc17e680cca68022e9564359256c5897f220f499
  MD5: 9674606f2167b6d345387484b5b446bf
  BLAKE2b-256: 946c357e03911cbdeabe985296bd5649742a7fb40c3d7e3e11be6de66da695b3

lindera_python_jieba-2.2.0-cp310-abi3-macosx_10_12_x86_64.whl
  SHA256: 7f36be9d2347b9acc7c240a8acee3600b8cd443b7d81140d3f0177485a3b539c
  MD5: 8b4e75d192f715028bf7104ee7fc2db1
  BLAKE2b-256: 77b8d57c87c44fbfa108570a7cfa055ea5945290d4acc1cf660bc1e0fd55ab4f
