Python binding for Lindera with CC-CEDICT Chinese dictionary

lindera-python

Python binding for Lindera, a morphological analysis engine for Japanese, Korean, and Chinese text.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding covers the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization
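
What Unicode normalization does can be previewed with Python's standard library alone. The snippet below uses unicodedata, not the lindera API, purely to illustrate the transformation that the NFKC form performs:

```python
import unicodedata

# NFKC folds compatibility characters into their canonical equivalents:
# full-width digits become ASCII, half-width katakana becomes full-width,
# circled numbers become plain digits, and so on.
print(unicodedata.normalize("NFKC", "１２３"))  # full-width digits -> "123"
print(unicodedata.normalize("NFKC", "ﾃｽﾄ"))    # half-width katakana -> "テスト"
print(unicodedata.normalize("NFKC", "①"))      # circled one -> "1"
```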

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
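
Conceptually, token filters are applied one after another to the token stream produced by the tokenizer. The plain-Python sketch below (illustrative only, not the lindera API) shows the effect of a lowercase → length → stop-word chain:

```python
# Each stage consumes and produces a list of token strings.
tokens = ["The", "Quick", "Brown", "Fox", "a", "extraordinarily"]

# Lowercase filter: case-fold every token.
tokens = [t.lower() for t in tokens]

# Length filter: keep tokens between 2 and 10 characters.
tokens = [t for t in tokens if 2 <= len(t) <= 10]

# Stop-word filter: drop tokens found in a stop list.
stop_words = {"the", "a"}
tokens = [t for t in tokens if t not in stop_words]

print(tokens)  # ['quick', 'brown', 'fox']
```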

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera-python project repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python

# Set Python version for this project
% pyenv local 3.13.5

# Make Python virtual environment
% python -m venv .venv

# Activate Python virtual environment
% source .venv/bin/activate

# Initialize lindera-python project
(.venv) % make init

Install lindera-python as a library in the virtual environment

This command can take a long time because it builds the native extension with all embedded dictionaries included.

(.venv) % make develop

Quick Start

Basic Tokenization

from lindera import TokenizerBuilder

# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.text}, Position: {token.position}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")
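
The pipeline order is fixed: character filters transform the raw text first, the dictionary-based tokenizer then segments the filtered text, and token filters post-process the resulting tokens. In plain Python terms (a conceptual sketch with dummy stand-ins, not the lindera implementation):

```python
def run_pipeline(text, char_filters, tokenize, token_filters):
    # 1. Character filters transform the raw input text.
    for f in char_filters:
        text = f(text)
    # 2. The tokenizer segments the filtered text.
    tokens = tokenize(text)
    # 3. Token filters transform the token stream.
    for f in token_filters:
        tokens = f(tokens)
    return tokens

# Dummy stand-ins to show the flow:
result = run_pipeline(
    "Hello\u3000World",  # contains a full-width (ideographic) space
    char_filters=[lambda s: s.replace("\u3000", " ")],
    tokenize=str.split,
    token_filters=[lambda ts: [t.lower() for t in ts]],
)
print(result)  # ['hello', 'world']
```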

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See examples/ directory for comprehensive examples including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
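
A user dictionary is a plain CSV file with one entry per line. The sketch below follows Lindera's simple user-dictionary layout (surface form, part of speech, reading); treat the exact columns as an assumption, and see examples/tokenize_with_userdict.py for the form this binding actually expects:

```csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
リンデラ,カスタム名詞,リンデラ
```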

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera

# Train a model from corpus
lindera.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
