
Python binding for Lindera with CC-CEDICT Chinese dictionary


lindera-python

Python binding for Lindera, a morphological analysis engine written in Rust.

Overview

lindera-python provides a comprehensive Python interface to Lindera 1.1.1, a morphological analysis engine supporting Japanese, Korean, and Chinese text. The binding exposes all of the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
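
To illustrate the pipeline order conceptually — character filters transform the raw text before segmentation, token filters transform the token stream after it — here is a plain-Python sketch. This is not Lindera's implementation, only an analogy using the standard library:

```python
import unicodedata

def char_filters(text: str) -> str:
    """Pre-segmentation text normalization (analogous to character filters)."""
    text = text.replace("ー", "-")              # mapping filter
    return unicodedata.normalize("NFKC", text)  # unicode_normalize (NFKC)

def token_filters(tokens: list[str]) -> list[str]:
    """Post-segmentation token refinement (analogous to token filters)."""
    tokens = [t.lower() for t in tokens]                # lowercase
    tokens = [t for t in tokens if 2 <= len(t) <= 10]   # length filter (min/max)
    stop_words = {"no", "of"}
    return [t for t in tokens if t not in stop_words]   # stop word filter

text = char_filters("ＡＢＣ Ｄ ー TEST")  # NFKC folds fullwidth ＡＢＣ/Ｄ to ASCII
# A real tokenizer segments with a dictionary; whitespace split stands in here.
tokens = token_filters(text.split())
print(tokens)  # ['abc', 'test'] — 'd' and '-' are dropped by the length filter
```

The order matters: because character filters run first, the mapping and normalization results are what the dictionary lookup (and every token filter) actually sees.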

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")
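
The `byte_start`/`byte_end` names suggest the positions are offsets into the UTF-8 encoding of the input (an assumption from the names, not stated by this README). Since most Japanese characters occupy 3 bytes in UTF-8, byte and character indices diverge:

```python
# Recovering a token's surface from byte offsets, assuming UTF-8 byte offsets.
text = "すもももももももものうち"
encoded = text.encode("utf-8")

# Each hiragana character here is 3 bytes in UTF-8, so the first word of the
# example sentence, すもも ("plum", 3 characters), spans bytes 0..9, not 0..3.
byte_start, byte_end = 0, 9
surface = encoded[byte_start:byte_end].decode("utf-8")
print(surface)                                        # すもも
print(len("すもも"), len("すもも".encode("utf-8")))  # 3 9
```

Keep this in mind when slicing the original string: slice the encoded bytes by `byte_start:byte_end` and decode, rather than indexing the `str` directly.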

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
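
This README does not show the CSV layout. In upstream Lindera, the simple user-dictionary format is three columns per line — surface form, part of speech, reading — which may differ by version; see examples/tokenize_with_userdict.py for the authoritative usage:

```csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
```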

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
