
Python binding for Lindera with CC-CEDICT Chinese dictionary


lindera-python

Python binding for Lindera, a morphological analysis engine supporting Japanese, Korean, and Chinese.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. It exposes the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
import lindera.trainer

lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.

