lindera-python

Python binding for Lindera, a morphological analysis engine for Japanese, Korean, and Chinese text.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding covers all major engine features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support
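Each embedded dictionary is selected by URI on the builder. Only the `embedded://ipadic` form is shown in the examples below; the other URIs in this sketch follow the same naming pattern but are assumptions, so verify them against your build:

```python
# Dictionary URIs passed to TokenizerBuilder.set_dictionary().
# "embedded://ipadic" matches the Quick Start examples; the remaining
# entries are inferred from the dictionary names and may differ.
EMBEDDED_DICTIONARIES = {
    "japanese-ipadic": "embedded://ipadic",
    "japanese-unidic": "embedded://unidic",   # assumption
    "korean": "embedded://ko-dic",            # assumption
    "chinese": "embedded://cc-cedict",        # assumption
}

# Hypothetical usage (requires lindera-python to be installed):
# builder = TokenizerBuilder()
# builder.set_dictionary(EMBEDDED_DICTIONARIES["korean"])
```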

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
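The two filter stages run at different points in the pipeline: character filters rewrite the raw text before tokenization, while token filters transform the token stream afterwards. A minimal pure-Python sketch of that ordering (the functions here are illustrative stand-ins, not the Lindera API; the real filters are configured on TokenizerBuilder):

```python
def char_filter(text: str) -> str:
    # stand-in for a mapping filter: replace the prolonged sound mark with "-"
    return text.replace("ー", "-")

def naive_tokenize(text: str) -> list[str]:
    # stand-in for dictionary-based tokenization
    return text.split("-")

def token_filter(tokens: list[str]) -> list[str]:
    # stand-in for a lowercase filter combined with a min-length filter
    return [t.lower() for t in tokens if len(t) >= 2]

def pipeline(text: str) -> list[str]:
    # character filters -> tokenizer -> token filters, in that order
    return token_filter(naive_tokenize(char_filter(text)))

print(pipeline("ABCーDEーF"))  # -> ['abc', 'de']
```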

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera-python project repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python

# Set Python version for this project
% pyenv local 3.13.5

# Make Python virtual environment
% python -m venv .venv

# Activate Python virtual environment
% source .venv/bin/activate

# Initialize lindera-python project
(.venv) % make init

Install lindera-python as a library in the virtual environment

This command takes a long time because it builds a library that includes all the dictionaries.

(.venv) % make develop

Quick Start

Basic Tokenization

from lindera import TokenizerBuilder

# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.text}, Position: {token.position}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See examples/ directory for comprehensive examples including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
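As a rough sketch of that CSV customization: Lindera's simple user-dictionary format is, to the best of this guide's knowledge, one entry per line with surface form, part-of-speech, and reading. Treat the exact column layout as an assumption and check examples/tokenize_with_userdict.py for the authoritative format:

```python
import csv
import io

# Hypothetical user-dictionary entries in Lindera's simple CSV format:
# surface,part-of-speech,reading (column layout is an assumption).
USERDICT_CSV = (
    "東京スカイツリー,カスタム名詞,トウキョウスカイツリー\n"
    "とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ\n"
)

rows = list(csv.reader(io.StringIO(USERDICT_CSV)))
for surface, pos, reading in rows:
    print(f"{surface} [{pos}] -> {reading}")
```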

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera

# Train a model from corpus
lindera.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition
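For orientation, a token can be pictured as a small record like the sketch below. The `text` and `position` fields match the Basic Tokenization example above; `features` is a placeholder name for the dictionary columns (part-of-speech, reading, and so on) and may not match the real attribute name:

```python
from dataclasses import dataclass, field

@dataclass
class TokenSketch:
    # `text` and `position` appear in the Quick Start example;
    # `features` is an assumed name for the linguistic feature columns.
    text: str
    position: int
    features: list[str] = field(default_factory=list)

t = TokenSketch(text="すもも", position=0, features=["名詞", "一般"])
print(f"Text: {t.text}, Position: {t.position}")
```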

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
