Python binding for Lindera with CC-CEDICT Chinese dictionary

lindera-python

Python binding for Lindera, a morphological analysis engine for Japanese, Korean, and Chinese text.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. The binding covers the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization
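
What Unicode normalization does can be previewed with Python's standard library alone. The snippet below uses unicodedata, not the lindera API, purely to illustrate the transformation that the NFKC form performs:

```python
import unicodedata

# NFKC folds compatibility characters into their canonical equivalents:
# full-width digits become ASCII, half-width katakana becomes full-width,
# circled numbers become plain digits, and so on.
print(unicodedata.normalize("NFKC", "１２３"))  # full-width digits -> "123"
print(unicodedata.normalize("NFKC", "ﾃｽﾄ"))    # half-width katakana -> "テスト"
print(unicodedata.normalize("NFKC", "①"))      # circled one -> "1"
```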

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters
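
Conceptually, token filters are applied one after another to the token stream produced by the tokenizer. The plain-Python sketch below (illustrative only, not the lindera API) shows the effect of a lowercase → length → stop-word chain:

```python
# Each stage consumes and produces a list of token strings.
tokens = ["The", "Quick", "Brown", "Fox", "a", "extraordinarily"]

# Lowercase filter: case-fold every token.
tokens = [t.lower() for t in tokens]

# Length filter: keep tokens between 2 and 10 characters.
tokens = [t for t in tokens if 2 <= len(t) <= 10]

# Stop-word filter: drop tokens found in a stop list.
stop_words = {"the", "a"}
tokens = [t for t in tokens if t not in stop_words]

print(tokens)  # ['quick', 'brown', 'fox']
```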

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera-python project repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python

# Set Python version for this project
% pyenv local 3.13.5

# Make Python virtual environment
% python -m venv .venv

# Activate Python virtual environment
% source .venv/bin/activate

# Initialize lindera-python project
(.venv) % make init

Install lindera-python as a library in the virtual environment

This command can take a long time because it builds the native extension with all embedded dictionaries included.

(.venv) % make develop

Quick Start

Basic Tokenization

from lindera import TokenizerBuilder

# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.text}, Position: {token.position}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")
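
The pipeline order is fixed: character filters transform the raw text first, the dictionary-based tokenizer then segments the filtered text, and token filters post-process the resulting tokens. In plain Python terms (a conceptual sketch with dummy stand-ins, not the lindera implementation):

```python
def run_pipeline(text, char_filters, tokenize, token_filters):
    # 1. Character filters transform the raw input text.
    for f in char_filters:
        text = f(text)
    # 2. The tokenizer segments the filtered text.
    tokens = tokenize(text)
    # 3. Token filters transform the token stream.
    for f in token_filters:
        tokens = f(tokens)
    return tokens

# Dummy stand-ins to show the flow:
result = run_pipeline(
    "Hello\u3000World",  # contains a full-width (ideographic) space
    char_filters=[lambda s: s.replace("\u3000", " ")],
    tokenize=str.split,
    token_filters=[lambda ts: [t.lower() for t in ts]],
)
print(result)  # ['hello', 'world']
```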

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See examples/ directory for comprehensive examples including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
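
A user dictionary is a plain CSV file with one entry per line. The sketch below follows Lindera's simple user-dictionary layout (surface form, part of speech, reading); treat the exact columns as an assumption, and see examples/tokenize_with_userdict.py for the form this binding actually expects:

```csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
リンデラ,カスタム名詞,リンデラ
```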

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera

# Train a model from corpus
lindera.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
