Python binding for Lindera with UniDic dictionary

Project description

lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a comprehensive Python interface to Lindera 1.1.1, a morphological analysis engine supporting Japanese, Korean, and Chinese text. It covers all of the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python
% pyenv install 3.13.5

Setup repository and activate virtual environment

# Clone lindera-python project repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python

# Set Python version for this project
% pyenv local 3.13.5

# Make Python virtual environment
% python -m venv .venv

# Activate Python virtual environment
% source .venv/bin/activate

# Initialize lindera-python project
(.venv) % make init

Install lindera-python as a library in the virtual environment

This command can take a long time because it builds a library that embeds all the dictionaries.

(.venv) % make develop
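
Under the hood, make develop appears to run maturin develop (the same tool used for the training build shown later), which compiles the Rust extension and installs it into the active virtual environment.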

Quick Start

Basic Tokenization

from lindera import TokenizerBuilder

# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.text}, Position: {token.position}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically
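
Character filters are applied to the raw text, in the order they were appended, before tokenization: here the mapping filter rewrites the prolonged sound mark "ー" to "-", and NFKC normalization folds any full-width compatibility characters, so the tokenizer effectively sees "テスト-123".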

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization (see the sketch below)
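
A minimal sketch of using a user dictionary, assuming Lindera's simple CSV format (surface form, part-of-speech, reading) and a hypothetical set_user_dictionary method on the builder; see examples/tokenize_with_userdict.py for the exact API.

from lindera import TokenizerBuilder

# userdict.csv, one entry per line:
# 東京スカイツリー,カスタム名詞,トウキョウスカイツリー

builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
builder.set_user_dictionary("userdict.csv")  # hypothetical method name
tokenizer = builder.build()

# 東京スカイツリー is now emitted as a single token instead of being split
tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅")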

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera

# Train a model from corpus
lindera.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)
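
Here lambda_ controls the strength of L1 regularization (larger values drive more feature weights to zero, yielding a smaller, sparser model), max_iter caps the number of optimization iterations, and max_threads=None lets the trainer use all available CPU cores.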

Exporting Dictionary Files

# Export trained model to dictionary files
lindera.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • lindera_python_unidic-2.0.0-cp314-cp314t-win_arm64.whl (46.6 MB): CPython 3.14t, Windows ARM64
  • lindera_python_unidic-2.0.0-cp313-cp313t-win_arm64.whl (46.6 MB): CPython 3.13t, Windows ARM64
  • lindera_python_unidic-2.0.0-cp310-abi3-win_arm64.whl (46.6 MB): CPython 3.10+, Windows ARM64
  • lindera_python_unidic-2.0.0-cp310-abi3-win_amd64.whl (46.7 MB): CPython 3.10+, Windows x86-64
  • lindera_python_unidic-2.0.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.2 MB): CPython 3.10+, manylinux (glibc 2.17+) x86-64
  • lindera_python_unidic-2.0.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (47.2 MB): CPython 3.10+, manylinux (glibc 2.17+) ARM64
  • lindera_python_unidic-2.0.0-cp310-abi3-macosx_11_0_arm64.whl (47.3 MB): CPython 3.10+, macOS 11.0+ ARM64
  • lindera_python_unidic-2.0.0-cp310-abi3-macosx_10_12_x86_64.whl (47.0 MB): CPython 3.10+, macOS 10.12+ x86-64

File details

File hashes for each distribution:

lindera_python_unidic-2.0.0-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      11c1e5a713bb49677386d3fe7f20341f7fb56e900c4719f9ce0e9809b6b54f5f
  MD5:         982c5e30a97571f1a4427a75c4ef08ba
  BLAKE2b-256: 9cef2b8137f4ef963ec5fed0082a2cb5b08a1a1d5810b5983ca1c08f0809e258

lindera_python_unidic-2.0.0-pp311-pypy311_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256:      072d3120c1512305cc8a89e736dc7cd0d1f126e01880703832c1c37ee0f95f2a
  MD5:         6657b85f1a54ee556e0e4ff927605b30
  BLAKE2b-256: 923a0cdc816e5aef2b69af18bd32bfe1f5f72f52a941fab650a0a66d4c876cdd

lindera_python_unidic-2.0.0-cp314-cp314t-win_arm64.whl
  SHA256:      82867d952261116aafc91e5cfba27bcedd6e43cad9900feac5120c3a938c3007
  MD5:         73941d6f78ca0e24fbd5d1451e613fac
  BLAKE2b-256: 5bd0a6b9ea9ef38dad5dc5aa15eb4b8e071fe06461185f0a6a6cb5cca0993a3e

lindera_python_unidic-2.0.0-cp313-cp313t-win_arm64.whl
  SHA256:      a65a1e75fc5cc4ad986aba12060aff9fbfd118558b03846ccbe723152d759c6a
  MD5:         6bf2fae77c1750a6d4fbb41aef9e9180
  BLAKE2b-256: aa3e274b17f9a21f786f7411dfdb3abaa3ff2fd421676a18bb1a2b1d1265b47e

lindera_python_unidic-2.0.0-cp310-abi3-win_arm64.whl
  SHA256:      d8d467b87726f9e2953683189f00df0592ae9d679192398dfe05b0dce33b9555
  MD5:         4a3a7f819f991038c85a589af97e70b1
  BLAKE2b-256: cedded73fb0575e9ef9afdaa73d0a6ca2859794dd1a39220b00e45e21ac30824

lindera_python_unidic-2.0.0-cp310-abi3-win_amd64.whl
  SHA256:      b90465f58d69bd2964720d739a406906eebfc36eaf96d68e64eb17551aa0ab10
  MD5:         67d54827c248a2202191a77da439e7d6
  BLAKE2b-256: 52d31fae62d56ba80fa3991c6b35436825ae3d62d4dfad7843b9652fdb82a980

lindera_python_unidic-2.0.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256:      54bf935efe3a9742ee6cc7a15c4465382bd9e56113c09ba696f031dec6249f4f
  MD5:         0f535e88c4c63158df9fc7b4ea1f708f
  BLAKE2b-256: 4ff116b3958db6b0fa3080893d16169a2a6117ea8844b8c42ed935575d73f037

lindera_python_unidic-2.0.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256:      249f9b6d7e73a8c9cdeb3bca8cd4b0d40d49ea9cc76912e36a522a407edefac5
  MD5:         198d816bb47e5c356e25cd369e2f0d28
  BLAKE2b-256: 1d598233d5037dd2966ef2a08683e4000381eeaad6e4b039b3b2330450c8a7a6

lindera_python_unidic-2.0.0-cp310-abi3-macosx_11_0_arm64.whl
  SHA256:      7e295ca68f3b1caf59169ceaec6a560b74350ea02c9e08c0863451c24bcd0599
  MD5:         728eb292e4545a9568f4d2f774b5ebbc
  BLAKE2b-256: e0f3d14372be035886d4128c4dd9315d66920f4a0006f5eefa0fbebaae3af7cd

lindera_python_unidic-2.0.0-cp310-abi3-macosx_10_12_x86_64.whl
  SHA256:      1528ad9782575fa0e193b8a5bf5036d6ebbb54fe872ccb243836db7355ba8272
  MD5:         d4d4b4517a737da5339e7e217d64ddf2
  BLAKE2b-256: 9ce146b227db93eef8a9fbe3ddff5edfc2344831a7c674a39fad2d62348cab5a
