lindera-python

Python binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. It covers all of the engine's major features:

  • Multi-language Support: Japanese (IPADIC, UniDic), Korean (ko-dic), Chinese (CC-CEDICT)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC (embedded), UniDic (embedded)
  • Korean: ko-dic (embedded)
  • Chinese: CC-CEDICT (embedded)
  • Custom: User dictionary support
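
Each embedded dictionary is loaded through a URI-style identifier. Only "embedded://ipadic" appears in this README's examples; the identifiers for the other dictionaries in the sketch below are assumed by analogy and should be verified against the Lindera documentation:

from lindera.dictionary import load_dictionary

# "embedded://ipadic" is the identifier used throughout this README
ja = load_dictionary("embedded://ipadic")

# Assumed identifiers for the other embedded dictionaries (unverified):
# ja_unidic = load_dictionary("embedded://unidic")
# ko = load_dictionary("embedded://ko-dic")
# zh = load_dictionary("embedded://cc-cedict")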

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Install Python

# Install Python with pyenv
% pyenv install 3.13.5

Set up the repository and activate the virtual environment

# Clone lindera project repository
% git clone git@github.com:lindera/lindera.git
% cd lindera

# Create Python virtual environment and initialize
% make init

# Activate Python virtual environment
% source .venv/bin/activate

Install lindera-python in the virtual environment

This command builds the library with development settings (debug build).

(.venv) % make python-develop

Quick Start

Basic Tokenization

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Load dictionary
dictionary = load_dictionary("embedded://ipadic")

# Create a tokenizer
tokenizer = Tokenizer(dictionary, mode="normal")

# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)

for token in tokens:
    print(f"Text: {token.surface}, Position: {token.byte_start}-{token.byte_end}")

Using Character Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text)  # Will apply filters automatically
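# Character filters run before tokenization, presumably in the order they
# were appended: the mapping filter first rewrites "ー" to "-", then NFKC
# normalization applies, so the tokenizer effectively sees "テスト-123".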

Using Token Filters

from lindera import TokenizerBuilder

# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})

# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")

Integrated Pipeline

from lindera import TokenizerBuilder

# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")

# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

# Add token filters  
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")

Working with Metadata

from lindera import Metadata

# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")

# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}")  # First 5 fields

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept configuration as dictionary arguments:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")

# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
    "normalize_kanji": "true",
    "normalize_kana": "true"
})
builder.append_character_filter("mapping", {
    "mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})

# Token filters with dict configuration  
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
    "tags": ["助詞", "助動詞", "記号"]
})

# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")

tokenizer = builder.build()
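
# The built tokenizer applies every configured filter on each call; this
# input exercises the mapping entries defined above.
tokens = tokenizer.tokenize("リンデラはトウキョウで動作する")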

See the examples/ directory for comprehensive examples, including:

  • tokenize.py: Basic tokenization
  • tokenize_with_filters.py: Using character and token filters
  • tokenize_with_userdict.py: Custom user dictionary
  • train_and_export.py: Train and export custom dictionaries (requires train feature)
  • Multi-language tokenization
  • Advanced configuration options

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization (see the sketch below)
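
A minimal sketch of a user dictionary, assuming Lindera's simple three-column CSV format (surface form, part-of-speech, reading). The set_user_dictionary method name is hypothetical; see examples/tokenize_with_userdict.py for the actual API:

from lindera import TokenizerBuilder

# userdict.csv, assuming the simple format "surface,part-of-speech,reading":
#   東京スカイツリー,カスタム名詞,トウキョウスカイツリー

builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")
builder.set_user_dictionary("userdict.csv")  # hypothetical method name
tokenizer = builder.build()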

Dictionary Training (Experimental)

lindera-python supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

# Install with training support
(.venv) % maturin develop --features train

Training a Model

import lindera.trainer

# Train a model from corpus
lindera.trainer.train(
    seed="path/to/seed.csv",           # Seed lexicon
    corpus="path/to/corpus.txt",       # Training corpus
    char_def="path/to/char.def",       # Character definitions
    unk_def="path/to/unk.def",         # Unknown word definitions
    feature_def="path/to/feature.def", # Feature templates
    rewrite_def="path/to/rewrite.def", # Rewrite rules
    output="model.dat",                # Output model file
    lambda_=0.01,                      # L1 regularization
    max_iter=100,                      # Max iterations
    max_threads=None                   # Auto-detect CPU cores
)

Exporting Dictionary Files

import lindera.trainer

# Export trained model to dictionary files
lindera.trainer.export(
    model="model.dat",              # Trained model
    output="exported_dict/",        # Output directory
    metadata="metadata.json"        # Optional metadata file
)

This will create:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)

See examples/train_and_export.py for a complete example.
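
The exported directory should then be loadable as a regular dictionary. The path-based call below is an assumption (only the embedded:// form appears in this README), so refer to examples/train_and_export.py for the actual loading code:

from lindera.dictionary import load_dictionary
from lindera.tokenizer import Tokenizer

# Hypothetical: load the exported dictionary from its output directory
dictionary = load_dictionary("exported_dict/")
tokenizer = Tokenizer(dictionary, mode="normal")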

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • CharacterFilter: Text preprocessing filters
  • TokenFilter: Token post-processing filters
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (requires train feature)

  • train(): Train a morphological analysis model from corpus
  • export(): Export trained model to dictionary files

See the test_basic.py file for comprehensive API usage examples.
