Korean Morphological Analyzer - MeCab-Ko Python Bindings

mecab-ko-python

Python bindings for MeCab-Ko (Korean morphological analyzer)

Overview

This package provides Python bindings for MeCab-Ko, a Korean morphological analyzer implemented in Rust. Its API is compatible with KoNLPy's Mecab interface, so existing KoNLPy code can switch to the Rust backend with minimal changes.

Features

  • Fast: Rust-based implementation with zero-copy parsing
  • Memory-efficient: Optimized data structures for Korean text processing
  • Thread-safe: Safe concurrent operations
  • KoNLPy-compatible: Drop-in replacement for KoNLPy's Mecab
  • Type hints: Full type annotation support for better IDE integration

Installation

From PyPI (Recommended)

pip install mecab-ko-python

Pre-built wheels are available for:

  • Linux (x86_64, aarch64)
  • macOS (x86_64, Apple Silicon)
  • Windows (x86_64)

From Source

If you need to build from source:

# Install Rust toolchain (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install maturin
pip install maturin

# Build and install
git clone https://github.com/hephaex/mecab-ko.git
cd mecab-ko/rust/crates/mecab-ko-python
maturin develop --release

Usage

from mecab_ko import Mecab

# Create tokenizer instance
mecab = Mecab()

# Extract morphemes
morphemes = mecab.morphs("안녕하세요")
print(morphemes)
# ['안녕', '하', '세요']

# Extract nouns
nouns = mecab.nouns("아버지가방에들어가신다")
print(nouns)
# ['아버지', '가방']

# Part-of-speech tagging
tagged = mecab.pos("나는 학생입니다")
print(tagged)
# [('나', 'NP'), ('는', 'JX'), ('학생', 'NNG'), ('이', 'VCP'), ('ㅂ니다', 'EF')]

# MeCab format output
result = mecab.parse("안녕하세요")
print(result)
# 안녕    NNG,*,*,안녕,*,*,*,*
# 하      XSV,*,*,하,*,*,*,*
# 세요    EF,*,*,세요,*,*,*,*
# EOS

API Reference

Mecab(dicpath=None)

Create a new Mecab tokenizer instance.

Parameters:

  • dicpath (str, optional): Path to dictionary directory

Returns:

  • Mecab: Tokenizer instance

mecab.morphs(text)

Extract morphemes from text.

Parameters:

  • text (str): Input text

Returns:

  • list[str]: List of morphemes

mecab.nouns(text)

Extract nouns from text.

Parameters:

  • text (str): Input text

Returns:

  • list[str]: List of nouns

mecab.pos(text)

Perform part-of-speech tagging.

Parameters:

  • text (str): Input text

Returns:

  • list[tuple[str, str]]: List of (surface, pos_tag) tuples
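
Because pos returns plain (surface, pos_tag) tuples, the result can be post-processed with ordinary Python. The sketch below keeps only the tokens whose tag starts with a given prefix. filter_by_tag is a hypothetical helper, not part of the package, and the input is the sample output from the Usage section rather than a live call.

```python
from typing import Iterable


def filter_by_tag(tagged: Iterable[tuple[str, str]], prefixes: Iterable[str]) -> list[str]:
    """Return the surfaces whose POS tag starts with any of the given prefixes."""
    wanted = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    return [surface for surface, tag in tagged if tag.startswith(wanted)]


# Sample copied from the Usage section; in real use this would be mecab.pos(text).
tagged = [("나", "NP"), ("는", "JX"), ("학생", "NNG"), ("이", "VCP"), ("ㅂ니다", "EF")]

# Keep nouns (NN*) and pronouns (NP).
nominals = filter_by_tag(tagged, ["NN", "NP"])
print(nominals)  # ['나', '학생']
```

Matching on a prefix rather than an exact tag lets one filter cover a whole family of tags, such as NNG and NNP together.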

mecab.parse(text)

Parse text and return MeCab format output.

Parameters:

  • text (str): Input text

Returns:

  • str: MeCab format string with tab-separated values
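
The string returned by parse can be split back into structured records. The following is a minimal sketch, assuming the layout shown in the Usage section: one token per line, the surface form and a comma-separated feature list separated by a tab, and a final EOS line. parse_mecab_output is a hypothetical helper name, not part of the package.

```python
def parse_mecab_output(output: str) -> list[tuple[str, list[str]]]:
    """Split MeCab-format text into (surface, features) records."""
    records = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, features = line.partition("\t")
        records.append((surface, features.split(",")))
    return records


# Sample mirroring the Usage section output; in real use this would be mecab.parse(text).
sample = (
    "안녕\tNNG,*,*,안녕,*,*,*,*\n"
    "하\tXSV,*,*,하,*,*,*,*\n"
    "세요\tEF,*,*,세요,*,*,*,*\n"
    "EOS\n"
)

records = parse_mecab_output(sample)
for surface, features in records:
    print(surface, features[0])  # the first feature field is the POS tag
```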

Korean POS Tags

The analyzer uses the Sejong POS tag set:

  • NNG: General noun (일반 명사)
  • NNP: Proper noun (고유 명사)
  • NP: Pronoun (대명사)
  • VV: Verb (동사)
  • VA: Adjective (형용사)
  • JX: Auxiliary particle (보조사)
  • JKS: Subject particle (주격조사)
  • JKO: Object particle (목적격조사)
  • EF: Final ending (종결어미)
  • And many more...
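
For quick inspection it can help to attach a readable description to each tag. The sketch below builds a lookup table from only the tags listed above. SEJONG_TAGS and describe are hypothetical names, and tags missing from the table (VCP in the sample, for instance) get a placeholder instead of a guessed description.

```python
# Descriptions copied from the tag list above; the table is deliberately incomplete.
SEJONG_TAGS = {
    "NNG": "General noun",
    "NNP": "Proper noun",
    "NP": "Pronoun",
    "VV": "Verb",
    "VA": "Adjective",
    "JX": "Auxiliary particle",
    "JKS": "Subject particle",
    "JKO": "Object particle",
    "EF": "Final ending",
}


def describe(tagged: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Attach a description to each (surface, tag) pair."""
    return [
        (surface, tag, SEJONG_TAGS.get(tag, "(not listed)"))
        for surface, tag in tagged
    ]


# Sample copied from the Usage section; in real use this would be mecab.pos(text).
tagged = [("나", "NP"), ("는", "JX"), ("학생", "NNG"), ("이", "VCP"), ("ㅂ니다", "EF")]
for surface, tag, description in describe(tagged):
    print(f"{surface}\t{tag}\t{description}")
```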

Performance

The Rust implementation aims to improve on the performance of the original C++ implementation through:

  • Fast tokenization with zero-copy parsing
  • Memory-efficient data structures
  • Thread-safe operations
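
One way to use the thread-safety claim is to fan a batch of texts out over a thread pool. The sketch below takes the tokenizer as a plain callable so that it runs without the package installed; in real use the callable would be something like Mecab().morphs. analyze_concurrently is a hypothetical helper, and sharing one tokenizer across threads relies on the thread-safety stated above. Whether threads give a real speedup depends on whether the binding releases the GIL, which this README does not state.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def analyze_concurrently(
    tokenize: Callable[[str], list[str]],
    texts: Sequence[str],
    max_workers: int = 4,
) -> list[list[str]]:
    """Apply tokenize to every text on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(tokenize, texts))


# Stand-in tokenizer (whitespace split) so the sketch runs anywhere.
texts = ["나는 학생입니다", "안녕하세요"]
results = analyze_concurrently(str.split, texts)
print(results)  # [['나는', '학생입니다'], ['안녕하세요']]
```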

Migration from KoNLPy

If you're currently using KoNLPy's Mecab, you can migrate with minimal changes:

# Before (KoNLPy)
from konlpy.tag import Mecab
mecab = Mecab()

# After (mecab-ko-python)
from mecab_ko import Mecab
mecab = Mecab()

# The API is identical
mecab.morphs("안녕하세요")
mecab.nouns("아버지가방에들어가신다")
mecab.pos("나는 학생입니다")
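
During a gradual migration, code may need to run where only one of the two packages is installed. The sketch below tries each candidate module in order and returns the first Mecab class it finds, or None if none can be imported. load_mecab_class and CANDIDATES are hypothetical names. The module paths are the ones shown above.

```python
import importlib
from typing import Optional

# (module, attribute) pairs in order of preference, taken from the imports above.
CANDIDATES = (("mecab_ko", "Mecab"), ("konlpy.tag", "Mecab"))


def load_mecab_class(candidates=CANDIDATES) -> Optional[type]:
    """Return the first importable tokenizer class, or None if none is available."""
    for module_name, attribute in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # not installed; try the next candidate
        cls = getattr(module, attribute, None)
        if cls is not None:
            return cls
    return None


Mecab = load_mecab_class()
if Mecab is None:
    print("No Mecab implementation is installed")
else:
    print(f"Using Mecab from {Mecab.__module__}")
```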

Development Requirements

This crate uses PyO3 to create Python bindings. Building requires Python development headers.

System Dependencies

Ubuntu/Debian:

sudo apt install python3-dev

Fedora/RHEL:

sudo dnf install python3-devel

macOS (with Homebrew):

brew install python

Windows: Install Python from python.org with "Development headers" option selected.

Build Tools

# Install maturin (PyO3 build tool)
pip install maturin

Building and Testing

# Build and install in development mode
maturin develop

# Build release wheel
maturin build --release

# Run Python tests
maturin develop && pytest tests/

Note: Plain cargo test does not work for this crate, because a PyO3 cdylib needs both the Python development headers and a working Python environment to build and link. Run maturin develop followed by pytest instead.

Linting

# Clippy (requires Python dev headers installed)
cargo clippy

# Format
cargo fmt

Publishing to PyPI

This package uses GitHub Actions for automated publishing to PyPI. To publish a new version:

  1. Update the version in Cargo.toml and pyproject.toml
  2. Create a new git tag: git tag v0.1.0 && git push origin v0.1.0
  3. GitHub Actions will automatically build wheels and publish to PyPI

License

This project is licensed under either of:

at your option.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
