Korean Morphological Analyzer - MeCab-Ko Python Bindings
mecab-ko-python
Python bindings for MeCab-Ko (Korean morphological analyzer)
Overview
This package provides Python bindings for MeCab-Ko, a Korean morphological analyzer implemented in Rust. The API is compatible with KoNLPy's Mecab interface, so code written against KoNLPy can use the faster Rust implementation through the same method calls.
Features
- Fast: Rust-based implementation with zero-copy parsing
- Memory-efficient: Optimized data structures for Korean text processing
- Thread-safe: Safe concurrent operations
- KoNLPy-compatible: Drop-in replacement for KoNLPy's Mecab
- Type hints: Full type annotation support for better IDE integration
Installation
From PyPI (Recommended)
pip install mecab-ko-python
Pre-built wheels are available for:
- Linux (x86_64, aarch64)
- macOS (x86_64, Apple Silicon)
- Windows (x86_64)
From Source
If you need to build from source:
# Install Rust toolchain (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install maturin
pip install maturin
# Build and install
git clone https://github.com/hephaex/mecab-ko.git
cd mecab-ko/rust/crates/mecab-ko-python
maturin develop --release
Usage
from mecab_ko import Mecab
# Create tokenizer instance
mecab = Mecab()
# Extract morphemes
morphemes = mecab.morphs("안녕하세요")
print(morphemes)
# ['안녕', '하', '세요']
# Extract nouns
nouns = mecab.nouns("아버지가방에들어가신다")
print(nouns)
# ['아버지', '가방']
# Part-of-speech tagging
tagged = mecab.pos("나는 학생입니다")
print(tagged)
# [('나', 'NP'), ('는', 'JX'), ('학생', 'NNG'), ('이', 'VCP'), ('ㅂ니다', 'EF')]
# MeCab format output
result = mecab.parse("안녕하세요")
print(result)
# 안녕 NNG,*,*,안녕,*,*,*,*
# 하 XSV,*,*,하,*,*,*,*
# 세요 EF,*,*,세요,*,*,*,*
# EOS
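As a small illustration of how these calls combine in practice, the sketch below counts noun frequencies across several sentences. The helper `noun_frequencies` and the `StubTokenizer` class are hypothetical: the stub returns canned results only so that the example runs without the package installed. With the real package you would pass `Mecab()` instead.

```python
from collections import Counter
from typing import Iterable, Protocol


class NounExtractor(Protocol):
    """Anything that exposes a KoNLPy-style nouns() method."""

    def nouns(self, text: str) -> list[str]: ...


def noun_frequencies(tokenizer: NounExtractor, sentences: Iterable[str]) -> Counter:
    """Count how often each noun appears across the given sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(tokenizer.nouns(sentence))
    return counts


class StubTokenizer:
    """Hypothetical stand-in for Mecab: canned answers, not real analysis."""

    _canned = {
        "아버지가방에들어가신다": ["아버지", "가방"],
        "아버지가 오신다": ["아버지"],
    }

    def nouns(self, text: str) -> list[str]:
        return self._canned.get(text, [])


if __name__ == "__main__":
    sentences = ["아버지가방에들어가신다", "아버지가 오신다"]
    print(noun_frequencies(StubTokenizer(), sentences).most_common())
```

Because the helper depends only on a `nouns()` method, the same function should work with either KoNLPy's Mecab or this package's Mecab.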
API Reference
Mecab(dicpath=None)
Create a new Mecab tokenizer instance.
Parameters:
dicpath (str, optional): Path to the dictionary directory
Returns:
Mecab: Tokenizer instance
mecab.morphs(text)
Extract morphemes from text.
Parameters:
text (str): Input text
Returns:
list[str]: List of morphemes
mecab.nouns(text)
Extract nouns from text.
Parameters:
text (str): Input text
Returns:
list[str]: List of nouns
mecab.pos(text)
Perform part-of-speech tagging.
Parameters:
text (str): Input text
Returns:
list[tuple[str, str]]: List of (surface, pos_tag) tuples
mecab.parse(text)
Parse text and return MeCab format output.
Parameters:
text (str): Input text
Returns:
str: MeCab format string, one line per morpheme (surface form, a tab, then comma-separated features), terminated by an EOS line
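To make the output format concrete, here is a minimal sketch that splits such a string back into structured records. It assumes the conventional MeCab layout (surface form, a tab, then comma-separated features, with a final `EOS` line). `parse_mecab_output` is a hypothetical helper, not part of the package, and the sample string simply mirrors the usage example above.

```python
def parse_mecab_output(output: str) -> list[tuple[str, list[str]]]:
    """Split MeCab-format text into (surface, features) records.

    Assumes each morpheme line is "<surface>\t<comma-separated features>"
    and that "EOS" lines mark sentence ends.
    """
    records = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        surface, sep, feature_str = line.partition("\t")
        if not sep:
            continue  # line does not follow the assumed layout; skip it
        # Naive split: a feature containing a quoted comma would be broken up.
        records.append((surface, feature_str.split(",")))
    return records


# Sample text shaped like the parse() output shown in the Usage section.
sample = (
    "안녕\tNNG,*,*,안녕,*,*,*,*\n"
    "하\tXSV,*,*,하,*,*,*,*\n"
    "세요\tEF,*,*,세요,*,*,*,*\n"
    "EOS\n"
)

if __name__ == "__main__":
    for surface, features in parse_mecab_output(sample):
        print(surface, features[0])
```

In this layout the first feature is the POS tag, so `features[0]` corresponds to the tag that `pos()` returns.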
Korean POS Tags
The analyzer uses the Sejong POS tag set:
- NNG: General noun (일반 명사)
- NNP: Proper noun (고유 명사)
- NP: Pronoun (대명사)
- VV: Verb (동사)
- VA: Adjective (형용사)
- JX: Auxiliary particle (보조사)
- JKS: Subject particle (주격조사)
- JKO: Object particle (목적격조사)
- EF: Final ending (종결어미)
- And many more...
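The tags are most useful for filtering `pos()` output. The sketch below keeps only tokens whose tag is in a chosen set and attaches a description to each tag. `TAG_DESCRIPTIONS`, `filter_by_tags`, and `describe` are hypothetical names. The table covers only the tags listed here, and matching is exact, so any unlisted or compound tag falls through to a default.

```python
# Descriptions for the subset of tags listed in this section only.
TAG_DESCRIPTIONS = {
    "NNG": "General noun",
    "NNP": "Proper noun",
    "NP": "Pronoun",
    "VV": "Verb",
    "VA": "Adjective",
    "JX": "Auxiliary particle",
    "JKS": "Subject particle",
    "JKO": "Object particle",
    "EF": "Final ending",
}


def filter_by_tags(tagged, wanted):
    """Return surfaces whose POS tag is exactly one of `wanted`."""
    return [surface for surface, tag in tagged if tag in wanted]


def describe(tagged):
    """Attach a description to each (surface, tag) pair."""
    return [
        (surface, tag, TAG_DESCRIPTIONS.get(tag, "unlisted tag"))
        for surface, tag in tagged
    ]


# Shaped like the pos() output shown in the Usage section.
sample_tagged = [("나", "NP"), ("는", "JX"), ("학생", "NNG"), ("이", "VCP"), ("ㅂ니다", "EF")]

if __name__ == "__main__":
    print(filter_by_tags(sample_tagged, {"NNG", "NNP", "NP"}))
    for row in describe(sample_tagged):
        print(row)
```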
Performance
The Rust implementation aims to improve on the performance of the original C++ implementation through:
- Fast tokenization with zero-copy parsing
- Memory-efficient data structures
- Thread-safe operations
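If the tokenizer is thread-safe as stated, one instance can be shared across worker threads. The following is a minimal sketch of that pattern using the standard library. `tokenize_concurrently` and `WhitespaceStub` are hypothetical, and the stub only splits on whitespace so the example runs without the package. Whether threads yield a real speedup depends on whether the binding releases the GIL during analysis, which is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_concurrently(tokenizer, texts, max_workers=4):
    """Run tokenizer.morphs over texts on a thread pool.

    Assumes the shared tokenizer instance is safe to call from several
    threads at once. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenizer.morphs, texts))


class WhitespaceStub:
    """Hypothetical stand-in for Mecab: splits on whitespace only."""

    def morphs(self, text):
        return text.split()


if __name__ == "__main__":
    texts = ["나는 학생입니다", "안녕하세요"]
    print(tokenize_concurrently(WhitespaceStub(), texts, max_workers=2))
```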
Migration from KoNLPy
If you're currently using KoNLPy's Mecab, you can migrate with minimal changes:
# Before (KoNLPy)
from konlpy.tag import Mecab
mecab = Mecab()
# After (mecab-ko-python)
from mecab_ko import Mecab
mecab = Mecab()
# The API is identical
mecab.morphs("안녕하세요")
mecab.nouns("아버지가방에들어가신다")
mecab.pos("나는 학생입니다")
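During a gradual migration it can help to prefer `mecab_ko` when it is installed and fall back to KoNLPy otherwise. The sketch below shows one way to do that. `load_tokenizer_class` is a hypothetical helper, and the module and class names are the ones used in the snippets above.

```python
import importlib

# Candidates in order of preference: (module name, class name).
DEFAULT_CANDIDATES = (
    ("mecab_ko", "Mecab"),
    ("konlpy.tag", "Mecab"),
)


def load_tokenizer_class(candidates=DEFAULT_CANDIDATES):
    """Return the first importable class from `candidates`.

    Raises ImportError if none of the modules can be imported.
    """
    for module_name, class_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # not installed; try the next candidate
        cls = getattr(module, class_name, None)
        if cls is not None:
            return cls
    tried = ", ".join(name for name, _ in candidates)
    raise ImportError(f"no tokenizer backend available (tried: {tried})")
```

Calling `load_tokenizer_class()` with no arguments tries `mecab_ko` first and then `konlpy.tag`; the returned class can be instantiated and used as in the snippets above.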
Development Requirements
This crate uses PyO3 to create Python bindings. Building requires Python development headers.
System Dependencies
Ubuntu/Debian:
sudo apt install python3-dev
Fedora/RHEL:
sudo dnf install python3-devel
macOS (with Homebrew):
brew install python
Windows: Install Python from python.org. The standard installer normally includes the headers and import libraries needed to build extensions.
Build Tools
# Install maturin (PyO3 build tool)
pip install maturin
Building and Testing
# Build and install in development mode
maturin develop
# Build release wheel
maturin build --release
# Run Python tests
maturin develop && pytest tests/
Note: Standard cargo test does not work for this crate because PyO3 cdylib requires Python development headers and a proper Python environment. Use maturin develop followed by pytest instead.
Linting
# Clippy (requires Python dev headers installed)
cargo clippy
# Format
cargo fmt
Publishing to PyPI
This package uses GitHub Actions for automated publishing to PyPI. To publish a new version:
- Update the version in Cargo.toml and pyproject.toml
- Create a new git tag: git tag v0.1.0 && git push origin v0.1.0
- GitHub Actions will automatically build wheels and publish to PyPI
License
This project is licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file mecab_ko_python-0.4.0.tar.gz.
File metadata
- Download URL: mecab_ko_python-0.4.0.tar.gz
- Upload date:
- Size: 385.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 68c8d13ce573da733a5bde6f485c914c8777150b6aeae695bb03da6316f7f58e |
| MD5 | bd488f4225dd8cbbbceec90a258c9bec |
| BLAKE2b-256 | 033640b7d95935e208a2f9f38379ec9d714db6014cc4274b79b4fa47b447c8f8 |
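A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using `hashlib`. The function names are hypothetical, and the path passed in is whatever location you saved the file to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

For example, `matches_published_digest("mecab_ko_python-0.4.0.tar.gz", "<SHA256 from the table>")` should return True for an intact download of the source distribution.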
File details
Details for the file mecab_ko_python-0.4.0-cp313-cp313-macosx_11_0_arm64.whl.
File metadata
- Download URL: mecab_ko_python-0.4.0-cp313-cp313-macosx_11_0_arm64.whl
- Upload date:
- Size: 358.5 kB
- Tags: CPython 3.13, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f3e206cacc2fa72099c299f52b612245bcbe34fddd429f0c64b95ce96e20ec4 |
| MD5 | 71833af2d6b9cf4da0f802b515a0a63e |
| BLAKE2b-256 | 35ed7df4c63068584b9d7eb36ece85ecf84a811678f2e46c40a0df74cedab07d |