Rust-accelerated sentence boundary detection compatible with bunkai
⚡ fast-bunkai
⚡ FastBunkai is a Python library that splits long Japanese and English texts into natural sentences. It provides an API that is highly compatible with megagonlabs/bunkai, while its Rust core delivers roughly 40–285× faster segmentation than the original pure-Python implementation.
Table of Contents
- ✨ Highlights
- 🚀 Quick Start
- 🧰 CLI Examples
- 📊 Benchmarks
- 🧠 Architecture Snapshot
- 🛠️ Development Workflow
- 🧪 Testing & Quality Gates
- 🙏 Acknowledgements
- 📄 License
- 👤 Author
✨ Highlights
- 🔁 Drop-in replacement: mirrors the `FastBunkai` / `Bunkai` APIs and annotations, including Janome-based morphological spans.
- 🦀 Rust-powered core: heavy annotators (facemark, emoji, dot exceptions, indirect quotes, etc.) run inside a PyO3 module that releases the Python GIL.
- ⚡ Serious speed: the bundled benchmarks measure 40×–285× faster segmentation than pure Python bunkai (details below).
- 🧵 Thread-safe by design: no global mutable state; calling `FastBunkai` concurrently from threads or asyncio tasks is supported.
- 🛫 CLI parity: ships a `fast-bunkai` executable compatible with bunkai's pipe-friendly interface and `--ma` morphological mode.
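The thread-safety point can be illustrated with a small sketch. It assumes only what the highlights state: a splitter is a callable that takes a string and yields sentences, and one instance may be shared across threads. The helper name `split_many` and its parameters are hypothetical and not part of fast-bunkai.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def split_many(
    splitter: Callable[[str], Iterable[str]],
    documents: List[str],
    max_workers: int = 4,
) -> List[List[str]]:
    """Split every document on a worker thread, sharing one splitter instance.

    `split_many` is a hypothetical helper for illustration only.
    """

    def run(document: str) -> List[str]:
        # Materialise the iterator inside the worker so the work happens there.
        return list(splitter(document))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(run, documents))
```

With the package installed, a call would presumably look like `split_many(FastBunkai(), docs)`. Because the Rust core is described as releasing the GIL, worker threads should be able to overlap on the segmentation work, though the actual gain depends on document size and hardware.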
🚀 Quick Start
Install
uv pip install fast-bunkai
Python Usage
from fast_bunkai import FastBunkai
splitter = FastBunkai()
text = "羽田から✈️出発して、友だちと🍣食べました。最高!また行きたいな😂でも、予算は大丈夫かな…?"
for sentence in splitter(text):
    print(sentence)
Output:
羽田から✈️出発して、友だちと🍣食べました。
最高!
また行きたいな😂
でも、予算は大丈夫かな…?
🧰 CLI Examples
fast-bunkai provides the same pipe-friendly command-line interface as bunkai.
echo -e '宿を予約しました♪!▁まだ2ヶ月も先だけど。▁早すぎかな(笑)楽しみです★\n2文書目です。▁改行を含みます。' \
| uvx fast-bunkai
Output (sentence boundaries marked with │, newlines preserved via ▁):
宿を予約しました♪!▁│まだ2ヶ月も先だけど。▁│早すぎかな(笑)│楽しみです★
2文書目です。▁│改行を含みます。
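When the CLI output is consumed by another program, each line can be turned back into a list of sentences. The following is a minimal sketch based only on the format shown above (`│` between sentences, `▁` standing in for a newline); the function name `parse_cli_line` is hypothetical.

```python
from typing import List

BOUNDARY = "│"      # printed between sentences in the CLI output
NEWLINE_MARK = "▁"  # placeholder the CLI uses for a newline inside a document


def parse_cli_line(line: str, restore_newlines: bool = True) -> List[str]:
    """Split one line of CLI output (one document) into its sentences."""
    sentences = [part for part in line.rstrip("\n").split(BOUNDARY) if part]
    if restore_newlines:
        # Turn the placeholder back into a real newline character.
        sentences = [part.replace(NEWLINE_MARK, "\n") for part in sentences]
    return sentences
```

Passing `restore_newlines=False` keeps the `▁` placeholders, which is convenient if the sentences are written back out one per line.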
Morphological output is also available:
echo -e '形態素解析し▁ます。結果を 表示します!' | uvx fast-bunkai --ma
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
▁
EOS
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
結果 名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
記号,空白,*,*,*,*, ,*,*
表示 名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
! 記号,一般,*,*,*,*,!,!,!
EOS
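The `--ma` output can be grouped back into sentences by treating each `EOS` line as a sentence terminator. The sketch below assumes the surface form and the feature list are separated by a tab, as in MeCab-style output; the listing above does not show the separator unambiguously, so treat that as an assumption. The function name `parse_ma_output` is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, List[str]]  # (surface form, feature fields)


def parse_ma_output(output: str) -> List[List[Token]]:
    """Group morphological output lines into sentences, split at EOS lines."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in output.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        # Assumption: a tab separates the surface form from its features.
        surface, separator, features = line.partition("\t")
        # Lines without features (such as the newline placeholder) get [].
        current.append((surface, features.split(",") if separator else []))
    if current:
        # Keep trailing tokens even if the final EOS is missing.
        sentences.append(current)
    return sentences
```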
📊 Benchmarks
Reproduce the bundled benchmark suite (correctness check + timing vs. bunkai):
uv run python scripts/benchmark.py --repeats 3 --jp-loops 100 --en-loops 100 --custom-loops 10
Latest local run (2025-10-11) reported:
| Corpus | Docs | bunkai (mean) | fast-bunkai (mean) | Speedup |
|---|---|---|---|---|
| Japanese | 200 | 253.92 ms | 5.55 ms | 45.72× |
| English | 200 | 209.77 ms | 4.94 ms | 42.48× |
| Long text* | 20 | 1330.95 ms | 4.67 ms | 285.10× |
*Long text corpus contains mixed Japanese/English paragraphs with emojis and edge cases; the Rust pipeline processes characters in a single pass, whereas pure Python bunkai stacks regex scans, so the gap widens dramatically on longer documents.
Actual numbers vary by hardware, but the Rust core consistently outperforms pure Python bunkai by an order of magnitude or more.
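As a quick sanity check, the speedup column can be recomputed from the rounded means in the table. This is only arithmetic on the published figures, not a re-run of the benchmark. The recomputed ratios agree with the published column to within about 0.1, and the small differences presumably come from the table being derived from unrounded timings.

```python
# (bunkai mean ms, fast-bunkai mean ms, published speedup), copied from the table
ROWS = {
    "Japanese": (253.92, 5.55, 45.72),
    "English": (209.77, 4.94, 42.48),
    "Long text": (1330.95, 4.67, 285.10),
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio of baseline time to candidate time (higher means faster)."""
    return baseline_ms / candidate_ms


for name, (bunkai_ms, fast_ms, published) in ROWS.items():
    ratio = speedup(bunkai_ms, fast_ms)
    print(f"{name}: recomputed {ratio:.2f}x, published {published:.2f}x")
```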
🧠 Architecture Snapshot
- 🦀 Rust core (`src/lib.rs`): facemark & emoji annotators, dot/number exceptions, indirect quote handling, and more. Uses PyO3 `abi3` bindings and releases the GIL with `py.allow_threads`.
- 😀 Emoji metadata (`src/emoji_data.rs`): generated via `scripts/generate_emoji_data.py`, mapping Unicode codepoints to bunkai-compatible categories.
- 🐍 Python layer (`fast_bunkai/`): wraps the Rust `segment` function, mirrors bunkai annotations with dataclasses, and builds Janome spans through `MorphAnnotatorJanome` for drop-in parity.
🛠️ Development Workflow
uv sync --reinstall
uv run python scripts/generate_emoji_data.py # regenerate emoji table when emoji libs change
uv run tox -e pytests,lint,typecheck,rust-fmt,rust-clippy
For manual Rust checks:
cargo test face_mark_detection_matches_reference
cargo fmt --all
cargo clippy --all-targets -- -D warnings
🧪 Testing & Quality Gates
- ✅ pytest (`tests/test_compatibility.py`): ensures Japanese and English texts, emoji-heavy samples, and parallel execution match bunkai outputs.
- 🧹 Ruff: lint + format checks via `tox -e lint,format-check`.
- 🧠 Pyright: type-checks the Python API surface.
- 🧪 Rust unit tests: validate that annotator logic remains in sync with reference behaviour.
- 📈 Benchmarks: `scripts/benchmark.py` validates speed + correctness; normally executed in CI to avoid long local runs.
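The idea behind the compatibility gate can be sketched as a comparison of two splitters over the same inputs. This is not the project's actual test code. The helper `first_mismatch` is hypothetical and only assumes that both splitters are callables that take a string and yield sentences.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Splitter = Callable[[str], Iterable[str]]


def first_mismatch(
    reference: Splitter,
    candidate: Splitter,
    texts: List[str],
) -> Optional[Tuple[str, List[str], List[str]]]:
    """Return (text, expected, actual) for the first disagreement, else None."""
    for text in texts:
        expected = list(reference(text))
        actual = list(candidate(text))
        if expected != actual:
            return text, expected, actual
    return None
```

In a real check the reference would be a bunkai `Bunkai()` instance and the candidate a `FastBunkai()` instance, with a pytest assertion that the helper returns `None` for every sample corpus.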
🙏 Acknowledgements
FastBunkai stands on the shoulders of the megagonlabs/bunkai project. Thank you!
📄 License
Apache License 2.0
👤 Author
Yuichi Tateno (@hotchpotch)
File details
Details for the file fast_bunkai-0.1.1-cp310-abi3-win_amd64.whl.
File metadata
- Download URL: fast_bunkai-0.1.1-cp310-abi3-win_amd64.whl
- Upload date:
- Size: 707.0 kB
- Tags: CPython 3.10+, Windows x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.9.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d33f859db83ff14e9c6a21b8acda516649f20d3d71be0788bdb06e4d9c574afc |
| MD5 | f889628e2a1d84b022dc31f24e76093f |
| BLAKE2b-256 | e9421d78cffeb4fa941388ef0a8a4df1b16c172164e1e0c0bda810e8c51f6594 |
File details
Details for the file fast_bunkai-0.1.1-cp310-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: fast_bunkai-0.1.1-cp310-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 961.9 kB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.9.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c35d36ab302484ec0c47b323acbb6814cd5bd2a9138c965d8b0ebbd55a70c57a |
| MD5 | e63c1a2768f55e9599be3148fa9f3415 |
| BLAKE2b-256 | 614708234de4b17706af6def621515a3764ee235bbe0cc0efe7521230df51180 |
File details
Details for the file fast_bunkai-0.1.1-cp310-abi3-manylinux_2_28_aarch64.whl.
File metadata
- Download URL: fast_bunkai-0.1.1-cp310-abi3-manylinux_2_28_aarch64.whl
- Upload date:
- Size: 933.7 kB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.9.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b700a41d64b9dfac702f4323d465677f1939ad6d424cc7fb3ba9c4a5aa06d5c |
| MD5 | 5f7e167a0c57a5bbc2832050e425028d |
| BLAKE2b-256 | 2090b8741642ce6f6616af41c951f03535c49001028ffde50aeb6a3a5f7781ad |
File details
Details for the file fast_bunkai-0.1.1-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.
File metadata
- Download URL: fast_bunkai-0.1.1-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.10+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.9.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0fa684fc542266808ccfac8742f1b95c59a864bd0f06a94b5cbfe8adb6955c2 |
| MD5 | 40566b12c2c061f3b7978d49369cda57 |
| BLAKE2b-256 | 8867039584892f226e677544455981db3b540fa9948a3177997eb19fe7a553aa |