Rust-accelerated sentence boundary detection compatible with bunkai

⚡ fast-bunkai

⚡ FastBunkai is a Python library that splits long Japanese and English texts into natural sentences. It provides an API that is highly compatible with the pure-Python megagonlabs/bunkai, while its Rust core delivers roughly 40–285× faster segmentation than the original Python implementation.

✨ Highlights

  • 🔁 Drop-in replacement: mirrors the FastBunkai / Bunkai APIs and annotations, including Janome-based morphological spans.
  • 🦀 Rust-powered core: heavy annotators (facemark, emoji, dot exceptions, indirect quotes, etc.) run inside a PyO3 module that releases the Python GIL.
  • Serious speed: the bundled benchmarks measure roughly 40×–285× faster segmentation than pure-Python bunkai (details below).
  • 🧵 Thread-safe by design: no global mutable state; calling FastBunkai concurrently from threads or asyncio tasks is supported.
  • 🛫 CLI parity: ships a fast-bunkai executable compatible with bunkai’s pipe-friendly interface and --ma morphological mode.
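
The thread-safety claim above can be exercised with a small thread pool. The following is a minimal sketch, not code from the project: `split_documents` and `naive_splitter` are hypothetical names introduced here, and `naive_splitter` is only a stand-in so the example runs without fast-bunkai installed. The only fast-bunkai API it relies on is the callable `FastBunkai()` object shown in Quick Start.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Any callable that maps a text to an iterable of sentences.
Splitter = Callable[[str], Iterable[str]]


def naive_splitter(text: str) -> List[str]:
    """Stand-in for demonstration only: cut after common sentence-final marks."""
    return [s for s in re.split(r"(?<=[。!?!?])", text) if s]


def split_documents(
    splitter: Splitter, docs: List[str], max_workers: int = 4
) -> List[List[str]]:
    """Segment each document on a worker thread, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One shared splitter instance is used by every worker thread.
        return list(pool.map(lambda doc: list(splitter(doc)), docs))


if __name__ == "__main__":
    try:
        from fast_bunkai import FastBunkai  # import shown in Quick Start

        demo_splitter: Splitter = FastBunkai()
    except ImportError:
        demo_splitter = naive_splitter  # fallback so the sketch still runs
    for sentences in split_documents(demo_splitter, ["今日は晴れ。明日は雨!", "Hello"]):
        print(sentences)
```

Sharing one splitter across workers leans on the "no global mutable state" statement above. If that guarantee is in doubt for a given version, creating one instance per thread is the conservative alternative.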

🚀 Quick Start

Install

uv pip install fast-bunkai

Python Usage

from fast_bunkai import FastBunkai

splitter = FastBunkai()
text = "羽田から✈️出発して、友だちと🍣食べました。最高!また行きたいな😂でも、予算は大丈夫かな…?"
for sentence in splitter(text):
    print(sentence)

Output:

羽田から✈️出発して、友だちと🍣食べました。
最高!
また行きたいな😂
でも、予算は大丈夫かな…?
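
Downstream code often needs character offsets rather than sentence strings. As a rough sketch, the helper below recovers `(start, end)` spans by locating each yielded sentence in the original text. `sentence_spans` is a hypothetical helper, not part of fast-bunkai, and it assumes the sentences appear in the text in order, as they do in the output above.

```python
from typing import Iterable, List, Tuple


def sentence_spans(text: str, sentences: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (start, end) code-point offsets of each sentence within text."""
    spans: List[Tuple[int, int]] = []
    cursor = 0
    for sentence in sentences:
        # Search from the end of the previous sentence so repeats resolve in order.
        start = text.find(sentence, cursor)
        if start < 0:
            raise ValueError(f"sentence not found in order: {sentence!r}")
        end = start + len(sentence)
        spans.append((start, end))
        cursor = end
    return spans
```

With fast-bunkai installed, `sentence_spans(text, splitter(text))` would give the offsets for the example above. Offsets are Python string indices (code points), so an emoji made of several code points spans more than one index.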

🧰 CLI Examples

fast-bunkai provides the same pipe-friendly command-line interface as bunkai.

echo -e '宿を予約しました♪!▁まだ2ヶ月も先だけど。▁早すぎかな(笑)楽しみです★\n2文書目です。▁改行を含みます。' \
  | uvx fast-bunkai

Output (sentence boundaries marked with │, newlines preserved via ▁):

宿を予約しました♪!▁│まだ2ヶ月も先だけど。▁│早すぎかな(笑)│楽しみです★
2文書目です。▁│改行を含みます。
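
When the CLI output is consumed by another program, each output line can be turned back into a list of sentences. This is a minimal sketch based only on the markers visible in the output above (│ between sentences, ▁ standing in for a newline); `parse_cli_line` is a hypothetical helper, not something shipped with fast-bunkai.

```python
from typing import List

SENTENCE_BOUNDARY = "│"  # marker printed between sentences
NEWLINE_MARK = "▁"  # marker standing in for a newline inside a document


def parse_cli_line(line: str) -> List[str]:
    """Split one CLI output line into sentences and restore embedded newlines."""
    parts = line.rstrip("\n").split(SENTENCE_BOUNDARY)
    return [part.replace(NEWLINE_MARK, "\n") for part in parts if part]


if __name__ == "__main__":
    sample = "2文書目です。" + NEWLINE_MARK + SENTENCE_BOUNDARY + "改行を含みます。"
    for sentence in parse_cli_line(sample):
        print(repr(sentence))
```

Each input line corresponds to one document, so reading the pipe line by line keeps documents separate.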

Morphological output is also available:

echo -e '形態素解析し▁ます。結果を 表示します!' | uvx fast-bunkai --ma
形態素	名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
▁
EOS
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
結果	名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
	記号,空白,*,*,*,*, ,*,*
表示	名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
!	記号,一般,*,*,*,*,!,!,!
EOS
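
The `--ma` output above is line-oriented: one token per line as surface and features separated by a tab, with `EOS` closing each sentence. A small parser for that layout might look like the sketch below. `parse_ma_output` is a hypothetical helper, and the format details are inferred only from the sample shown here.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, comma-separated feature string)


def parse_ma_output(output: str) -> List[List[Token]]:
    """Group tab-separated token lines into sentences delimited by EOS lines."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in output.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        if not line:
            continue  # ignore blank lines
        # Lines without a tab (such as the lone ▁ marker) get empty features.
        surface, _, features = line.partition("\t")
        current.append((surface, features))
    if current:
        sentences.append(current)  # tolerate a missing final EOS
    return sentences
```

Joining the surfaces of one sentence, for example `"".join(surface for surface, _ in sentence)`, reconstructs its text including the ▁ newline marker.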

📊 Benchmarks

Reproduce the bundled benchmark suite (correctness check + timing vs. bunkai):

uv run python scripts/benchmark.py --repeats 3 --jp-loops 100 --en-loops 100 --custom-loops 10

Latest local run (2025-10-11) reported:

| Corpus | Docs | bunkai (mean) | fast-bunkai (mean) | Speedup |
|---|---|---|---|---|
| Japanese | 200 | 253.92 ms | 5.55 ms | 45.72× |
| English | 200 | 209.77 ms | 4.94 ms | 42.48× |
| Long text* | 20 | 1330.95 ms | 4.67 ms | 285.10× |

*Long text corpus contains mixed Japanese/English paragraphs with emojis and edge cases; the Rust pipeline processes characters in a single pass, whereas pure Python bunkai stacks regex scans, so the gap widens dramatically on longer documents.

Actual numbers vary by hardware, but the Rust core consistently outperforms pure Python bunkai by an order of magnitude or more.
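
To get a feel for how such numbers are produced, here is a minimal timing sketch. It is not the bundled `scripts/benchmark.py`; `mean_ms` and `speedup` are hypothetical helpers, and the `REPORTED` values are simply copied from the table above. Ratios recomputed from the rounded means differ slightly from the reported speedups, presumably because the script divides unrounded timings.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List

Splitter = Callable[[str], Iterable[str]]


def mean_ms(splitter: Splitter, docs: List[str], repeats: int = 3) -> float:
    """Mean wall-clock time in milliseconds to segment all docs, over repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            for _sentence in splitter(doc):  # exhaust the iterator
                pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# (bunkai mean ms, fast-bunkai mean ms, reported speedup) from the table above.
REPORTED = {
    "Japanese": (253.92, 5.55, 45.72),
    "English": (209.77, 4.94, 42.48),
    "Long text": (1330.95, 4.67, 285.10),
}

if __name__ == "__main__":
    for name, (base, fast, reported) in REPORTED.items():
        print(f"{name}: {speedup(base, fast):.2f}x recomputed, {reported:.2f}x reported")
```

For a real comparison, `mean_ms` would be called once with a bunkai splitter and once with `FastBunkai()` on the same documents, and the two means passed to `speedup`.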

🧠 Architecture Snapshot

  • 🦀 Rust core (src/lib.rs): facemark & emoji annotators, dot/number exceptions, indirect quote handling, and more. Uses PyO3 abi3 bindings and releases the GIL with py.allow_threads.
  • 😀 Emoji metadata (src/emoji_data.rs): generated via scripts/generate_emoji_data.py, mapping Unicode codepoints to bunkai-compatible categories.
  • 🐍 Python layer (fast_bunkai/): wraps the Rust segment function, mirrors bunkai annotations with dataclasses, and builds Janome spans through MorphAnnotatorJanome for drop-in parity.
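
Because the Rust core releases the GIL, segmentation can be pushed to worker threads from an asyncio program without blocking the event loop. The following is a sketch under that assumption: `split_async` and `split_many` are hypothetical names, and `str.splitlines` serves only as a stand-in splitter when fast-bunkai is not installed.

```python
import asyncio
from typing import Callable, Iterable, List

Splitter = Callable[[str], Iterable[str]]


def _segment(splitter: Splitter, text: str) -> List[str]:
    """Run the splitter to completion and collect its sentences."""
    return list(splitter(text))


async def split_async(splitter: Splitter, text: str) -> List[str]:
    """Segment one text on a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(_segment, splitter, text)


async def split_many(splitter: Splitter, texts: List[str]) -> List[List[str]]:
    """Segment several texts concurrently, preserving input order."""
    return list(await asyncio.gather(*(split_async(splitter, t) for t in texts)))


if __name__ == "__main__":
    try:
        from fast_bunkai import FastBunkai  # import shown in Quick Start

        demo_splitter: Splitter = FastBunkai()
    except ImportError:
        demo_splitter = str.splitlines  # stand-in so the sketch still runs
    print(asyncio.run(split_many(demo_splitter, ["一文目です。二文目です。", "Hello"])))
```

`asyncio.to_thread` needs Python 3.9 or later, which is within the CPython 3.10+ range the wheels target.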

🛠️ Development Workflow

uv sync --reinstall
uv run python scripts/generate_emoji_data.py  # regenerate emoji table when emoji libs change
uv run tox -e pytests,lint,typecheck,rust-fmt,rust-clippy

For manual Rust checks:

cargo test face_mark_detection_matches_reference
cargo fmt --all
cargo clippy --all-targets -- -D warnings

🧪 Testing & Quality Gates

  • pytest (tests/test_compatibility.py): ensures that Japanese and English texts, emoji-heavy samples, and parallel execution all match bunkai's output.
  • 🧹 Ruff: lint + format checks via tox -e lint,format-check.
  • 🧠 Pyright: type-checks the Python API surface.
  • 🧪 Rust unit tests: validate annotator logic remains in sync with reference behaviour.
  • 📈 Benchmarks: scripts/benchmark.py validates speed + correctness; normally executed in CI to avoid long local runs.
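
The idea behind the compatibility gate can be sketched as a comparison of two splitters over the same inputs. This is not the contents of `tests/test_compatibility.py`; `find_mismatches` is a hypothetical helper that accepts any two callables returning sentences.

```python
from typing import Callable, Iterable, List, Tuple

Splitter = Callable[[str], Iterable[str]]
Mismatch = Tuple[str, List[str], List[str]]  # (text, expected, actual)


def find_mismatches(
    reference: Splitter, candidate: Splitter, texts: Iterable[str]
) -> List[Mismatch]:
    """Collect every text on which the candidate's sentences differ from the reference's."""
    mismatches: List[Mismatch] = []
    for text in texts:
        expected = list(reference(text))
        actual = list(candidate(text))
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches
```

With both packages installed, passing a bunkai splitter as `reference` and `FastBunkai()` as `candidate` should return an empty list for texts where the two agree, and a test can simply assert that.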

🙏 Acknowledgements

FastBunkai stands on the shoulders of the megagonlabs/bunkai project. Thank you very much!

📄 License

Apache License 2.0

👤 Author

Yuichi Tateno (@hotchpotch)

Source Distributions

No source distribution files available for this release.

Built Distributions

  • fast_bunkai-0.1.1-cp310-abi3-win_amd64.whl (707.0 kB): CPython 3.10+, Windows x86-64
  • fast_bunkai-0.1.1-cp310-abi3-manylinux_2_28_x86_64.whl (961.9 kB): CPython 3.10+, manylinux (glibc 2.28+) x86-64
  • fast_bunkai-0.1.1-cp310-abi3-manylinux_2_28_aarch64.whl (933.7 kB): CPython 3.10+, manylinux (glibc 2.28+) ARM64
  • fast_bunkai-0.1.1-cp310-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (1.7 MB): CPython 3.10+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
