A high-accuracy, from-scratch Sentence Boundary Detector (SBD) for production pipelines. Features a drop-in adapter for pysbd to fix edge cases without heavy refactoring.
“Even a pair of scissors deserves to be smart. Welcome to cybernetic boundary shearing.”
[!WARNING] This project is currently in alpha.
Manifesto
Yet Another Sentence Boundary Detector (yasbd) is a pair of smart scissors for text. Pointer-based, from-scratch SBD for production pipelines. Features a drop-in adapter for pysbd to fix edge cases without heavy refactoring. Five languages supported today (en, fr, es, ht, ja). Target is 22+.
✂ Why do I need a pair of "smart scissors" for text?
Because the usual alternative is running re.split(r'\.\s+[A-Z]') and praying. This blunt tool instantly shears titles like Mr. Smith or French corporate markers like Sté. Générale in half, scattering semantic fragments across your pipeline.
Punctuation is the most overloaded glyph set in text. A period alone does six jobs and only one is "sentence end." Generic split-on-punctuation fails on:
- Dr. Inc. U.S.A. (abbreviation markers, not boundaries. ~47% of periods in news text are these)
- 3.5M 3.14 (decimal points, not sentence ends)
- D. H. Lawrence (initials. Two periods, zero boundaries)
- ... (ellipsis. Trailing off or sentence end? Ambiguous)
- 1. a. at line start (inline list markers impersonating sentence ends)
- ?! inside quotes (punctuation nesting across boundaries)
And multilingual quirks a naive splitter never saw coming.
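To make the failure concrete, here is a small demonstration using only the standard library. The sample sentence is made up for illustration; the first pattern is the one quoted above, and the second is a slightly more careful lookaround variant that still fails on titles.

```python
import re

text = "Mr. Smith met Dr. Jones. They talked."

# The pattern quoted above consumes the period, the whitespace, and the
# capital letter that follows, so every piece comes back mutilated.
naive = re.split(r"\.\s+[A-Z]", text)
print(naive)
# ['Mr', 'mith met Dr', 'ones', 'hey talked.']

# Lookarounds keep all characters, but the splitter still cuts after titles.
kept = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
print(kept)
# ['Mr.', 'Smith met Dr.', 'Jones.', 'They talked.']
```

Two real sentences go in, four fragments come out. No amount of pattern tweaking fixes this without knowing that "Mr." and "Dr." are abbreviations.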
🔪 Are these shears just a rusty regex loop spray-painted in carbon fiber?
Regex is how I cut. Not what I am. My brain is a two-pass pipeline. Pass one finds every possible boundary, greedy and over-inclusive. Pass two surgically removes false positives by cross-referencing 150+ curated abbreviations across 8 semantic categories, checking context before and after each candidate. Quote spans, parentheses, list markers, ellipsis, contiguous terminators -- each gets its own refiner.
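The two-pass idea can be sketched in a few lines. This is a toy illustration, not yasbd's actual implementation: the names `ABBREVIATIONS`, `find_candidates`, `is_false_positive`, and `boundaries` are hypothetical, the abbreviation set is tiny, and the only context check is the word before the terminator.

```python
import re

# Tiny illustrative set; the real library curates 150+ entries.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "sr", "sra"}

# Pass one: any run of terminators followed by whitespace or end of text.
CANDIDATE = re.compile(r"[.!?]+(?=\s|$)")


def find_candidates(text):
    """Return the end offset of every possible boundary (over-inclusive)."""
    return [m.end() for m in CANDIDATE.finditer(text)]


def is_false_positive(text, end):
    """Pass two: reject candidates that follow an abbreviation or an initial."""
    words = text[:end].rstrip(".!?").split()
    prev = words[-1] if words else ""
    if prev.lower() in ABBREVIATIONS:
        return True
    # A single capital letter before the period is treated as an initial.
    return len(prev) == 1 and prev.isupper()


def boundaries(text):
    return [e for e in find_candidates(text) if not is_false_positive(text, e)]


print(boundaries("Dr. Smith met D. H. Lawrence. They paid 3.5M."))
# [29, 45]
```

Pass one proposes five candidates here; pass two discards the three that follow "Dr", "D", and "H". The decimal point in 3.5M never becomes a candidate because no whitespace follows it. The single-capital rule is deliberately crude (it would also swallow a sentence ending in "I."), which is exactly why a real refiner needs richer context.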
📦 Installation
Ready to do some cybernetic boundary shearing? Let's get you set up quickly and painlessly.
The Quick & Easy Way
The simplest way to get started is with pip:
pip install yasbd-lib
[!TIP] Termux (Android)
No Rust toolchain? Install pydantic-core pre-built wheels first, then retry:
pip install typing-extensions
pip install pydantic-core --index-url https://termux-user-repository.github.io/pypi/
pip install "pydantic>=2.12.4,<2.13"
That's it! Blade is armed.
The From-Source Way
Prefer building from source? Clone and install manually for full control:
git clone https://github.com/speedyk-005/yasbd-lib.git
cd yasbd-lib
pip install .
(But honestly, the pip way is way easier.)
Want to Help Make yasbd Even Better?
That's awesome. See Contributing Guide.
📟 Usage
[!TIP] Looking for the pysbd drop-in replacement? Jump straight to the Adapter section.
Initialization
from yasbd.boundary_detector import BoundaryDetector
# Or from yasbd import BoundaryDetector
# Basic setup
detector = BoundaryDetector(lang="en")
# With all options (so far)
detector = BoundaryDetector(
    # ISO 639 code (e.g., en, fr, es, ...). Defaults to `en`.
    # https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
    lang="fr",
    # Don't split inside quotes and parentheses. (Block quotes are not protected.) Defaults to `True`.
    # https://en.wikipedia.org/wiki/Block_quotation
    preserve_quote_and_paren=True,
    # Enable verbose logging. Defaults to `False`.
    verbose=True,
)
Switching languages at runtime is a property set:
detector.lang = "es"
The rule module loads lazily on first access. Switching mid-stream reimports the module and rebinds the pattern cache. Zero config, no restarts needed.
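The lazy-load-and-rebind behaviour can be pictured with a property setter that invalidates a cached pattern. This is a minimal sketch under assumptions: `LazyDetector`, `RULES`, and `compile_count` are hypothetical names invented for the illustration, and a plain dict of regex strings stands in for the per-language rule modules.

```python
import re

# Hypothetical stand-in for per-language rule modules.
RULES = {"en": r"[.!?]+", "es": r"[.!?]+", "ja": r"[。！？]+"}


class LazyDetector:
    def __init__(self, lang="en"):
        self._lang = lang
        self._pattern = None      # nothing compiled until first access
        self.compile_count = 0    # instrumentation for the demo only

    @property
    def lang(self):
        return self._lang

    @lang.setter
    def lang(self, value):
        if value not in RULES:
            raise ValueError(f"unsupported language: {value}")
        if value != self._lang:
            self._lang = value
            self._pattern = None  # drop the cache; rebuilt on next access

    @property
    def pattern(self):
        if self._pattern is None:
            self._pattern = re.compile(RULES[self._lang])
            self.compile_count += 1
        return self._pattern


d = LazyDetector("en")
d.pattern, d.pattern          # compiled once, then served from cache
d.lang = "es"
d.pattern                     # switching forces exactly one recompile
print(d.compile_count)
# 2
```

The design choice worth noting is that the setter does no work itself. It only invalidates, so switching languages several times in a row costs nothing until a pattern is actually needed.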
Boundary detection
detect() tells you where each sentence stops. Integer offsets into the original string. No copies, no slicing, no bookkeeping. Feed them to whatever downstream logic you already have.
Two detection modes:
- absolute: (default) offsets count from the start of the entire input stream.
- relative: offsets reset at each paragraph boundary. A ParagraphEOF sentinel signals the gap between paragraphs.
# absolute mode (default)
res = list(detector.detect('She turned to him, "This is great." She held the book out to show him.'))
print(res)
# [35, 70]
# relative mode with paragraph break
detector.lang = "es"
res = list(detector.detect(
    "El Sr. García llegó ayer. La Sra. López también.\n\nVéase la pág. 55 del libro.",
    relative=True,
))
print(res)
# [25, 48, ParagraphEOF, 27]
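If you do want to consume the offsets yourself, absolute-mode output maps onto plain slices. The sketch below assumes each offset is the exclusive end index of a sentence, which matches the [35, 70] result shown above for a 70-character input. `slice_by_offsets` is a hypothetical helper, not part of yasbd, and the offsets are hard-coded so the snippet runs without the library installed.

```python
def slice_by_offsets(text, offsets):
    """Yield consecutive slices of text ending at each absolute offset."""
    start = 0
    for end in offsets:
        yield text[start:end]
        start = end


text = 'She turned to him, "This is great." She held the book out to show him.'
for piece in slice_by_offsets(text, [35, 70]):
    print(repr(piece))
# 'She turned to him, "This is great."'
# ' She held the book out to show him.'
```

Note that the second slice keeps its leading space. Deciding what to do with whitespace around boundaries is left to the caller in this mode.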
Segmentation
If you do not want to manage boundary offsets yourself (and who would?), segment() wraps detect() with string slicing. It yields sentences as strings, one at a time. By default it strips leading and trailing whitespace and drops empty results. Set preserve_whitespace=True to keep original spacing around boundaries.
detector.lang = "en"
# Basic sentence splitting
res = list(detector.segment("Hello world. How are you? I am fine."))
print(res)
# ['Hello world.', 'How are you?', 'I am fine.']
# Multi-paragraph with whitespace preserved
res = list(detector.segment(
    "First para.\nStill first.\n\nSecond para.\nFinished.",
    preserve_whitespace=True,
))
print(res)
# ['First para.', '\nStill first.', '\n\n', 'Second para.', '\nFinished.']
[!TIP] Inputs & streaming: detect() and segment() accept plain strings, open file streams (TextIOBase), or a StreamCleaner. Both are generators: they yield results lazily without loading the entire source into memory. Internally, the text is split on blank lines into paragraphs, and each paragraph is processed independently with offset tracking between them.
[!TIP] ParagraphStream: yasbd uses ParagraphStream internally to split text into paragraph blocks. You can import it directly if you need paragraph-level processing in your own code:
from yasbd.utils.paragraph_stream import ParagraphStream
for para in ParagraphStream(text):
    print(para)  # each paragraph block
You can also skip empty lines with skip_empty_lines=True.
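To picture how blank-line splitting and offset tracking can work lazily on a stream, here is a standard-library sketch. It is not the ParagraphStream implementation: `iter_paragraphs` is a hypothetical generator written for this illustration, and it reads one line at a time so the full source never has to sit in memory.

```python
import io


def iter_paragraphs(stream):
    """Lazily yield (start_offset, paragraph) pairs from a text stream."""
    offset = 0   # characters consumed so far
    start = 0    # offset where the current paragraph began
    buf = []
    for line in stream:
        if line.strip():
            if not buf:
                start = offset
            buf.append(line)
        elif buf:
            # A blank line closes the paragraph collected so far.
            yield start, "".join(buf).rstrip("\n")
            buf = []
        offset += len(line)
    if buf:
        yield start, "".join(buf).rstrip("\n")


source = "First para.\nStill first.\n\nSecond para.\nFinished."
for start, para in iter_paragraphs(io.StringIO(source)):
    print(start, repr(para))
# 0 'First para.\nStill first.'
# 26 'Second para.\nFinished.'
```

Carrying the start offset alongside each paragraph is what lets per-paragraph results be translated back into positions in the whole input.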
Cleaner
OCR'd a PDF or scraping noisy text? StreamCleaner normalizes paragraphs before they hit the detector:
from yasbd.utils.cleaner import StreamCleaner
cleaner = StreamCleaner("Hello world. This is messy.")
list(cleaner)
# ['Hello world. This is messy.']
It collapses multiple spaces, strips HTML tags, removes page numbers, re-joins hyphenated words split across lines, and more. Pass it directly to detect() or segment() instead of a string.
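For a feel of what this kind of normalization involves, here is a rough standard-library sketch covering three of the steps listed above. It is an assumption-laden toy, not StreamCleaner's code: `clean_paragraph` is a hypothetical function, and the regexes are deliberately simple.

```python
import re


def clean_paragraph(text):
    """Apply a few naive normalization steps to one paragraph."""
    text = re.sub(r"<[^>]+>", "", text)           # strip HTML-like tags
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # re-join words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    return text.strip()


print(clean_paragraph("<p>Hello   world. This is mes-\nsy.</p>"))
# Hello world. This is messy.
```

The hyphen rule shows why cleaning is harder than it looks: this version would also turn a genuine compound broken at its hyphen, such as "well-" followed by "known", into "wellknown".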
Adapter
Migrating from pysbd? Swap the import and keep your pipeline:
# Before: from pysbd import Segmenter
from yasbd.utils.pysbd_adapter import Segmenter
seg = Segmenter(language="ja")
res = seg.segment('田中さんは「準備は完了しました」そう言って部屋を出た。U.S.A.の経済政策は非常に複雑です。')
print(res)
# ['田中さんは「準備は完了しました」そう言って部屋を出た。', 'U.S.A.の経済政策は非常に複雑です。']
Same API surface. Same Segmenter class. Same segment() method. Even the TextSpan class is there with sent, start, and end fields, hurray. It also handles leading whitespace the way pysbd expects it (trailing on the previous sentence instead of leading on the next).
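The whitespace convention is easier to see in code. The sketch below assumes end-exclusive sentence offsets and uses a hypothetical `Span` dataclass as a stand-in for TextSpan, with the same sent, start, and end fields. `spans_with_trailing_whitespace` is an illustrative helper, not the adapter's implementation.

```python
from dataclasses import dataclass


@dataclass
class Span:  # stand-in for TextSpan, for illustration only
    sent: str
    start: int
    end: int


def spans_with_trailing_whitespace(text, offsets):
    """Attach inter-sentence whitespace to the end of the previous sentence."""
    spans = []
    start = 0
    for end in offsets:
        # Extend the boundary over any whitespace that follows it.
        while end < len(text) and text[end].isspace():
            end += 1
        if end > start:
            spans.append(Span(text[start:end], start, end))
        start = end
    return spans


text = "Hello world. How are you? I am fine."
for span in spans_with_trailing_whitespace(text, [12, 25, 36]):
    print(span)
# Span(sent='Hello world. ', start=0, end=13)
# Span(sent='How are you? ', start=13, end=26)
# Span(sent='I am fine.', start=26, end=36)
```

Because every character lands in exactly one span, concatenating the sent fields reproduces the original text.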
🗺 Features & Roadmap
- Regex caching (compile once per language class)
- Drop-in pysbd adapter (same API, no pipeline changes)
- StreamCleaner for OCR'd and noisy text
- spaCy integration
- 22+ language targets
- CLI tool
- REST API for remote boundary detection
🏁 Benchmarks
Tested against 6 competitors (pysbd, sentencex, sentsplit, nupunkt, blingfire, sentence-splitter) across 5 languages and 7 edge cases: compound abbreviations, CJK quotes, newline wrapping, chat logs, URLs, and more.
TL;DR: yasbd ranked #1 in accuracy across almost every test, while staying competitive on speed as pure Python. blingfire is faster but brittle. pysbd and sentencex shred French abbreviations. nupunkt has an 11-second cold start. Full results, terminal output, and a performance graph can be found in the benchmarks/ directory.
📜 Last note
yasbd is maintained by speedyk-005. Licensed under Mozilla Public License 2.0 — you can use it in proprietary software, but modifications to the source files must stay open under MPL 2.0. Contributions are welcome; see CONTRIBUTING.md.