yasbd-lib

A high-accuracy, from-scratch Sentence Boundary Detector (SBD) for production pipelines. Features a drop-in adapter for pysbd to fix edge cases without heavy refactoring.

“Even a pair of scissors deserves to be smart. Welcome to cybernetic boundary shearing.”

[!WARNING] This project is currently in alpha.

Table of Contents generated with DocToc

Manifesto
- ✂ Why do I need a pair of "smart scissors" for text?
- 🔪 Are these shears just a rusty regex loop spray-painted in carbon fiber?
📦 Installation
📟 Usage
🗺 Features & Roadmap
🏁 Benchmarks
📜 Last note

Manifesto

Yet Another Sentence Boundary Detector (yasbd) is a pair of smart scissors for text: a pointer-based, from-scratch SBD for production pipelines. It features a drop-in adapter for pysbd to fix edge cases without heavy refactoring. Five languages are supported today (en, fr, es, ht, ja); the target is 22+.

✂ Why do I need a pair of "smart scissors" for text?

The usual alternative is running re.split(r'\.\s+[A-Z]', text) and praying. That blunt tool instantly shears titles like Mr. Smith or French corporate markers like Sté. Générale in half, scattering semantic fragments across your pipeline. Punctuation is the most overloaded glyph set in text: a period alone does six jobs, and only one of them is "sentence end." Generic split-on-punctuation fails on:

- Dr. Inc. U.S.A. (abbreviation markers, not boundaries. ~47% of periods in news text are these)
- 3.5M 3.14 (decimal points, not sentence ends)
- D. H. Lawrence (initials. Two periods, zero boundaries)
- ... (ellipsis. Trailing off or sentence end? Ambiguous)
- 1. a. at line start (inline list markers impersonating sentence ends)
- ?! inside quotes (punctuation nesting across boundaries)

And multilingual quirks a naive splitter never saw coming.
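To see the failure concretely, here is a minimal sketch of a naive splitter using only Python's standard re module. The name naive_split and the sample sentences are illustrative, not part of yasbd. The pattern is a slightly more forgiving variant of the one above: it uses lookarounds so the period and the capital letter are kept rather than consumed.

```python
import re

# Split on whitespace that follows a period and precedes a capital letter.
NAIVE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")

def naive_split(text):
    """Illustrative naive splitter; not part of yasbd."""
    return NAIVE_BOUNDARY.split(text)

print(naive_split("Mr. Smith arrived. He left."))
# ['Mr.', 'Smith arrived.', 'He left.']
print(naive_split("D. H. Lawrence wrote novels."))
# ['D.', 'H.', 'Lawrence wrote novels.']
```

The title and both initials are cut off as separate "sentences," which is exactly the kind of fragment listed above.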

🔪 Are these shears just a rusty regex loop spray-painted in carbon fiber?

Regex is how I cut. Not what I am. My brain is a two-pass pipeline. Pass one finds every possible boundary, greedy and over-inclusive. Pass two surgically removes false positives by cross-referencing 150+ curated abbreviations across 8 semantic categories, checking context before and after each candidate. Quote spans, parentheses, list markers, ellipsis, contiguous terminators -- each gets its own refiner.
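The two-pass idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not yasbd's actual code: the abbreviation set, the single-letter "initial" heuristic, and every function name below are made up for the example, and only one refiner (abbreviations and initials) is shown.

```python
import re

# Toy abbreviation set (an assumption for this sketch); real lists are larger.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "sr", "sra"}

# Pass one pattern: any run of terminators followed by whitespace or end of text.
CANDIDATE = re.compile(r"[.!?]+(?=\s|$)")

def find_candidates(text):
    """Pass one: collect the end offset of every possible boundary."""
    return [m.end() for m in CANDIDATE.finditer(text)]

def is_false_positive(text, end):
    """Pass two check: is the token before the terminator an abbreviation or initial?"""
    words = text[:end].rstrip(".!?").split()
    token = words[-1].lower() if words else ""
    is_initial = len(token) == 1 and token.isalpha()
    return token in ABBREVIATIONS or is_initial

def detect_boundaries(text):
    """Run both passes and return the surviving end offsets."""
    return [end for end in find_candidates(text) if not is_false_positive(text, end)]

print(detect_boundaries("Dr. Smith met D. H. Lawrence. They talked."))
# [29, 42]
```

Pass one proposes five boundaries here; pass two discards the three that follow Dr., D., and H. The single-letter heuristic is deliberately crude (it would also reject a sentence ending in "I."), which is why a real refiner needs to check context on both sides of each candidate.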

📦 Installation

Ready to do some cybernetic boundary shearing? Let's get you set up quickly and painlessly.

The Quick & Easy Way

The simplest way to get started is with pip:

pip install yasbd-lib

[!TIP] Termux (Android)

No Rust toolchain? Install pydantic-core pre-built wheels first, then retry:
pip install typing-extensions
pip install pydantic-core --index-url https://termux-user-repository.github.io/pypi/
pip install "pydantic>=2.12.4,<2.13"

That's it! Blade is armed.

The From-Source Way

Prefer building from source? Clone and install manually for full control:

git clone https://github.com/speedyk-005/yasbd-lib.git
cd yasbd-lib
pip install .

(But honestly, the pip way is way easier.)

Want to Help Make yasbd Even Better?

That's awesome. See the Contributing Guide (CONTRIBUTING.md).

📟 Usage

[!TIP] Looking for the pysbd drop-in replacement? Jump straight to the Adapter section.

Initialization

from yasbd.boundary_detector import BoundaryDetector
# Or from yasbd import BoundaryDetector

# Basic setup
detector = BoundaryDetector(lang="en")

# With all options (so far)
detector = BoundaryDetector(
    # ISO 639 code (e.g., en, fr, es, ...). Defaults to `en`.
    # https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
    lang="fr",

    # Don't split inside quotes or parentheses (block quotes are not protected).
    # Defaults to `True`.
    # https://en.wikipedia.org/wiki/Block_quotation
    preserve_quote_and_paren=True,

    # Enable verbose logging. Defaults to `False`.
    verbose=True,
)

Switching languages at runtime is a single property assignment:

detector.lang = "es"

The rule module loads lazily on first access. Switching mid-stream reimports the module and rebinds the pattern cache. Zero config, no restarts needed.
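For readers curious what "lazy load plus cache rebind" can look like, here is a minimal sketch of the general pattern. It is an assumption-laden stand-in, not yasbd's implementation: the RULES dictionary replaces real per-language rule modules, and LazyDetector, pattern, and compile_count are hypothetical names used only to make the behavior observable.

```python
import re

# Hypothetical per-language rules; stand-ins for real rule modules.
RULES = {
    "en": {"terminators": r"[.!?]"},
    "ja": {"terminators": r"[。！？]"},
}

class LazyDetector:
    """Toy detector that compiles its pattern only when first needed."""

    def __init__(self, lang="en"):
        self._lang = lang
        self._pattern = None      # nothing compiled until first access
        self.compile_count = 0    # exposed only to show when compilation happens

    @property
    def lang(self):
        return self._lang

    @lang.setter
    def lang(self, value):
        if value not in RULES:
            raise ValueError(f"unsupported language: {value!r}")
        self._lang = value
        self._pattern = None      # drop the cache; rebuilt on next access

    @property
    def pattern(self):
        if self._pattern is None:
            self._pattern = re.compile(RULES[self._lang]["terminators"])
            self.compile_count += 1
        return self._pattern
```

Constructing the object compiles nothing, repeated access reuses the cached pattern, and assigning lang only invalidates the cache, so the cost of switching is paid once on the next use.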

Boundary detection

detect() tells you where each sentence stops. Integer offsets into the original string. No copies, no slicing, no bookkeeping. Feed them to whatever downstream logic you already have.

Two detection modes:

- absolute (default): offsets count from the start of the entire input stream.
- relative: offsets reset at each paragraph boundary. A ParagraphEOF sentinel signals the gap between paragraphs.

# absolute mode (default)
res = list(detector.detect('She turned to him, "This is great." She held the book out to show him.'))
print(res)
# [35, 70]

# relative mode with paragraph break
detector.lang = "es"
res = list(detector.detect(
    "El Sr. García llegó ayer. La Sra. López también.\n\nVéase la pág. 55 del libro.",
    relative=True,
))
print(res)
# [25, 48, ParagraphEOF, 27]
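If you do want to consume absolute offsets yourself, slicing is a short loop. The sketch below hard-codes the offsets from the absolute-mode example above so that it runs without yasbd installed; slice_by_offsets is a hypothetical helper, not a library function.

```python
def slice_by_offsets(text, offsets):
    """Yield the substring between consecutive end offsets."""
    start = 0
    for end in offsets:
        yield text[start:end]
        start = end

text = 'She turned to him, "This is great." She held the book out to show him.'
print(list(slice_by_offsets(text, [35, 70])))
# ['She turned to him, "This is great."', ' She held the book out to show him.']
```

Each offset is the exclusive end of a sentence, so whitespace between sentences stays attached to the start of the following slice until you strip it.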

Segmentation

If you do not want to manage boundary offsets yourself (and who would?), segment() wraps detect() with string slicing. It yields sentences as strings, one at a time. By default it strips leading and trailing whitespace and drops empty results. Set preserve_whitespace=True to keep original spacing around boundaries.

detector.lang = "en"

# Basic sentence splitting
res = list(detector.segment("Hello world. How are you? I am fine."))
print(res)
# ['Hello world.', 'How are you?', 'I am fine.']

# Multi-paragraph with whitespace preserved
res = list(detector.segment(
    "First para.\nStill first.\n\nSecond para.\nFinished.",
    preserve_whitespace=True,
))
print(res)
# ['First para.', '\nStill first.', '\n\n', 'Second para.', '\nFinished.']

[!TIP] Inputs & streaming — detect() and segment() accept plain strings, open file streams (TextIOBase), or a StreamCleaner. Both are generators: they yield results lazily without loading the entire source into memory. Internally, the text is split on blank lines into paragraphs, and each paragraph is processed independently with offset tracking between them.

[!TIP] ParagraphStream — yasbd uses ParagraphStream internally to split text into paragraph blocks. You can import it directly if you need paragraph-level processing in your own code:
from yasbd.utils.paragraph_stream import ParagraphStream

# `text` is your input, defined elsewhere
for para in ParagraphStream(text):
    print(para)  # each paragraph block

You can also skip empty lines with skip_empty_lines=True.
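The "split on blank lines, track offsets, stay lazy" behavior described in the tips above can be approximated with a small generator. This is a sketch of the general technique, assuming a line-iterable text stream; iter_paragraphs is a hypothetical name and makes no claim about how ParagraphStream is written.

```python
import io

def iter_paragraphs(stream):
    """Lazily yield (start_offset, paragraph_text), splitting on blank lines."""
    offset = 0   # characters consumed so far
    start = 0    # offset where the current paragraph began
    buf = []
    for line in stream:
        if line.strip():
            if not buf:
                start = offset
            buf.append(line)
        elif buf:
            # A blank line closes the paragraph collected so far.
            yield start, "".join(buf)
            buf = []
        offset += len(line)
    if buf:
        yield start, "".join(buf)

stream = io.StringIO("First para.\nStill first.\n\nSecond para.\n")
for start, para in iter_paragraphs(stream):
    print(start, repr(para))
# 0 'First para.\nStill first.\n'
# 26 'Second para.\n'
```

Because only one paragraph is buffered at a time, the same loop works on an open file of any size, and the recorded start offset is what lets paragraph-local positions be mapped back to positions in the whole input.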

Cleaner

OCR'd a PDF or scraped noisy text? StreamCleaner normalizes paragraphs before they hit the detector:

from yasbd.utils.cleaner import StreamCleaner

cleaner = StreamCleaner("Hello  world.   This is  messy.")
list(cleaner)
# ['Hello world. This is messy.']

It collapses multiple spaces, strips HTML tags, removes page numbers, re-joins hyphenated words split across lines, and more. Pass it directly to detect() or segment() instead of a string.
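To give a feel for that kind of normalization, here is a rough standard-library sketch covering three of the steps listed above (HTML tags, hyphenated line breaks, repeated whitespace). The function clean_paragraph and its regexes are assumptions made for illustration; StreamCleaner's real rules are likely more careful.

```python
import re

def clean_paragraph(text):
    """Toy cleanup pass; not StreamCleaner's actual implementation."""
    text = re.sub(r"<[^>]+>", "", text)           # strip HTML tags
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # re-join words hyphenated across lines
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip()

print(clean_paragraph("Hello  world.   This is  messy."))
# Hello world. This is messy.
print(clean_paragraph("<p>A bound-\nary case.</p>"))
# A boundary case.
```

Order matters in a pass like this: the hyphen re-join has to run before whitespace is collapsed, otherwise the newline it looks for is already gone.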

Adapter

Migrating from pysbd? Swap the import and keep your pipeline:

# Before: from pysbd import Segmenter
from yasbd.utils.pysbd_adapter import Segmenter

seg = Segmenter(language="ja")
res = seg.segment('田中さんは「準備は完了しました」そう言って部屋を出た。Ｕ．Ｓ．Ａ．の経済政策は非常に複雑です。')
print(res)
# ['田中さんは「準備は完了しました」そう言って部屋を出た。', 'Ｕ．Ｓ．Ａ．の経済政策は非常に複雑です。']

Same API surface. Same Segmenter class. Same segment() method. Even the TextSpan class is there with sent, start, and end fields, hurray. It also handles leading whitespace the way pysbd expects it (trailing on the previous sentence instead of leading on the next).
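The whitespace convention mentioned above is easy to picture with a small sketch that turns end offsets into spans. Everything here is illustrative: Span is a stand-in dataclass with the same three field names, the offsets are hand-written rather than produced by yasbd, and spans_with_trailing_whitespace is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """Stand-in with the same field names as TextSpan."""
    sent: str
    start: int
    end: int

def spans_with_trailing_whitespace(text, offsets):
    """Build spans where whitespace after a boundary joins the previous sentence."""
    spans = []
    start = 0
    for end in offsets:
        # Extend the span over any whitespace that follows the boundary.
        while end < len(text) and text[end].isspace():
            end += 1
        if end > start:
            spans.append(Span(text[start:end], start, end))
        start = end
    return spans

text = "Hello world. How are you? I am fine."
for span in spans_with_trailing_whitespace(text, [12, 25, 36]):
    print(span)
# Span(sent='Hello world. ', start=0, end=13)
# Span(sent='How are you? ', start=13, end=26)
# Span(sent='I am fine.', start=26, end=36)
```

With this convention the spans tile the input exactly, so concatenating every sent value reproduces the original text.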

🗺 Features & Roadmap

Regex caching (compile once per language class)
Drop-in pysbd adapter (same API, no pipeline changes)
StreamCleaner for OCR'd and noisy text
spaCy integration
22+ language targets
CLI tool
REST API for remote boundary detection

🏁 Benchmarks

Tested against 6 competitors (pysbd, sentencex, sentsplit, nupunkt, blingfire, sentence-splitter) across 5 languages and 7 edge cases: compound abbreviations, CJK quotes, newline wrapping, chat logs, URLs, and more.

TL;DR: yasbd ranked #1 in accuracy across almost every test while staying competitive on speed as a pure-Python library. blingfire is faster but brittle. pysbd and sentencex shred French abbreviations. nupunkt has an 11-second cold start. Full results, terminal output, and a performance graph are in benchmarks/.

📜 Last note

yasbd is maintained by speedyk-005. Licensed under Mozilla Public License 2.0 — you can use it in proprietary software, but modifications to the source files must stay open under MPL 2.0. Contributions are welcome; see CONTRIBUTING.md.

Release history

0.1.1: May 30, 2026
0.1.0: May 29, 2026 (this version)
