
A high-accuracy, from-scratch Sentence Boundary Detector (SBD) for production pipelines. Features a drop-in adapter for pysbd to fix edge cases without heavy refactoring.


“Even a pair of scissors deserves to be smart. Welcome to cybernetic boundary shearing.”


[!WARNING] This project is currently in alpha.




Manifesto

Yet Another Sentence Boundary Detector (yasbd) is a pair of smart scissors for text: a pointer-based, from-scratch SBD for production pipelines. It features a drop-in adapter for pysbd to fix edge cases without heavy refactoring. Five languages are supported today (en, fr, es, ht, ja). The target is 22+.

✂ Why do I need a pair of "smart scissors" for text?

The usual alternative is running re.split(r'\.\s+[A-Z]') and praying. That blunt tool instantly shears titles like Mr. Smith or French corporate markers like Sté. Générale in half, scattering semantic fragments across your pipeline. Punctuation is the most overloaded glyph set in text. A period alone does six jobs and only one is "sentence end." Generic split-on-punctuation fails on:

  • Dr. Inc. U.S.A. (abbreviation markers, not boundaries. ~47% of periods in news text are these)
  • 3.5M 3.14 (decimal points, not sentence ends)
  • D. H. Lawrence (initials. Two periods, zero boundaries)
  • ... (ellipsis. Trailing off or sentence end? ambiguous)
  • 1. a. at line start (inline list markers impersonating sentence ends)
  • ?! inside quotes (punctuation nesting across boundaries)

And multilingual quirks a naive splitter never saw coming.
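To see the damage concretely, here is the naive pattern quoted above run with nothing but Python's standard re module. The sample sentence is made up for illustration.

```python
import re

text = "Mr. Smith arrived. He sat down."

# The naive pattern: a period, whitespace, then a capital letter.
# The capital letter is part of the match, so re.split() swallows it too.
pieces = re.split(r"\.\s+[A-Z]", text)
print(pieces)
# ['Mr', 'mith arrived', 'e sat down.']
```

The title is cut off from the name, and each fragment loses its first letter on top of that.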

🔪 Are these shears just a rusty regex loop spray-painted in carbon fiber?

Regex is how I cut. Not what I am. My brain is a two-pass pipeline. Pass one finds every possible boundary, greedy and over-inclusive. Pass two surgically removes false positives by cross-referencing 150+ curated abbreviations across 8 semantic categories, checking context before and after each candidate. Quote spans, parentheses, list markers, ellipsis, contiguous terminators -- each gets its own refiner.
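The two-pass idea can be sketched in a few lines. This is a toy illustration, not yasbd's actual code: the abbreviation set, the regexes, and the function names candidate_boundaries and refine are all assumptions made up for this example, and the real refiners cover far more cases.

```python
import re

# Hypothetical, tiny abbreviation set; the real curated lists are much larger.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "sr", "sra"}

# A run of terminators followed by whitespace or the end of the text.
CANDIDATE = re.compile(r"[.!?]+(?=\s|$)")
LAST_WORD = re.compile(r"(\w+)$")


def candidate_boundaries(text):
    """Pass one: greedy and over-inclusive, returns every possible end offset."""
    return [m.end() for m in CANDIDATE.finditer(text)]


def refine(text, candidates):
    """Pass two: drop candidates whose preceding word is a known abbreviation."""
    kept = []
    for end in candidates:
        head = text[:end].rstrip(".!?")      # text before the terminator run
        match = LAST_WORD.search(head)
        word = match.group(1).lower() if match else ""
        if text[end - 1] == "." and word in ABBREVIATIONS:
            continue                          # false positive, not a boundary
        kept.append(end)
    return kept


text = "Dr. Smith joined Acme Inc. in May. He left in June."
candidates = candidate_boundaries(text)
print(candidates)                 # [3, 26, 34, 51]
print(refine(text, candidates))   # [34, 51]
```

Keeping the passes separate means the first one can stay simple and over-eager, while each kind of false positive gets its own small, testable filter.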


📦 Installation

Ready to do some cybernetic boundary shearing? Let's get you set up quickly and painlessly.

The Quick & Easy Way

The simplest way to get started is with pip:

pip install yasbd-lib

[!TIP] Termux (Android)

No Rust toolchain? Install pydantic-core pre-built wheels first, then retry:

pip install typing-extensions
pip install pydantic-core --index-url https://termux-user-repository.github.io/pypi/
pip install "pydantic>=2.12.4,<2.13"

That's it! Blade is armed.

The From-Source Way

Prefer building from source? Clone and install manually for full control:

git clone https://github.com/speedyk-005/yasbd-lib.git
cd yasbd-lib
pip install .

(But honestly, the pip way is way easier.)

Want to Help Make yasbd Even Better?

That's awesome. See Contributing Guide.


📟 Usage

[!TIP] Looking for the pysbd drop-in replacement? Jump straight to the Adapter section.

Initialization

from yasbd.boundary_detector import BoundaryDetector
# Or from yasbd import BoundaryDetector

# Basic setup
detector = BoundaryDetector(lang="en")

# With all options (so far)
detector = BoundaryDetector(
    # ISO 639 code (e.g., en, fr, es, ...). Defaults to `en`.
    # https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
    lang="fr",

    # Don't split inside quotes or parentheses (block quotes are not protected). Defaults to `True`.
    # https://en.wikipedia.org/wiki/Block_quotation
    preserve_quote_and_paren=True,

    # Enable verbose logging. Defaults to `False`.
    verbose=True,
)

Switching languages at runtime is a property set:

detector.lang = "es"

The rule module loads lazily on first access. Switching mid-stream reimports the module and rebinds the pattern cache. Zero config, no restarts needed.
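For readers curious how a property set can trigger a lazy reload, here is a minimal sketch of the general pattern. It is not yasbd's implementation: the LazyDetector class, the _RULES table, and the loads counter are hypothetical stand-ins, and a plain dict plays the role of the per-language rule modules.

```python
import re

# Hypothetical per-language terminator rules standing in for real rule modules.
_RULES = {"en": r"[.!?]", "ja": r"[。！？]"}


class LazyDetector:
    def __init__(self, lang="en"):
        self._lang = lang
        self._pattern = None   # nothing is compiled until first access
        self.loads = 0         # counts how often rules were (re)loaded

    @property
    def lang(self):
        return self._lang

    @lang.setter
    def lang(self, value):
        if value not in _RULES:
            raise ValueError(f"unsupported language: {value!r}")
        self._lang = value
        self._pattern = None   # invalidate the cache; reload happens lazily

    @property
    def pattern(self):
        if self._pattern is None:
            self._pattern = re.compile(_RULES[self._lang])
            self.loads += 1
        return self._pattern


detector = LazyDetector()
detector.pattern
detector.pattern            # second access reuses the cached pattern
detector.lang = "ja"        # switching only resets the cache
found = detector.pattern.search("はい。") is not None
print(detector.loads, found)
# 2 True
```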

Boundary detection

detect() tells you where each sentence stops. Integer offsets into the original string. No copies, no slicing, no bookkeeping. Feed them to whatever downstream logic you already have.

Two detection modes:

  • absolute: (default) offsets count from the start of the entire input stream.
  • relative: offsets reset at each paragraph boundary. A ParagraphEOF sentinel signals the gap between paragraphs.

# absolute mode (default)
res = list(detector.detect('She turned to him, "This is great." She held the book out to show him.'))
print(res)
# [35, 70]

# relative mode with paragraph break
detector.lang = "es"
res = list(detector.detect(
    "El Sr. García llegó ayer. La Sra. López también.\n\nVéase la pág. 55 del libro.",
    relative=True,
))
print(res)
# [25, 48, ParagraphEOF, 27]
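If you do want to consume the offsets yourself, a small helper is enough for absolute mode. The sketch below assumes each offset is an exclusive end index, which matches the [35, 70] output for the 70-character string above. slice_by_offsets is a hypothetical helper written for this example, not part of yasbd, and it does not handle the ParagraphEOF sentinel from relative mode.

```python
def slice_by_offsets(text, ends):
    """Yield the substrings between consecutive absolute end offsets."""
    start = 0
    for end in ends:
        yield text[start:end]
        start = end


text = 'She turned to him, "This is great." She held the book out to show him.'
ends = [35, 70]  # the offsets from the absolute-mode example above
sentences = list(slice_by_offsets(text, ends))
print(sentences)
# ['She turned to him, "This is great."', ' She held the book out to show him.']
```

The leading space on the second piece is kept on purpose: raw offsets say nothing about trimming, so that decision stays with the caller.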

Segmentation

If you do not want to manage boundary offsets yourself (and who would?), segment() wraps detect() with string slicing. It yields sentences as strings, one at a time. By default it strips leading and trailing whitespace and drops empty results. Set preserve_whitespace=True to keep original spacing around boundaries.

detector.lang = "en"

# Basic sentence splitting
res = list(detector.segment("Hello world. How are you? I am fine."))
print(res)
# ['Hello world.', 'How are you?', 'I am fine.']

# Multi-paragraph with whitespace preserved
res = list(detector.segment(
    "First para.\nStill first.\n\nSecond para.\nFinished.",
    preserve_whitespace=True,
))
print(res)
# ['First para.', '\nStill first.', '\n\n', 'Second para.', '\nFinished.']

[!TIP] Inputs & streaming: detect() and segment() accept plain strings, open file streams (TextIOBase), or a StreamCleaner. Both are generators: they yield results lazily without loading the entire source into memory. Internally, the text is split on blank lines into paragraphs, and each paragraph is processed independently with offset tracking between them.

[!TIP] ParagraphStream — yasbd uses ParagraphStream internally to split text into paragraph blocks. You can import it directly if you need paragraph-level processing in your own code:

from yasbd.utils.paragraph_stream import ParagraphStream

for para in ParagraphStream(text):
    print(para)  # each paragraph block

You can also skip empty lines with skip_empty_lines=True.
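The paragraph splitting with offset tracking described above can be pictured with a small generator. This is an assumption-laden sketch, not ParagraphStream itself: iter_paragraphs is a hypothetical function, and it simply treats any whitespace-only line as a paragraph break while reading the stream one line at a time.

```python
import io


def iter_paragraphs(stream):
    """Lazily yield (start_offset, paragraph_text) pairs from a text stream."""
    offset = 0    # characters consumed so far
    start = 0     # offset where the current paragraph began
    buffer = []
    for line in stream:
        if line.strip():
            if not buffer:
                start = offset
            buffer.append(line)
        elif buffer:
            # Blank line: the current paragraph is complete.
            yield start, "".join(buffer)
            buffer = []
        offset += len(line)
    if buffer:
        yield start, "".join(buffer)


stream = io.StringIO("First para.\nStill first.\n\nSecond para.\nFinished.")
paragraphs = list(iter_paragraphs(stream))
print(paragraphs)
# [(0, 'First para.\nStill first.\n'), (26, 'Second para.\nFinished.')]
```

Because each paragraph carries its starting offset, positions found inside a paragraph can be converted to absolute ones by simple addition.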

Cleaner

OCR'd a PDF or scraping noisy text? StreamCleaner normalizes paragraphs before they hit the detector:

from yasbd.utils.cleaner import StreamCleaner

cleaner = StreamCleaner("Hello  world.   This is  messy.")
list(cleaner)
# ['Hello world. This is messy.']

It collapses multiple spaces, strips HTML tags, removes page numbers, re-joins hyphenated words split across lines, and more. Pass it directly to detect() or segment() instead of a string.
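A rough idea of what such normalization involves, using only the standard library. This is not StreamCleaner's code: clean_paragraph and its three regexes are assumptions made for illustration, and they cover only tag stripping, hyphen re-joining, and whitespace collapsing.

```python
import re


def clean_paragraph(text):
    """Apply three simple normalization steps to one paragraph."""
    text = re.sub(r"<[^>]+>", "", text)                   # strip HTML tags
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)    # re-join words hyphenated across lines
    text = re.sub(r"\s+", " ", text)                      # collapse runs of whitespace
    return text.strip()


print(clean_paragraph("Hello  world.   This is  messy."))
# Hello world. This is messy.
print(clean_paragraph("<p>A sen-\ntence.</p>"))
# A sentence.
```

Note that the naive hyphen rule would also merge genuine compounds broken at a line end (well-\nknown becomes wellknown), which is one reason a real cleaner needs more care than this.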

Adapter

Migrating from pysbd? Swap the import and keep your pipeline:

# Before: from pysbd import Segmenter
from yasbd.utils.pysbd_adapter import Segmenter

seg = Segmenter(language="ja")
res = seg.segment('田中さんは「準備は完了しました」そう言って部屋を出た。U.S.A.の経済政策は非常に複雑です。')
print(res)
# ['田中さんは「準備は完了しました」そう言って部屋を出た。', 'U.S.A.の経済政策は非常に複雑です。']

Same API surface. Same Segmenter class. Same segment() method. Even the TextSpan class is there with sent, start, and end fields, hurray. It also handles leading whitespace the way pysbd expects it (trailing on the previous sentence instead of leading on the next).
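The whitespace convention mentioned above is easy to picture with offsets. In this sketch, Span is a hypothetical stand-in for TextSpan (same three field names, nothing more), and spans_with_trailing_whitespace is an illustrative helper, not the adapter's real code: it pushes each boundary forward past any whitespace so the gap belongs to the previous sentence.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for TextSpan, with the same field names."""
    sent: str
    start: int
    end: int


def spans_with_trailing_whitespace(text, ends):
    """Build spans where inter-sentence whitespace trails the previous sentence."""
    spans = []
    start = 0
    for end in ends:
        # Move the boundary past any whitespace that follows it.
        while end < len(text) and text[end].isspace():
            end += 1
        spans.append(Span(text[start:end], start, end))
        start = end
    return spans


text = "Hello world. How are you?"
spans = spans_with_trailing_whitespace(text, [12, 25])
print(spans)
# [Span(sent='Hello world. ', start=0, end=13), Span(sent='How are you?', start=13, end=25)]
```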


🗺 Features & Roadmap

  • Regex caching (compile once per language class)
  • Drop-in pysbd adapter (same API, no pipeline changes)
  • StreamCleaner for OCR'd and noisy text
  • spaCy integration
  • 22+ language targets
  • CLI tool
  • REST API for remote boundary detection

🏁 Benchmarks

Tested against 6 competitors (pysbd, sentencex, sentsplit, nupunkt, blingfire, sentence-splitter) across 5 languages and 7 edge cases: compound abbreviations, CJK quotes, newline wrapping, chat logs, URLs, and more.

TL;DR: yasbd ranked #1 in accuracy across almost every test, while staying competitive on speed as pure Python. blingfire is faster but brittle. pysbd and sentencex shred French abbreviations. nupunkt has an 11-second cold start. Full results, terminal output, and a performance graph can be found in benchmarks/.


📜 Last note

yasbd is maintained by speedyk-005. Licensed under Mozilla Public License 2.0 — you can use it in proprietary software, but modifications to the source files must stay open under MPL 2.0. Contributions are welcome; see CONTRIBUTING.md.
