Fast Sentence Segmentation

Fast and efficient sentence segmentation using spaCy with surgical post-processing fixes. Handles complex edge cases like abbreviations (Dr., Mr., etc.), ellipses, quoted text, and multi-paragraph documents.

Why This Library?

  1. Keep it local: LLM API calls cost money and send your data to third parties. Run sentence segmentation entirely on your machine.
  2. spaCy perfected: spaCy is a strong local model, but its sentence boundaries are sometimes wrong. This library post-processes spaCy's output to correct common errors around abbreviations, ellipses, and ?/! boundaries.

Features

  • Paragraph-aware segmentation: Returns sentences grouped by paragraph
  • Abbreviation handling: Correctly handles "Dr.", "Mr.", "etc.", "p.m.", "a.m." without false splits
  • Ellipsis preservation: Keeps ... intact while detecting sentence boundaries
  • Question/exclamation splitting: Properly splits on ? and ! followed by capital letters
  • Cached processing: LRU cache for repeated text processing
  • Flexible output: Nested lists (by paragraph) or flattened list of sentences
  • Bullet point & numbered list normalization: Cleans common list formats
  • CLI tool: Command-line interface for quick segmentation
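
The cached-processing feature can be pictured with Python's standard functools.lru_cache. This is only an illustrative sketch: naive_segment is a hypothetical stand-in for the real segmenter, and the cache size is an assumption, not the library's actual setting.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)  # assumed size, chosen only for this illustration
def naive_segment(text: str) -> tuple:
    """Hypothetical stand-in for an expensive segmentation call."""
    # Return a tuple so callers cannot mutate the cached value.
    return tuple(part.strip() for part in text.split(". ") if part.strip())


naive_segment("One. Two. Three.")  # computed on the first call
naive_segment("One. Two. Three.")  # identical input is served from the cache
info = naive_segment.cache_info()  # reports hits, misses, and current size
```

Caching pays off when the same strings are segmented repeatedly, for example when re-running a pipeline over an unchanged corpus.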

Installation

pip install fast-sentence-segment

After installation, download the spaCy model:

python -m spacy download en_core_web_sm

Quick Start

from fast_sentence_segment import segment_text

text = "Do you like Dr. Who? I prefer Dr. Strange! Mr. T is also cool."

results = segment_text(text, flatten=True)
[
  "Do you like Dr. Who?",
  "I prefer Dr. Strange!",
  "Mr. T is also cool."
]

Notice how "Dr. Who?" stays together: the period after the title is treated as part of a name reference rather than as a sentence boundary, even when the single-word name that follows ends in ? or !.
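
One way to picture this kind of fix is a merge pass over fragments that were split right after a title. The sketch below is an assumption about the general technique, not the library's actual code; TITLES and merge_title_fragments are hypothetical names used only for illustration.

```python
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}  # assumed list, for illustration only


def merge_title_fragments(fragments: list[str]) -> list[str]:
    """Rejoin fragments that were wrongly split right after a title such as 'Dr.'."""
    merged: list[str] = []
    for fragment in fragments:
        previous_words = merged[-1].split() if merged else []
        if previous_words and previous_words[-1] in TITLES:
            # The previous fragment ends with a title, so the split was a false one.
            merged[-1] = merged[-1] + " " + fragment
        else:
            merged.append(fragment)
    return merged


fixed = merge_title_fragments(["Do you like Dr.", "Who?", "I prefer Dr.", "Strange!"])
```

Here fixed holds two sentences, "Do you like Dr. Who?" and "I prefer Dr. Strange!", because each fragment following a title is glued back onto it.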

Usage

Basic Segmentation

The segment_text function returns a list of lists, where each inner list represents a paragraph containing its sentences:

from fast_sentence_segment import segment_text

text = """Gandalf spoke softly. "All we have to decide is what to do with the time given us."

Frodo nodded. The weight of the Ring pressed against his chest."""

results = segment_text(text)
[
  [
    "Gandalf spoke softly.",
    "\"All we have to decide is what to do with the time given us.\"."
  ],
  [
    "Frodo nodded.",
    "The weight of the Ring pressed against his chest."
  ]
]

Flattened Output

If you don't need paragraph boundaries, use the flatten parameter:

text = "At 9 a.m. the hobbits set out. By 3 p.m. they reached Rivendell. Mr. Frodo was exhausted."

results = segment_text(text, flatten=True)
[
  "At 9 a.m. the hobbits set out.",
  "By 3 p.m. they reached Rivendell.",
  "Mr. Frodo was exhausted."
]

Direct Segmenter Access

For more control, use the Segmenter class directly:

from fast_sentence_segment import Segmenter

segmenter = Segmenter()
results = segmenter.input_text("Your text here.")

Command Line Interface

Segment text directly from the terminal:

# Text piped from stdin
echo "Have you seen Dr. Who? It's brilliant!" | segment
Have you seen Dr. Who?
It's brilliant!
# Numbered output
segment -n "Gandalf paused... You shall not pass! The Balrog roared."
1. Gandalf paused...
2. You shall not pass!
3. The Balrog roared.
# From file
segment -f silmarillion.txt

API Reference

Function | Parameters | Returns | Description
segment_text() | input_text: str, flatten: bool = False | list | Main entry point for segmentation
Segmenter.input_text() | input_text: str | list[list[str]] | Cached paragraph-aware segmentation

CLI Options

Option | Description
text | Text to segment (positional argument)
-f, --file | Read text from file
-n, --numbered | Number output lines
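
For readers who want to build a similar interface, the options above map naturally onto Python's argparse. This is a hypothetical sketch, not the library's actual CLI source; build_parser and format_lines are illustrative names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented options; everything else here is assumed."""
    parser = argparse.ArgumentParser(prog="segment")
    parser.add_argument("text", nargs="?", help="Text to segment")
    parser.add_argument("-f", "--file", help="Read text from file")
    parser.add_argument("-n", "--numbered", action="store_true", help="Number output lines")
    return parser


def format_lines(sentences: list[str], numbered: bool = False) -> list[str]:
    """Prefix each sentence with '1. ', '2. ', ... when numbering is requested."""
    if numbered:
        return [f"{i}. {sentence}" for i, sentence in enumerate(sentences, start=1)]
    return list(sentences)


args = build_parser().parse_args(["-n", "Gandalf paused... You shall not pass!"])
lines = format_lines(["Gandalf paused...", "You shall not pass!"], numbered=args.numbered)
```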

Why Nested Lists?

The segmentation process preserves document structure by segmenting into both paragraphs and sentences. The outer list represents the document, and each inner list holds one paragraph's sentences. This is useful for:

  • Document structure analysis
  • Paragraph-level processing
  • Maintaining original text organization

Use flatten=True when you only need sentences without paragraph context.
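
The relationship between the two output shapes can be shown with plain Python. The nested list below is hand-written sample data in the documented shape, not captured library output.

```python
from itertools import chain

# Hand-written sample in the nested shape: one inner list per paragraph.
nested = [
    ["Gandalf spoke softly.", "Frodo listened."],
    ["The Ring was heavy."],
]

# Flattening keeps sentence order but discards paragraph boundaries.
flat = list(chain.from_iterable(nested))

# Paragraph-level processing stays simple with the nested form.
sentences_per_paragraph = [len(paragraph) for paragraph in nested]
```

Keeping the nested form until the last step means paragraph information is available for as long as it is needed.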

Requirements

  • Python 3.9+
  • spaCy 3.8+
  • en_core_web_sm spaCy model

How It Works

This library uses spaCy for initial sentence segmentation, then applies surgical post-processing fixes for cases where spaCy's default behavior is incorrect:

  1. Pre-processing: Normalize numbered lists, preserve ellipses with placeholders
  2. spaCy segmentation: Use spaCy's sentence boundary detection
  3. Post-processing: Split on abbreviation boundaries, handle ?/! + capital patterns
  4. Denormalization: Restore placeholders to original text
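
The four steps above can be sketched with the standard library alone. This is a simplified illustration, not the library's implementation: a naive regex split stands in for spaCy in step 2, and PLACEHOLDER and sketch_segment are hypothetical names.

```python
import re

PLACEHOLDER = "<ELLIPSIS>"  # hypothetical marker; the library's real placeholder may differ

# Split on whitespace that follows ?, !, or a protected ellipsis and precedes a capital.
BOUNDARY = re.compile(
    r"(?:(?<=[?!])|(?<=" + re.escape(PLACEHOLDER) + r"))\s+(?=[A-Z])"
)


def sketch_segment(text: str) -> list[str]:
    # 1. Pre-processing: hide ellipses so their dots are not read as full stops.
    protected = text.replace("...", PLACEHOLDER)
    # 2. Base segmentation: naive split after each period (stand-in for spaCy;
    #    unlike the real library, it would wrongly split after "Dr.").
    rough = re.split(r"(?<=\.)\s+", protected)
    # 3. Post-processing: apply the extra boundary rule to each rough chunk.
    sentences: list[str] = []
    for chunk in rough:
        sentences.extend(BOUNDARY.split(chunk))
    # 4. Denormalization: restore the original ellipses.
    return [s.replace(PLACEHOLDER, "...").strip() for s in sentences if s.strip()]


result = sketch_segment("Gandalf paused... You shall not pass! The Balrog roared. It fell.")
```

Protecting the ellipsis before splitting is what lets step 2 stay simple: the splitter never sees the three dots, so it cannot break them apart.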

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (make test)
  4. Commit your changes
  5. Push to the branch
  6. Open a Pull Request

