Fast Sentence Segmentation

Fast and efficient sentence segmentation using spaCy. Handles complex edge cases like abbreviations (Dr., Mr., etc.), quoted text, and multi-paragraph documents.

Features

  • Paragraph-aware segmentation: Returns sentences grouped by paragraph
  • Abbreviation handling: Correctly handles "Dr.", "Mr.", "etc." without false splits
  • Cached processing: LRU cache for repeated text processing
  • Flexible output: Nested lists (by paragraph) or flattened list of sentences
  • Bullet point & numbered list normalization: Cleans common list formats

Installation

pip install fast-sentence-segment

After installation, download the spaCy model:

python -m spacy download en_core_web_sm

Quick Start

from fast_sentence_segment import segment_text

text = "Here is a Dr. who says something. And then again, what else? I don't know. Do you?"

results = segment_text(text)
# Returns: [['Here is a Dr. who says something.', 'And then again, what else?', "I don't know.", 'Do you?']]
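
To see why abbreviation handling matters, the sketch below runs a naive regex splitter over the same text. It is not part of fast-sentence-segment; `naive_split` is a hypothetical baseline written only for comparison. It breaks the text after "Dr.", which is the kind of false split the library is described as avoiding.

```python
import re
from typing import List


def naive_split(text: str) -> List[str]:
    """Hypothetical baseline: split after every '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Here is a Dr. who says something. And then again, what else? I don't know. Do you?"
pieces = naive_split(text)

# The abbreviation is wrongly treated as a sentence end.
print(pieces[:2])  # ['Here is a Dr.', 'who says something.']
```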

Usage

Basic Segmentation

The segment_text function returns a list of lists, where each inner list represents a paragraph containing its sentences:

from fast_sentence_segment import segment_text

text = """First paragraph here. It has two sentences.

Second paragraph starts here. This one also has multiple sentences. And a third."""

results = segment_text(text)
# Returns:
# [
#     ['First paragraph here.', 'It has two sentences.'],
#     ['Second paragraph starts here.', 'This one also has multiple sentences.', 'And a third.']
# ]

Flattened Output

If you don't need paragraph boundaries, use the flatten parameter:

results = segment_text(text, flatten=True)
# Returns: ['First paragraph here.', 'It has two sentences.', 'Second paragraph starts here.', ...]
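
As a rough sketch of what flattening means, a nested result can be collapsed with `itertools.chain`. This assumes that `flatten=True` is equivalent to concatenating the paragraph lists in order, which the examples here suggest but do not state outright. The `nested` literal is copied from the example output above rather than produced by calling the library.

```python
from itertools import chain

# Nested result copied from the Basic Segmentation example (assumed, not computed here).
nested = [
    ["First paragraph here.", "It has two sentences."],
    ["Second paragraph starts here.", "This one also has multiple sentences.", "And a third."],
]

# Concatenate the per-paragraph lists into one list of sentences, preserving order.
flat = list(chain.from_iterable(nested))
print(len(flat))  # 5
```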

Direct Segmenter Access

For more control, use the Segmenter class directly:

from fast_sentence_segment import Segmenter

segmenter = Segmenter()
results = segmenter.input_text("Your text here.")

API Reference

Function               | Parameters                             | Returns         | Description
segment_text()         | input_text: str, flatten: bool = False | list            | Main entry point for segmentation
Segmenter.input_text() | input_text: str                        | list[list[str]] | Cached paragraph-aware segmentation
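
Segmenter.input_text() is described as cached, and the feature list mentions an LRU cache. The package's internals are not shown here, so the following is only a minimal sketch of the general technique using `functools.lru_cache`. The function `slow_segment`, its naive regex splitting, and the `maxsize` value are hypothetical stand-ins, not the library's code.

```python
import re
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=1024)  # maxsize is an arbitrary choice for this sketch
def slow_segment(text: str) -> Tuple[Tuple[str, ...], ...]:
    """Hypothetical stand-in for an expensive segmentation call.

    Returns tuples rather than lists so callers cannot mutate cached results.
    """
    # Split into paragraphs on blank lines, then into sentences naively.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return tuple(
        tuple(re.split(r"(?<=[.!?])\s+", p.strip())) for p in paragraphs
    )


slow_segment("One. Two.\n\nThree.")
slow_segment("One. Two.\n\nThree.")  # identical input: served from the cache
info = slow_segment.cache_info()
print(info.hits, info.misses)  # 1 1
```

Because the cache key is the input string, only exact repeats of the same text benefit from caching.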

Why Nested Lists?

The segmentation process preserves document structure by splitting text into paragraphs and then into sentences. The outer list represents the whole document, and each inner list holds the sentences of one paragraph. This is useful for:

  • Document structure analysis
  • Paragraph-level processing
  • Maintaining original text organization

Use flatten=True when you only need sentences without paragraph context.
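
As an illustration of paragraph-level processing, the sketch below walks a nested result and reports the number of sentences in each paragraph. The `results` literal mirrors the example output from Basic Segmentation rather than calling the library, so the snippet runs without spaCy installed.

```python
# Nested result in the shape segment_text() is documented to return (assumed sample data).
results = [
    ["First paragraph here.", "It has two sentences."],
    ["Second paragraph starts here.", "This one also has multiple sentences.", "And a third."],
]

# Sentence count per paragraph, keyed by paragraph position.
counts = {index: len(sentences) for index, sentences in enumerate(results)}
print(counts)  # {0: 2, 1: 3}

# Select the paragraph with the most sentences.
longest = max(results, key=len)
```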

Requirements

  • Python 3.8.5+
  • spaCy 3.5.3
  • en_core_web_sm spaCy model

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (make test)
  4. Commit your changes
  5. Push to the branch
  6. Open a Pull Request
