Fast and Efficient Sentence Segmentation
Fast Sentence Segmentation
Fast and efficient sentence segmentation using spaCy. Handles complex edge cases like abbreviations (Dr., Mr., etc.), quoted text, and multi-paragraph documents.
Features
- Paragraph-aware segmentation: Returns sentences grouped by paragraph
- Abbreviation handling: Correctly handles "Dr.", "Mr.", "etc." without false splits
- Cached processing: LRU cache for repeated text processing
- Flexible output: Nested lists (by paragraph) or flattened list of sentences
- Bullet point & numbered list normalization: Cleans common list formats
Installation
pip install fast-sentence-segment
After installation, download the spaCy model:
python -m spacy download en_core_web_sm
Quick Start
from fast_sentence_segment import segment_text
text = "Here is a Dr. who says something. And then again, what else? I don't know. Do you?"
results = segment_text(text)
# Returns: [['Here is a Dr. who says something.', 'And then again, what else?', "I don't know.", 'Do you?']]
Usage
Basic Segmentation
The segment_text function returns a list of lists, where each inner list represents a paragraph containing its sentences:
from fast_sentence_segment import segment_text
text = """First paragraph here. It has two sentences.
Second paragraph starts here. This one also has multiple sentences. And a third."""
results = segment_text(text)
# Returns:
# [
# ['First paragraph here.', 'It has two sentences.'],
# ['Second paragraph starts here.', 'This one also has multiple sentences.', 'And a third.']
# ]
Flattened Output
If you don't need paragraph boundaries, pass flatten=True to get a single list of sentences:
results = segment_text(text, flatten=True)
# Returns: ['First paragraph here.', 'It has two sentences.', 'Second paragraph starts here.', ...]
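For intuition, flattening the nested result amounts to concatenating the per-paragraph lists in order. The sketch below reproduces that with itertools.chain on the nested output from the Basic Segmentation example. The helper name flatten_paragraphs is hypothetical and not part of the package, and the library's own flatten=True path may be implemented differently.

```python
from itertools import chain
from typing import List


def flatten_paragraphs(paragraphs: List[List[str]]) -> List[str]:
    """Concatenate per-paragraph sentence lists into one list, preserving order."""
    return list(chain.from_iterable(paragraphs))


# Nested output copied from the Basic Segmentation example.
nested = [
    ["First paragraph here.", "It has two sentences."],
    ["Second paragraph starts here.", "This one also has multiple sentences.", "And a third."],
]

flat = flatten_paragraphs(nested)
print(flat)
```

This is handy if you already hold a nested result and later decide you only need the sentences, since it avoids segmenting the text a second time.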
Direct Segmenter Access
For more control, use the Segmenter class directly:
from fast_sentence_segment import Segmenter
segmenter = Segmenter()
results = segmenter.input_text("Your text here.")
API Reference
| Function | Parameters | Returns | Description |
|---|---|---|---|
| segment_text() | input_text: str, flatten: bool = False | list | Main entry point for segmentation |
| Segmenter.input_text() | input_text: str | list[list[str]] | Cached paragraph-aware segmentation |
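The "cached" behaviour can be pictured with functools.lru_cache from the standard library: strings are hashable, so a repeated call with the same text returns the stored result instead of recomputing it. The following is only a sketch of that idea. naive_split is a hypothetical regex-based stand-in, not the package's spaCy-based logic, and the cache size shown is an assumption.

```python
import re
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=1024)  # maxsize is an assumed value, not the package's setting
def naive_split(text: str) -> Tuple[Tuple[str, ...], ...]:
    """Hypothetical stand-in: paragraphs split on blank lines,
    sentences split after '.', '!' or '?' followed by whitespace."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return tuple(
        tuple(s for s in re.split(r"(?<=[.!?])\s+", p) if s)
        for p in paragraphs
    )


text = "One sentence. Another one?\n\nA new paragraph."
first = naive_split(text)   # computed and stored in the cache
second = naive_split(text)  # same input, so served from the cache
info = naive_split.cache_info()
print(info.hits, info.misses)
```

The stand-in returns tuples so that the cached value cannot be mutated by a caller. Unlike the package, this naive splitter would break on abbreviations such as "Dr.", which is exactly the case the spaCy-based segmenter is meant to handle.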
Why Nested Lists?
Segmentation preserves document structure by splitting the text into both paragraphs and sentences. The outer list represents the document, and each inner list holds the sentences of one paragraph. This is useful for:
- Document structure analysis
- Paragraph-level processing
- Maintaining original text organization
Use flatten=True when you only need sentences without paragraph context.
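As an illustration of paragraph-level processing, the sketch below walks a nested result and summarises each paragraph. The function paragraph_stats is a hypothetical helper written for this example, and the input is the nested output from the Basic Segmentation example rather than a live call to the library.

```python
from typing import Dict, List, Union


def paragraph_stats(paragraphs: List[List[str]]) -> List[Dict[str, Union[int, str]]]:
    """Summarise each paragraph: its index, sentence count, and first sentence."""
    return [
        {"paragraph": i, "sentences": len(sents), "lead": sents[0] if sents else ""}
        for i, sents in enumerate(paragraphs)
    ]


# Nested output copied from the Basic Segmentation example.
nested = [
    ["First paragraph here.", "It has two sentences."],
    ["Second paragraph starts here.", "This one also has multiple sentences.", "And a third."],
]

for row in paragraph_stats(nested):
    print(row)
```

With a flattened list this grouping would be lost, which is why the nested form is the default.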
Requirements
- Python 3.8.5+
- spaCy 3.5.3
- en_core_web_sm spaCy model
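A quick way to confirm these requirements before first use is to probe the environment with the standard library. The sketch below is an assumption-laden convenience script, not something shipped with the package. check_environment is a hypothetical name, and it only checks that spaCy and the model are importable, not which spaCy version is installed.

```python
import sys
from importlib.util import find_spec
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether the interpreter and packages listed under Requirements are present."""
    return {
        "python_ok": sys.version_info >= (3, 8, 5),
        # find_spec returns None when a top-level package cannot be found.
        "spacy_installed": find_spec("spacy") is not None,
        # spaCy models are installed as regular importable packages.
        "model_installed": find_spec("en_core_web_sm") is not None,
    }


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

If model_installed reports "no", run the spaCy download command from the Installation section.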
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Run tests (make test)
- Commit your changes
- Push to the branch
- Open a Pull Request
File details
Details for the file fast_sentence_segment-1.1.8.tar.gz.
File metadata
- Download URL: fast_sentence_segment-1.1.8.tar.gz
- Upload date:
- Size: 9.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6991ef7fca8cb9d40c6139c4926f9d7500acd0e288f0b23468a588d9d7aa46fd |
| MD5 | ed073ef0dea58714a0c165e195ae5579 |
| BLAKE2b-256 | 856fd8e0e98a0aa91e18a84c6aea4fa85c855620863b2a89c1bc8c84f61080c1 |
File details
Details for the file fast_sentence_segment-1.1.8-py3-none-any.whl.
File metadata
- Download URL: fast_sentence_segment-1.1.8-py3-none-any.whl
- Upload date:
- Size: 13.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 166093d743d74484a2634b4b9c852700f6a86b91286add1992de5f200ad4e33b |
| MD5 | c2598e337f1025bc6049cd37b37e355b |
| BLAKE2b-256 | eb28716817f107f8420a90f318bebfdf79f1a5e46e7267ad67ca78fe7a4d696e |