A simple tool to split text.
phrasplit
A Python library for splitting text into sentences, clauses, or paragraphs using spaCy NLP. Designed for audiobook creation and text-to-speech processing.
Features
- Sentence splitting: Intelligent sentence boundary detection using spaCy
- Clause splitting: Split sentences at commas for natural pause points
- Paragraph splitting: Split text at double newlines
- Long line splitting: Break long lines at sentence/clause boundaries
- Abbreviation handling: Correctly handles Mr., Dr., U.S.A., etc.
- Ellipsis support: Preserves ellipses without incorrect splitting
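To illustrate why abbreviation handling matters, here is a minimal sketch of a naive splitter built only on the standard library. It is not how phrasplit works; `naive_split` is a hypothetical helper that shows the failure mode a model-based splitter is meant to avoid.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (deliberately naive)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
print(naive_split(text))
# ['Dr.', 'Smith is here.', 'She has a Ph.D.', 'in Chemistry.']
```

The naive rule breaks after "Dr." and "Ph.D.", producing four fragments instead of the two sentences a reader would expect.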
Installation
pip install phrasplit
You'll also need to download a spaCy language model:
python -m spacy download en_core_web_sm
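Models downloaded this way are installed as ordinary Python packages, so a script can check for one before calling into the library. The sketch below uses only the standard library; `model_installed` is a hypothetical helper and not part of phrasplit.

```python
import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a package with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if not model_installed():
    print("Model missing. Run: python -m spacy download en_core_web_sm")
```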
Quick Start
Python API
from phrasplit import split_sentences, split_clauses, split_paragraphs, split_long_lines
# Split text into sentences
text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
sentences = split_sentences(text)
# ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.']
# Split sentences into comma-separated parts (for audiobook pauses)
text = "I like coffee, and I like tea."
clauses = split_clauses(text)
# ['I like coffee,', 'and I like tea.']
# Split text into paragraphs
text = "First paragraph.\n\nSecond paragraph."
paragraphs = split_paragraphs(text)
# ['First paragraph.', 'Second paragraph.']
# Split long lines at natural boundaries
text = "This is a very long sentence that needs to be split."
lines = split_long_lines(text, max_length=30)
Command Line Interface
# Split into sentences
phrasplit sentences input.txt -o output.txt
# Split into clauses
phrasplit clauses input.txt -o output.txt
# Split into paragraphs
phrasplit paragraphs input.txt -o output.txt
# Split long lines (default max 80 characters)
phrasplit longlines input.txt -o output.txt --max-length 60
# Use a different spaCy model
phrasplit sentences input.txt --model en_core_web_lg
# Read from stdin (pipe or redirect)
echo "Hello world. This is a test." | phrasplit sentences
cat input.txt | phrasplit clauses -o output.txt
# Explicit stdin with dash
phrasplit sentences - < input.txt
API Reference
split_sentences(text, language_model="en_core_web_sm")
Split text into sentences using spaCy's sentence boundary detection.
Parameters:
text: Input text string
language_model: spaCy model to use (default: "en_core_web_sm")
Returns: List of sentences
split_clauses(text, language_model="en_core_web_sm")
Split text into comma-separated parts. Useful for creating natural pause points in audiobook/TTS applications.
Parameters:
text: Input text string
language_model: spaCy model to use (default: "en_core_web_sm")
Returns: List of clauses (comma stays at end of each part)
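The "comma stays at end of each part" behaviour can be approximated with a single regular expression. This is a standard-library sketch for a single sentence, assuming a split wherever a comma is followed by whitespace. It is not the library's implementation, and `split_at_commas` is a hypothetical name.

```python
import re


def split_at_commas(sentence: str) -> list[str]:
    """Split after each comma that is followed by whitespace, keeping the comma."""
    parts = re.split(r"(?<=,)\s+", sentence.strip())
    return [part for part in parts if part]


print(split_at_commas("I like coffee, and I like tea."))
# ['I like coffee,', 'and I like tea.']
```

Because the pattern requires whitespace after the comma, numbers such as "1,000" are left intact.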
split_paragraphs(text)
Split text into paragraphs at double newlines.
Parameters:
text: Input text string
Returns: List of paragraphs
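Splitting at double newlines needs no NLP model. A minimal standard-library sketch of the idea follows; `split_on_blank_lines` is a hypothetical stand-in, and details such as stripping whitespace and dropping empty paragraphs are assumptions rather than documented behaviour of split_paragraphs.

```python
import re


def split_on_blank_lines(text: str) -> list[str]:
    """Split on runs of blank lines and drop empty results."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


print(split_on_blank_lines("First paragraph.\n\nSecond paragraph."))
# ['First paragraph.', 'Second paragraph.']
```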
split_long_lines(text, max_length, language_model="en_core_web_sm")
Split lines exceeding max_length at sentence/clause boundaries.
Parameters:
text: Input text string
max_length: Maximum line length in characters (must be >= 1)
language_model: spaCy model to use (default: "en_core_web_sm")
Returns: List of lines, each at most max_length characters long; a single word longer than max_length is the only exception
Raises: ValueError if max_length is less than 1
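The length contract described above (a ValueError for limits below 1, and overlong single words as the only exception) can be illustrated with a greedy word packer. This sketch works at word level only and uses no NLP, whereas split_long_lines is described as preferring sentence and clause boundaries. `pack_words` is a hypothetical helper, not the library's algorithm.

```python
def pack_words(text: str, max_length: int) -> list[str]:
    """Greedily pack whitespace-separated words into lines of at most max_length."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # A word that is too long on its own still gets a line to itself.
        if len(candidate) <= max_length or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


print(pack_words("This is a very long sentence that needs to be split.", 30))
# ['This is a very long sentence', 'that needs to be split.']
```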
Use Cases
Audiobook Creation
Split text at commas to create natural pause points for text-to-speech:
from phrasplit import split_clauses
text = "When the sun rose, the birds began to sing, and the day started."
parts = split_clauses(text)
# ['When the sun rose,', 'the birds began to sing,', 'and the day started.']
Subtitle Generation
Split long lines to fit subtitle constraints:
from phrasplit import split_long_lines
text = "This is a very long sentence that would not fit on a single subtitle line."
lines = split_long_lines(text, max_length=42)
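Since a single word longer than the limit can still exceed max_length, a subtitle pipeline may want to flag such lines before rendering. The check below is a small standard-library sketch; `find_overlong` is a hypothetical helper and the sample list stands in for the output of split_long_lines.

```python
def find_overlong(lines: list[str], max_length: int) -> list[tuple[int, int]]:
    """Return (index, length) for every line longer than max_length."""
    return [
        (index, len(line))
        for index, line in enumerate(lines)
        if len(line) > max_length
    ]


# Stand-in data; in practice this would be the list returned by split_long_lines.
sample = ["A short subtitle line.", "x" * 50]
for index, length in find_overlong(sample, max_length=42):
    print(f"line {index} has {length} characters")
```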
Text Processing Pipelines
from phrasplit import split_paragraphs, split_sentences
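# Read the whole book; the context manager closes the file afterwards
with open("book.txt", encoding="utf-8") as f:
    text = f.read()
for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        process(sentence)  # process() is a placeholder for your own handler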
Requirements
- Python 3.9+
- spaCy 3.5+
- click 8.0+
- rich 13.0+
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Download files
File details
Details for the file phrasplit-0.1.2.tar.gz.
File metadata
- Download URL: phrasplit-0.1.2.tar.gz
- Upload date:
- Size: 81.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ba2903add3b96c0bc520b618d5a3306f31b4c1e2a7705d8e44fae8a5ebee49f |
| MD5 | 784bb83515c523e0e6d60f0a4972a64d |
| BLAKE2b-256 | c8dfc16fa85350dffef01cb7873a255f3f4f289469fd75f1d1bef55be9731491 |
File details
Details for the file phrasplit-0.1.2-py3-none-any.whl.
File metadata
- Download URL: phrasplit-0.1.2-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce1db5f00646760a555aa43880ffcaad1ba99af9a2ca12c5ed578bf68a933ec9 |
| MD5 | ade52061fee7091ef23646aa5db0f6af |
| BLAKE2b-256 | 120b1f8d0c58a7309ece0682f38f9920d9966630a19e4d43028438b5d70d52ff |