A simple tool to split text.

phrasplit

A Python library for splitting text into sentences, clauses, or paragraphs using spaCy NLP. Designed for audiobook creation and text-to-speech processing.

Features

Sentence splitting: Intelligent sentence boundary detection using spaCy
Clause splitting: Split sentences at commas for natural pause points
Paragraph splitting: Split text at double newlines
Hierarchical splitting: Split text with paragraph/sentence position tracking
Long line splitting: Break long lines at sentence/clause boundaries
Abbreviation handling: Correctly handles Mr., Dr., U.S.A., etc.
Ellipsis support: Preserves ellipses without incorrect splitting
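To see why abbreviation handling matters, here is a minimal sketch of a naive splitter that breaks after any sentence-ending punctuation followed by whitespace. It uses only the standard library and is not how phrasplit works internally; `naive_split` is a hypothetical helper written for illustration.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


parts = naive_split("Dr. Smith is here. She has a Ph.D. in Chemistry.")
print(parts)
# ['Dr.', 'Smith is here.', 'She has a Ph.D.', 'in Chemistry.']
```

The naive version cuts after "Dr." and "Ph.D.", which is the kind of error the abbreviation handling listed above is meant to avoid.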

Installation

pip install phrasplit

You'll also need to download a spaCy language model:

python -m spacy download en_core_web_sm

Quick Start

Python API

from phrasplit import split_sentences, split_clauses, split_paragraphs, split_long_lines

# Split text into sentences
text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
sentences = split_sentences(text)
# ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.']

# Split sentences into comma-separated parts (for audiobook pauses)
text = "I like coffee, and I like tea."
clauses = split_clauses(text)
# ['I like coffee,', 'and I like tea.']

# Split text into paragraphs
text = "First paragraph.\n\nSecond paragraph."
paragraphs = split_paragraphs(text)
# ['First paragraph.', 'Second paragraph.']

# Split long lines at natural boundaries
text = "This is a very long sentence that needs to be split."
lines = split_long_lines(text, max_length=30)

Hierarchical Splitting with Position Tracking

For audiobook generation where you need different pause lengths between paragraphs, sentences, and clauses, use split_text():

from phrasplit import split_text, Segment

# Split into sentences with paragraph tracking
text = "First sentence. Second sentence.\n\nNew paragraph here."
segments = split_text(text, mode="sentence")

for seg in segments:
    print(f"P{seg.paragraph} S{seg.sentence}: {seg.text}")
# P0 S0: First sentence.
# P0 S1: Second sentence.
# P1 S0: New paragraph here.

# Detect paragraph changes for longer pauses
for i, seg in enumerate(segments):
    if i > 0 and seg.paragraph != segments[i-1].paragraph:
        print("--- paragraph break (add longer pause) ---")
    print(seg.text)

Available modes:

"paragraph": Returns paragraphs (sentence=None)
"sentence": Returns sentences with paragraph index
"clause": Returns clauses with paragraph and sentence indices

Command Line Interface

# Split into sentences
phrasplit sentences input.txt -o output.txt

# Split into clauses
phrasplit clauses input.txt -o output.txt

# Split into paragraphs
phrasplit paragraphs input.txt -o output.txt

# Split long lines (default max 80 characters)
phrasplit longlines input.txt -o output.txt --max-length 60

# Use a different spaCy model
phrasplit sentences input.txt --model en_core_web_lg

# Read from stdin (pipe or redirect)
echo "Hello world. This is a test." | phrasplit sentences
cat input.txt | phrasplit clauses -o output.txt

# Explicit stdin with dash
phrasplit sentences - < input.txt
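If you drive the CLI from a Python script, one option is to assemble the argument list first and hand it to `subprocess.run`. The sketch below only builds the command and does not execute it. `build_command` is a hypothetical helper, and it assumes the `-o`, `--max-length`, and `--model` flags behave as in the examples above; it does not check which flags each subcommand actually accepts.

```python
import shlex

# Subcommands taken from the CLI examples above
MODES = {"sentences", "clauses", "paragraphs", "longlines"}


def build_command(mode, input_path="-", output_path=None, max_length=None, model=None):
    """Build an argument list for the phrasplit CLI ('-' means stdin)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["phrasplit", mode, input_path]
    if output_path is not None:
        cmd += ["-o", output_path]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    if model is not None:
        cmd += ["--model", model]
    return cmd


cmd = build_command("longlines", "input.txt", "output.txt", max_length=60)
print(shlex.join(cmd))
# phrasplit longlines input.txt -o output.txt --max-length 60
```

With phrasplit installed, the list can be passed to `subprocess.run(cmd, check=True)`.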

API Reference

`split_sentences(text, language_model="en_core_web_sm", apply_corrections=True, split_on_colon=True)`

Split text into sentences using spaCy's sentence boundary detection.

Parameters:

text: Input text string
language_model: spaCy model to use (default: "en_core_web_sm")
apply_corrections: Apply post-processing corrections for URLs and abbreviations (default: True)
split_on_colon: Treat colons as sentence terminators (default: True)

Returns: List of sentences

`split_clauses(text, language_model="en_core_web_sm")`

Split text into comma-separated parts. Useful for creating natural pause points in audiobook/TTS applications.

Parameters:

text: Input text string
language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of clauses (comma stays at end of each part)
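The "comma stays at end of each part" behavior can be approximated with a regular expression. This is a rough standard-library sketch, not the library's implementation (which goes through spaCy); `split_at_commas` is a hypothetical name.

```python
import re


def split_at_commas(sentence: str) -> list[str]:
    """Split on whitespace that follows a comma; the comma stays with the left part."""
    return [part for part in re.split(r"(?<=,)\s+", sentence.strip()) if part]


clause_parts = split_at_commas("I like coffee, and I like tea.")
print(clause_parts)
# ['I like coffee,', 'and I like tea.']
```

Because the split needs whitespace after the comma, a number such as "1,000" is left intact in this sketch.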

`split_paragraphs(text)`

Split text into paragraphs at double newlines.

Parameters:

text: Input text string

Returns: List of paragraphs
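As an illustration of splitting at double newlines, here is a small standard-library sketch. It assumes that lines containing only whitespace also count as paragraph breaks and that surrounding whitespace is stripped; the library's exact rules may differ, and `split_on_blank_lines` is a hypothetical name.

```python
import re


def split_on_blank_lines(text: str) -> list[str]:
    """Split at blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


paras = split_on_blank_lines("First paragraph.\n\nSecond paragraph.")
print(paras)
# ['First paragraph.', 'Second paragraph.']
```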

`split_text(text, mode="sentence", language_model="en_core_web_sm", apply_corrections=True, split_on_colon=True)`

Split text into segments with hierarchical position information.

Parameters:

text: Input text string
mode: Splitting mode - "paragraph", "sentence", or "clause"
language_model: spaCy model to use (default: "en_core_web_sm")
apply_corrections: Apply post-processing corrections (default: True)
split_on_colon: Treat colons as sentence terminators (default: True)

Returns: List of Segment namedtuples with fields:

text: The segment text
paragraph: Paragraph index (0-based)
sentence: Sentence index within paragraph (0-based), None for paragraph mode
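Because each segment carries its paragraph index, segments can be regrouped without re-parsing the text. The sketch below uses a stand-in `Segment` namedtuple with the three documented fields so that it runs without phrasplit installed; `group_by_paragraph` is a hypothetical helper, and it assumes the segments arrive in document order.

```python
from collections import namedtuple
from itertools import groupby

# Stand-in with the documented field names; the real class is imported from phrasplit
Segment = namedtuple("Segment", ["text", "paragraph", "sentence"])


def group_by_paragraph(segments):
    """Return {paragraph_index: [segment texts]} for segments in document order."""
    return {
        index: [seg.text for seg in group]
        for index, group in groupby(segments, key=lambda seg: seg.paragraph)
    }


segments = [
    Segment("First sentence.", 0, 0),
    Segment("Second sentence.", 0, 1),
    Segment("New paragraph here.", 1, 0),
]
grouped = group_by_paragraph(segments)
print(grouped)
# {0: ['First sentence.', 'Second sentence.'], 1: ['New paragraph here.']}
```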

`split_long_lines(text, max_length, language_model="en_core_web_sm")`

Split lines exceeding max_length at sentence/clause boundaries.

Parameters:

text: Input text string
max_length: Maximum line length in characters (must be >= 1)
language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of lines, each at most max_length characters long (a single word longer than max_length is kept whole and exceeds the limit)

Raises: ValueError if max_length is less than 1
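The length constraint can be pictured as greedy packing: keep appending pieces to the current line until the next one would push it past max_length. The sketch below packs individual words for simplicity, whereas the library splits at sentence and clause boundaries, so its output will generally differ. `pack_pieces` is a hypothetical helper, not part of phrasplit.

```python
def pack_pieces(pieces, max_length):
    """Greedily join pieces with single spaces into lines of at most max_length."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    lines, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_length:
            current = candidate
        else:
            if current:
                lines.append(current)
            # A piece longer than max_length is kept whole on its own line
            current = piece
    if current:
        lines.append(current)
    return lines


words = "This is a very long sentence that needs to be split.".split()
packed = pack_pieces(words, 30)
print(packed)
# ['This is a very long sentence', 'that needs to be split.']
```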

Use Cases

Audiobook Creation

Split text with paragraph awareness for different pause lengths:

from phrasplit import split_text

text = """When the sun rose, the birds began to sing.

A new day had started. The adventure continues."""

segments = split_text(text, mode="clause")

# add_pause() and synthesize_speech() are placeholders for your own TTS code
for i, seg in enumerate(segments):
    if i > 0:  # No pause before the very first segment
        prev = segments[i - 1]
        if seg.paragraph != prev.paragraph:
            add_pause(duration=1.0)  # Long pause for paragraph
        elif seg.sentence != prev.sentence:
            add_pause(duration=0.5)  # Medium pause for sentence
        else:
            add_pause(duration=0.2)  # Short pause for clause

    synthesize_speech(seg.text)
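The pause decision can also be pulled into a small pure function, which makes it easy to test without a speech engine. This is a sketch under assumptions: `Segment` is a stand-in namedtuple with the documented fields, `pause_before` is a hypothetical helper, and the sample segments are written by hand rather than produced by the library.

```python
from collections import namedtuple

# Stand-in with the documented field names; the real class is imported from phrasplit
Segment = namedtuple("Segment", ["text", "paragraph", "sentence"])


def pause_before(prev, cur, paragraph=1.0, sentence=0.5, clause=0.2):
    """Return the pause in seconds to insert before `cur`, given the previous segment."""
    if prev is None:
        return 0.0  # Nothing precedes the first segment
    if cur.paragraph != prev.paragraph:
        return paragraph
    if cur.sentence != prev.sentence:
        return sentence
    return clause


# Hand-written sample in clause mode (illustrative, not library output)
sample = [
    Segment("When the sun rose,", 0, 0),
    Segment("the birds began to sing.", 0, 0),
    Segment("A new day had started.", 1, 0),
    Segment("The adventure continues.", 1, 1),
]
pauses = [pause_before(prev, cur) for prev, cur in zip([None] + sample[:-1], sample)]
print(pauses)
# [0.0, 0.2, 1.0, 0.5]
```

Checking the paragraph index first matters: sentence indices restart in each paragraph, so two neighboring segments can share a sentence index while belonging to different paragraphs.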

Subtitle Generation

Split long lines to fit subtitle constraints:

from phrasplit import split_long_lines

text = "This is a very long sentence that would not fit on a single subtitle line."
lines = split_long_lines(text, max_length=42)

Text Processing Pipelines

from phrasplit import split_paragraphs, split_sentences

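# Read the whole book; adjust the encoding if your file is not UTF-8
with open("book.txt", encoding="utf-8") as f:
    text = f.read()

for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        process(sentence)  # process() is a placeholder for your own function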

Requirements

Python 3.9+
spaCy 3.5+
click 8.0+
rich 13.0+

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Release history

0.3.2: Jan 23, 2026
0.3.1: Jan 23, 2026
0.3.0: Jan 19, 2026
0.2.2: Jan 11, 2026
0.2.1: Jan 11, 2026
0.2.0: Dec 30, 2025 (this version)
0.1.2: Dec 29, 2025
0.1.1: Dec 29, 2025
0.1.0: Dec 27, 2025

Download files

Source Distribution

phrasplit-0.2.0.tar.gz (99.6 kB)

Uploaded Dec 30, 2025 Source

Built Distribution

phrasplit-0.2.0-py3-none-any.whl (22.4 kB)

Uploaded Dec 30, 2025 Python 3

File details

Details for the file phrasplit-0.2.0.tar.gz.

File metadata

Download URL: phrasplit-0.2.0.tar.gz
Upload date: Dec 30, 2025
Size: 99.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for phrasplit-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a56bb71ccf061f531ac2ac33a5d431ac51f74658f3e07b5dfa7b974c352fe389`
MD5	`05226f15a1fef173c479eb102b7ce8ec`
BLAKE2b-256	`2ca191a472d1eea18a105e2cc413cf6926b59fb7c5c871be9e17e9d1dce081d6`

File details

Details for the file phrasplit-0.2.0-py3-none-any.whl.

File metadata

Download URL: phrasplit-0.2.0-py3-none-any.whl
Upload date: Dec 30, 2025
Size: 22.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for phrasplit-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4660cd4eb33b437a1b5e8f04d00145735920079dc0ef9c5434a69d9acb6e6d1b`
MD5	`0ed865ce34ec848a87f24e14f194f080`
BLAKE2b-256	`99c58ea60c619b7f5b58fa697b1c6a056324ac8abc9469171d84af65a9ba3b39`
