A simple tool to split text.

phrasplit

A Python library for splitting text into sentences, clauses, or paragraphs using spaCy NLP. Designed for audiobook creation and text-to-speech processing.

Features

  • Sentence splitting: Intelligent sentence boundary detection using spaCy
  • Clause splitting: Split sentences at commas for natural pause points
  • Paragraph splitting: Split text at double newlines
  • Hierarchical splitting: Split text with paragraph/sentence position tracking
  • Long line splitting: Break long lines at sentence/clause boundaries
  • Abbreviation handling: Correctly handles Mr., Dr., U.S.A., etc.
  • Ellipsis support: Preserves ellipses without incorrect splitting

Installation

pip install phrasplit

You'll also need to download a spaCy language model:

python -m spacy download en_core_web_sm

Quick Start

Python API

from phrasplit import split_sentences, split_clauses, split_paragraphs, split_long_lines

# Split text into sentences
text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
sentences = split_sentences(text)
# ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.']

# Split sentences into comma-separated parts (for audiobook pauses)
text = "I like coffee, and I like tea."
clauses = split_clauses(text)
# ['I like coffee,', 'and I like tea.']

# Split text into paragraphs
text = "First paragraph.\n\nSecond paragraph."
paragraphs = split_paragraphs(text)
# ['First paragraph.', 'Second paragraph.']

# Split long lines at natural boundaries
text = "This is a very long sentence that needs to be split."
lines = split_long_lines(text, max_length=30)

Hierarchical Splitting with Position Tracking

For audiobook generation where you need different pause lengths between paragraphs, sentences, and clauses, use split_text():

from phrasplit import split_text, Segment

# Split into sentences with paragraph tracking
text = "First sentence. Second sentence.\n\nNew paragraph here."
segments = split_text(text, mode="sentence")

for seg in segments:
    print(f"P{seg.paragraph} S{seg.sentence}: {seg.text}")
# P0 S0: First sentence.
# P0 S1: Second sentence.
# P1 S0: New paragraph here.

# Detect paragraph changes for longer pauses
for i, seg in enumerate(segments):
    if i > 0 and seg.paragraph != segments[i-1].paragraph:
        print("--- paragraph break (add longer pause) ---")
    print(seg.text)

Available modes:

  • "paragraph": Returns paragraphs (sentence=None)
  • "sentence": Returns sentences with paragraph index
  • "clause": Returns clauses with paragraph and sentence indices

Command Line Interface

# Split into sentences
phrasplit sentences input.txt -o output.txt

# Split into clauses
phrasplit clauses input.txt -o output.txt

# Split into paragraphs
phrasplit paragraphs input.txt -o output.txt

# Split long lines (default max 80 characters)
phrasplit longlines input.txt -o output.txt --max-length 60

# Use a different spaCy model
phrasplit sentences input.txt --model en_core_web_lg

# Read from stdin (pipe or redirect)
echo "Hello world. This is a test." | phrasplit sentences
cat input.txt | phrasplit clauses -o output.txt

# Explicit stdin with dash
phrasplit sentences - < input.txt

API Reference

split_sentences(text, language_model="en_core_web_sm", apply_corrections=True, split_on_colon=True)

Split text into sentences using spaCy's sentence boundary detection.

Parameters:

  • text: Input text string
  • language_model: spaCy model to use (default: "en_core_web_sm")
  • apply_corrections: Apply post-processing corrections for URLs and abbreviations (default: True)
  • split_on_colon: Treat colons as sentence terminators (default: True)

Returns: List of sentences

split_clauses(text, language_model="en_core_web_sm")

Split text into comma-separated parts. Useful for creating natural pause points in audiobook/TTS applications.

Parameters:

  • text: Input text string
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of clauses (comma stays at end of each part)
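
To make the "comma stays at end of each part" behaviour concrete, here is a minimal standard-library sketch. It is not phrasplit's implementation (the library uses spaCy); `split_at_commas` is a hypothetical helper that only mimics the output shape with a regular expression.

```python
import re


def split_at_commas(sentence: str) -> list[str]:
    """Split a sentence after each comma, keeping the comma on the left part.

    Hypothetical stand-in for illustration only; it ignores the linguistic
    analysis that a spaCy-based splitter can rely on.
    """
    # The lookbehind matches whitespace that follows a comma without
    # consuming the comma itself.
    parts = re.split(r"(?<=,)\s+", sentence.strip())
    return [part for part in parts if part]


print(split_at_commas("I like coffee, and I like tea."))
# ['I like coffee,', 'and I like tea.']
```

Each part can then be sent to a TTS engine separately, with a short pause wherever a part ends in a comma.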

split_paragraphs(text)

Split text into paragraphs at double newlines.

Parameters:

  • text: Input text string

Returns: List of paragraphs
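
As a rough illustration of what "splitting at double newlines" means, the sketch below uses only the standard library. `split_on_blank_lines` is a hypothetical name, and treating whitespace-only lines as paragraph breaks is an assumption of this sketch, not documented phrasplit behaviour.

```python
import re


def split_on_blank_lines(text: str) -> list[str]:
    """Split text wherever one or more blank lines occur.

    Illustrative stand-in only: empty chunks are dropped and surrounding
    whitespace is trimmed from each paragraph.
    """
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


paragraphs = split_on_blank_lines("First paragraph.\n\nSecond paragraph.\n\n\nThird.")
print(paragraphs)
# ['First paragraph.', 'Second paragraph.', 'Third.']
```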

split_text(text, mode="sentence", language_model="en_core_web_sm", apply_corrections=True, split_on_colon=True)

Split text into segments with hierarchical position information.

Parameters:

  • text: Input text string
  • mode: Splitting mode - "paragraph", "sentence", or "clause"
  • language_model: spaCy model to use (default: "en_core_web_sm")
  • apply_corrections: Apply post-processing corrections (default: True)
  • split_on_colon: Treat colons as sentence terminators (default: True)

Returns: List of Segment namedtuples with fields:

  • text: The segment text
  • paragraph: Paragraph index (0-based)
  • sentence: Sentence index within paragraph (0-based), None for paragraph mode
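
The position fields make it easy to regroup segments after splitting. The sketch below assumes segments arrive in document order and uses a local stand-in namedtuple with the same three field names, so it runs without phrasplit installed. `group_by_paragraph` is a hypothetical helper, not part of the library.

```python
from collections import namedtuple
from itertools import groupby

# Stand-in with the same field names as the documented Segment namedtuple.
Segment = namedtuple("Segment", ["text", "paragraph", "sentence"])

# Hand-written values matching the sentence-mode example earlier in this README.
segments = [
    Segment("First sentence.", 0, 0),
    Segment("Second sentence.", 0, 1),
    Segment("New paragraph here.", 1, 0),
]


def group_by_paragraph(segments):
    """Return {paragraph_index: [segment texts]}, preserving document order."""
    return {
        index: [seg.text for seg in group]
        for index, group in groupby(segments, key=lambda seg: seg.paragraph)
    }


print(group_by_paragraph(segments))
# {0: ['First sentence.', 'Second sentence.'], 1: ['New paragraph here.']}
```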

split_long_lines(text, max_length, language_model="en_core_web_sm")

Split lines exceeding max_length at sentence/clause boundaries.

Parameters:

  • text: Input text string
  • max_length: Maximum line length in characters (must be >= 1)
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of lines, each at most max_length characters long (a single word longer than the limit is kept whole and may exceed it)

Raises: ValueError if max_length is less than 1
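
The length contract above can be pictured with a greedy packing sketch. This is not phrasplit's algorithm: the library splits at sentence and clause boundaries using spaCy, whereas the hypothetical `pack_chunks` below simply joins pre-split chunks (here, words) until the next one would overflow the limit.

```python
def pack_chunks(chunks: list[str], max_length: int) -> list[str]:
    """Greedily join chunks with spaces so each line stays within max_length.

    A chunk that is longer than max_length on its own is emitted unchanged,
    mirroring the "single words exceeding limit" exception described above.
    """
    if max_length < 1:
        raise ValueError("max_length must be at least 1")

    lines: list[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current} {chunk}" if current else chunk
        if len(candidate) <= max_length:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = chunk  # may exceed max_length if the chunk alone is too long
    if current:
        lines.append(current)
    return lines


words = "This is a very long sentence that needs to be split.".split()
print(pack_chunks(words, max_length=30))
# ['This is a very long sentence', 'that needs to be split.']
```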

Use Cases

Audiobook Creation

Split text with paragraph awareness for different pause lengths:

from phrasplit import split_text

text = """When the sun rose, the birds began to sing.

A new day had started. The adventure continues."""

segments = split_text(text, mode="clause")

# add_pause() and synthesize_speech() are placeholders for your own TTS backend
for i, seg in enumerate(segments):
    if i == 0:
        pass  # No pause before the very first segment
    elif seg.paragraph != segments[i-1].paragraph:
        add_pause(duration=1.0)  # Long pause between paragraphs
    elif seg.sentence != segments[i-1].sentence:
        add_pause(duration=0.5)  # Medium pause between sentences
    else:
        add_pause(duration=0.2)  # Short pause between clauses

    synthesize_speech(seg.text)
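
The pause logic can also be factored into a pure function that is easy to test without a TTS engine. The sketch below is one possible arrangement: `plan_pauses` is a hypothetical helper, the `Segment` namedtuple is a local stand-in with the documented field names, and the segment values are hand-written guesses at what clause mode might return for the text above.

```python
from collections import namedtuple

# Stand-in with the same field names as the documented Segment namedtuple.
Segment = namedtuple("Segment", ["text", "paragraph", "sentence"])


def plan_pauses(segments, paragraph=1.0, sentence=0.5, clause=0.2):
    """Return (pause_before_seconds, text) pairs; the first pause is 0.0."""
    plan = []
    previous = None
    for seg in segments:
        if previous is None:
            pause = 0.0  # nothing precedes the first segment
        elif seg.paragraph != previous.paragraph:
            pause = paragraph
        elif seg.sentence != previous.sentence:
            pause = sentence
        else:
            pause = clause
        plan.append((pause, seg.text))
        previous = seg
    return plan


# Hand-written example data (assumed, not produced by running phrasplit).
segments = [
    Segment("When the sun rose,", 0, 0),
    Segment("the birds began to sing.", 0, 0),
    Segment("A new day had started.", 1, 0),
    Segment("The adventure continues.", 1, 1),
]

for pause, text in plan_pauses(segments):
    print(f"{pause:.1f}s  {text}")
```

Keeping the plan separate from synthesis lets you adjust pause lengths, or serialize the plan, before any audio is generated.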

Subtitle Generation

Split long lines to fit subtitle constraints:

from phrasplit import split_long_lines

text = "This is a very long sentence that would not fit on a single subtitle line."
lines = split_long_lines(text, max_length=42)

Text Processing Pipelines

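from phrasplit import split_paragraphs, split_sentences

# Read the input with an explicit encoding and close the file when done
with open("book.txt", encoding="utf-8") as f:
    text = f.read()

# process() is a placeholder for your own per-sentence handling
for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        process(sentence)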

Requirements

  • Python 3.9+
  • spaCy 3.5+
  • click 8.0+
  • rich 13.0+

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
