A simple tool to split text.

phrasplit

A Python library for splitting text into sentences, clauses, or paragraphs using spaCy NLP. Designed for audiobook creation and text-to-speech processing.

Features

  • Sentence splitting: Intelligent sentence boundary detection using spaCy
  • Clause splitting: Split sentences at commas for natural pause points
  • Paragraph splitting: Split text at double newlines
  • Long line splitting: Break long lines at sentence/clause boundaries
  • Abbreviation handling: Correctly handles Mr., Dr., U.S.A., etc.
  • Ellipsis support: Preserves ellipses without incorrect splitting

Installation

pip install phrasplit

You'll also need to download a spaCy language model:

python -m spacy download en_core_web_sm

Quick Start

Python API

from phrasplit import split_sentences, split_clauses, split_paragraphs, split_long_lines

# Split text into sentences
text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
sentences = split_sentences(text)
# ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.']

# Split sentences into comma-separated parts (for audiobook pauses)
text = "I like coffee, and I like tea."
clauses = split_clauses(text)
# ['I like coffee,', 'and I like tea.']

# Split text into paragraphs
text = "First paragraph.\n\nSecond paragraph."
paragraphs = split_paragraphs(text)
# ['First paragraph.', 'Second paragraph.']

# Split long lines at natural boundaries
text = "This is a very long sentence that needs to be split."
lines = split_long_lines(text, max_length=30)

Command Line Interface

# Split into sentences
phrasplit sentences input.txt -o output.txt

# Split into clauses
phrasplit clauses input.txt -o output.txt

# Split into paragraphs
phrasplit paragraphs input.txt -o output.txt

# Split long lines (default max 80 characters)
phrasplit longlines input.txt -o output.txt --max-length 60

# Use a different spaCy model
phrasplit sentences input.txt --model en_core_web_lg

# Read from stdin (pipe or redirect)
echo "Hello world. This is a test." | phrasplit sentences
cat input.txt | phrasplit clauses -o output.txt

# Explicit stdin with dash
phrasplit sentences - < input.txt

API Reference

split_sentences(text, language_model="en_core_web_sm")

Split text into sentences using spaCy's sentence boundary detection.

Parameters:

  • text: Input text string
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of sentences
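
To see why model-based boundary detection matters, the sketch below runs a deliberately naive regex splitter over the Quick Start text. It is a standard-library illustration only, not how phrasplit works internally; `naive_split_sentences` is a hypothetical helper invented for this comparison.

```python
import re


def naive_split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (no NLP involved)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
naive = naive_split_sentences(text)
print(naive)
# ['Dr.', 'Smith is here.', 'She has a Ph.D.', 'in Chemistry.']
```

The naive version breaks after "Dr." and "Ph.D.", which is the failure mode that the abbreviation handling listed under Features is meant to avoid.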

split_clauses(text, language_model="en_core_web_sm")

Split text into comma-separated parts. Useful for creating natural pause points in audiobook/TTS applications.

Parameters:

  • text: Input text string
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of clauses (comma stays at end of each part)
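
The "comma stays at end of each part" behaviour can be pictured with a small standard-library sketch. This only illustrates the output shape; it does not use spaCy and is not the library's implementation, and `split_at_commas` is a hypothetical name.

```python
import re


def split_at_commas(sentence):
    """Split after each comma, leaving the comma attached to the left-hand part."""
    # The lookbehind keeps the comma; only the whitespace after it is consumed.
    parts = re.split(r"(?<=,)\s+", sentence.strip())
    return [part for part in parts if part]


parts = split_at_commas("I like coffee, and I like tea.")
print(parts)
# ['I like coffee,', 'and I like tea.']
```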

split_paragraphs(text)

Split text into paragraphs at double newlines.

Parameters:

  • text: Input text string

Returns: List of paragraphs
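
As a rough picture of splitting at double newlines, here is a minimal standard-library sketch. It assumes that lines containing only whitespace also count as paragraph breaks and that empty paragraphs are dropped; phrasplit's actual handling of those edge cases may differ. `split_on_blank_lines` is a hypothetical helper.

```python
import re


def split_on_blank_lines(text):
    """Split on one or more blank lines and drop empty paragraphs."""
    # "\n\s*\n" matches a newline, optional whitespace (including further
    # newlines), and a closing newline, so runs of blank lines act as one break.
    blocks = re.split(r"\n\s*\n", text)
    return [block.strip() for block in blocks if block.strip()]


paragraphs = split_on_blank_lines("First paragraph.\n\nSecond paragraph.")
print(paragraphs)
# ['First paragraph.', 'Second paragraph.']
```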

split_long_lines(text, max_length, language_model="en_core_web_sm")

Split lines exceeding max_length at sentence/clause boundaries.

Parameters:

  • text: Input text string
  • max_length: Maximum line length in characters
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of lines, each within max_length
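
One way to picture breaking at sentence/clause boundaries is a greedy packer: cut the text at punctuation, then fill each line with as many pieces as fit. The sketch below is a standard-library approximation under that assumption, not phrasplit's algorithm. `wrap_at_boundaries` is a hypothetical name, and the fallback to word boundaries for oversized pieces is a choice made for this example.

```python
import re


def wrap_at_boundaries(text, max_length):
    """Greedily pack sentence/clause pieces into lines of at most max_length."""
    # Candidate break points: whitespace that follows '.', '!', '?' or ','.
    pieces = [p for p in re.split(r"(?<=[.!?,])\s+", text.strip()) if p]
    lines, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_length:
            current = candidate
            continue
        if current:
            lines.append(current)
            current = ""
        # The piece does not fit on the current line: place it word by word.
        # A single word longer than max_length is kept whole on its own line.
        for word in piece.split():
            candidate = f"{current} {word}" if current else word
            if len(candidate) <= max_length or not current:
                current = candidate
            else:
                lines.append(current)
                current = word
    if current:
        lines.append(current)
    return lines


lines = wrap_at_boundaries("This is a very long sentence that needs to be split.", 30)
print(lines)
# ['This is a very long sentence', 'that needs to be split.']
```

Preferring punctuation boundaries before word boundaries keeps each line a readable unit, which matters more for speech and subtitles than filling every line to the limit.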

Use Cases

Audiobook Creation

Split text at commas to create natural pause points for text-to-speech:

from phrasplit import split_clauses

text = "When the sun rose, the birds began to sing, and the day started."
parts = split_clauses(text)
# ['When the sun rose,', 'the birds began to sing,', 'and the day started.']
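
If your text-to-speech engine accepts SSML, the parts returned above could be joined with explicit pauses. The sketch below assumes SSML input is supported and uses a 300 ms pause purely as an example value; `to_ssml` is a hypothetical helper and not part of phrasplit.

```python
from xml.sax.saxutils import escape


def to_ssml(parts, pause_ms=300):
    """Join text parts with SSML <break> elements inside a <speak> root."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so the text cannot break the surrounding markup.
    body = f" {pause} ".join(escape(part) for part in parts)
    return f"<speak>{body}</speak>"


parts = ['When the sun rose,', 'the birds began to sing,', 'and the day started.']
ssml = to_ssml(parts)
print(ssml)
```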

Subtitle Generation

Split long lines to fit subtitle constraints:

from phrasplit import split_long_lines

text = "This is a very long sentence that would not fit on a single subtitle line."
lines = split_long_lines(text, max_length=42)

Text Processing Pipelines

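Combine the splitters to walk a document paragraph by paragraph, then sentence by sentence:

from phrasplit import split_paragraphs, split_sentences

# Read the whole book; the context manager closes the file afterwards
with open("book.txt", encoding="utf-8") as f:
    text = f.read()

for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        process(sentence)  # process() is a placeholder for your own handler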

Requirements

  • Python 3.9+
  • spaCy 3.5+
  • click 8.0+
  • rich 13.0+

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
