A simple tool to split text.

phrasplit

A Python library for splitting text into sentences, clauses, or paragraphs using spaCy NLP. Designed for audiobook creation and text-to-speech processing.

Features

  • Sentence splitting: Intelligent sentence boundary detection using spaCy
  • Clause splitting: Split sentences at commas for natural pause points
  • Paragraph splitting: Split text at double newlines
  • Long line splitting: Break long lines at sentence/clause boundaries
  • Abbreviation handling: Correctly handles Mr., Dr., U.S.A., etc.
  • Ellipsis support: Preserves ellipses without incorrect splitting

Installation

pip install phrasplit

You'll also need to download a spaCy language model:

python -m spacy download en_core_web_sm
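Before calling the library, it can help to confirm that the model is actually installed. Models fetched with `spacy download` are installed as regular Python packages, so a standard-library lookup is enough. The helper name `model_installed` below is hypothetical and not part of phrasplit; treat this as a minimal sketch.

```python
import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package with this name can be imported."""
    # find_spec returns None when no package of that name is installed.
    return importlib.util.find_spec(name) is not None


if not model_installed():
    print("Model missing; run: python -m spacy download en_core_web_sm")
```

The same check works for any other model name you intend to pass as `language_model`.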

Quick Start

Python API

from phrasplit import split_sentences, split_clauses, split_paragraphs, split_long_lines

# Split text into sentences
text = "Dr. Smith is here. She has a Ph.D. in Chemistry."
sentences = split_sentences(text)
# ['Dr. Smith is here.', 'She has a Ph.D. in Chemistry.']

# Split sentences into comma-separated parts (for audiobook pauses)
text = "I like coffee, and I like tea."
clauses = split_clauses(text)
# ['I like coffee,', 'and I like tea.']

# Split text into paragraphs
text = "First paragraph.\n\nSecond paragraph."
paragraphs = split_paragraphs(text)
# ['First paragraph.', 'Second paragraph.']

# Split long lines at natural boundaries
text = "This is a very long sentence that needs to be split."
lines = split_long_lines(text, max_length=30)

Command Line Interface

# Split into sentences
phrasplit sentences input.txt -o output.txt

# Split into clauses
phrasplit clauses input.txt -o output.txt

# Split into paragraphs
phrasplit paragraphs input.txt -o output.txt

# Split long lines at 60 characters (the default maximum is 80)
phrasplit longlines input.txt -o output.txt --max-length 60

# Use a different spaCy model
phrasplit sentences input.txt --model en_core_web_lg

# Read from stdin (pipe or redirect)
echo "Hello world. This is a test." | phrasplit sentences
cat input.txt | phrasplit clauses -o output.txt

# Explicit stdin with dash
phrasplit sentences - < input.txt

API Reference

split_sentences(text, language_model="en_core_web_sm")

Split text into sentences using spaCy's sentence boundary detection.

Parameters:

  • text: Input text string
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of sentences
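To see why model-based boundary detection matters, compare it with a naive punctuation rule. The snippet below uses only the standard library and is not how phrasplit works; it shows the failure mode that abbreviation handling is meant to avoid.

```python
import re

text = "Dr. Smith is here. She has a Ph.D. in Chemistry."

# Naive rule: split at whitespace that follows '.', '!' or '?'.
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# ['Dr.', 'Smith is here.', 'She has a Ph.D.', 'in Chemistry.']
```

The naive rule breaks after "Dr." and "Ph.D.", whereas the Quick Start example shows split_sentences returning two intact sentences for the same input.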

split_clauses(text, language_model="en_core_web_sm")

Split text into comma-separated parts. Useful for creating natural pause points in audiobook/TTS applications.

Parameters:

  • text: Input text string
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of clauses (each comma stays at the end of its part)
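The "comma stays at the end" behaviour can be illustrated with a small regex sketch. This is an approximation for intuition only, not phrasplit's implementation (which uses a spaCy model); the function name `split_at_commas` is hypothetical.

```python
import re


def split_at_commas(sentence: str) -> list[str]:
    """Split after each comma that is followed by whitespace."""
    # The lookbehind keeps the comma attached to the preceding part.
    parts = re.split(r"(?<=,)\s+", sentence.strip())
    return [part for part in parts if part]


print(split_at_commas("I like coffee, and I like tea."))
# ['I like coffee,', 'and I like tea.']
```

Because the pattern requires whitespace after the comma, numbers such as "1,000" are left alone in this sketch.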

split_paragraphs(text)

Split text into paragraphs at double newlines.

Parameters:

  • text: Input text string

Returns: List of paragraphs
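As a rough illustration of what "split at double newlines" means, here is a standard-library sketch. How phrasplit itself treats runs of blank lines or whitespace-only lines is not stated above, so the handling here is an assumption; `split_on_blank_lines` is a hypothetical name.

```python
import re


def split_on_blank_lines(text: str) -> list[str]:
    """Split text wherever one or more blank lines separate two blocks."""
    # A newline, optional whitespace, then another newline marks a break.
    parts = re.split(r"\n\s*\n", text)
    # Assumption: surrounding whitespace is trimmed and empty parts dropped.
    return [part.strip() for part in parts if part.strip()]


print(split_on_blank_lines("First paragraph.\n\nSecond paragraph."))
# ['First paragraph.', 'Second paragraph.']
```

Single newlines inside a paragraph are left untouched, so hard-wrapped text stays together.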

split_long_lines(text, max_length, language_model="en_core_web_sm")

Split lines exceeding max_length at sentence/clause boundaries.

Parameters:

  • text: Input text string
  • max_length: Maximum line length in characters (must be >= 1)
  • language_model: spaCy model to use (default: "en_core_web_sm")

Returns: List of lines, each at most max_length characters long, except where a single word is itself longer than the limit

Raises: ValueError if max_length is less than 1
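The contract described above (length limit, overlong single words, ValueError for invalid limits) can be sketched with a plain greedy word wrap. Note the swapped-in technique: this sketch breaks at any word boundary, whereas split_long_lines is documented to prefer sentence and clause boundaries. The name `wrap_words` is hypothetical.

```python
def wrap_words(text: str, max_length: int) -> list[str]:
    """Greedy word wrap that mirrors the documented length contract."""
    if max_length < 1:
        raise ValueError("max_length must be >= 1")
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # An empty line always accepts the word, even if the word is too long.
        if len(candidate) <= max_length or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


for line in wrap_words("This is a very long sentence that needs to be split.", 30):
    print(len(line), line)
```

A word longer than max_length ends up alone on a line that exceeds the limit, which is the one exception the return description allows.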

Use Cases

Audiobook Creation

Split text at commas to create natural pause points for text-to-speech:

from phrasplit import split_clauses

text = "When the sun rose, the birds began to sing, and the day started."
parts = split_clauses(text)
# ['When the sun rose,', 'the birds began to sing,', 'and the day started.']

Subtitle Generation

Split long lines to fit subtitle constraints:

from phrasplit import split_long_lines

text = "This is a very long sentence that would not fit on a single subtitle line."
lines = split_long_lines(text, max_length=42)
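Since a single word longer than the limit can still produce an overlong line, a subtitle workflow may want to flag such lines before rendering. The helper below is a hypothetical sketch that works on any list of strings, such as the list returned by split_long_lines; the sample data is made up for illustration.

```python
def overlong_lines(lines: list[str], max_length: int) -> list[tuple[int, str]]:
    """Return (index, line) pairs for lines longer than max_length."""
    return [(i, line) for i, line in enumerate(lines) if len(line) > max_length]


# Stand-in data; in practice this would be the output of split_long_lines.
sample = ["Short line.", "Pneumonoultramicroscopicsilicovolcanoconiosis"]
for index, line in overlong_lines(sample, max_length=42):
    print(f"line {index} has {len(line)} characters")
```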

Text Processing Pipelines

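from phrasplit import split_paragraphs, split_sentences

# Read the whole book; the context manager closes the file afterwards.
# The UTF-8 encoding is an assumption about the input file.
with open("book.txt", encoding="utf-8") as f:
    text = f.read()

for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        process(sentence)  # placeholder for your own per-sentence handling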
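When each sentence becomes its own audio chunk or record, it is often useful to keep track of where it came from. The generator below is one possible sketch: `iter_sentences` is a hypothetical helper, and the splitters are passed in as arguments so the example runs without spaCy. In real use you would pass phrasplit's split_paragraphs and split_sentences instead of the stand-ins.

```python
from typing import Callable, Iterator

Splitter = Callable[[str], list[str]]


def iter_sentences(
    text: str, split_paragraphs: Splitter, split_sentences: Splitter
) -> Iterator[tuple[int, int, str]]:
    """Yield (paragraph_index, sentence_index, sentence) for the whole text."""
    for p_idx, paragraph in enumerate(split_paragraphs(text)):
        for s_idx, sentence in enumerate(split_sentences(paragraph)):
            yield p_idx, s_idx, sentence


# Stand-in splitters for demonstration only; they are not phrasplit's logic.
demo_paragraphs = lambda t: t.split("\n\n")
demo_sentences = lambda p: [p]

for p_idx, s_idx, sentence in iter_sentences(
    "First paragraph.\n\nSecond paragraph.", demo_paragraphs, demo_sentences
):
    print(f"p{p_idx:03d}_s{s_idx:03d}: {sentence}")
```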

Requirements

  • Python 3.9+
  • spaCy 3.5+
  • click 8.0+
  • rich 13.0+

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
