Real-Time Sentence Detection

Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.

Features

  • Generates sentences from a stream of text in real-time.
  • Configurable to balance speed against reliability.
  • Option to clean the output by removing links and emojis from the detected sentences.
  • Easy to configure and integrate.

Installation

pip install stream2sentence

Usage

Pass a generator of characters or text chunks to generate_sentences() to get a generator of sentences in return.

Here's a basic example:

from stream2sentence import generate_sentences

# Dummy generator for demonstration
def dummy_generator():
    yield "This is a sentence. And here's another! Yet, "
    yield "there's more. This ends now."

for sentence in generate_sentences(dummy_generator()):
    print(sentence)

This will output:

This is a sentence.
And here's another!
Yet, there's more.
This ends now.

A main use case of this library is fast text-to-speech synthesis from character feeds generated by large language models. The library gives the fastest possible access to a complete sentence, or to a sentence fragment when the quick_yield_single_sentence_fragment flag is set, which can then be synthesized in real time. This usage is demonstrated in the test_stream_from_llm.py file in the tests directory.
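As a minimal sketch of that use case, the snippet below feeds a simulated token stream into generate_sentences() with quick_yield_single_sentence_fragment enabled. The helpers fake_llm_stream and speak_stream are hypothetical names introduced for illustration only; a real application would replace the generator with its LLM client's streaming output and the print() call with its text-to-speech engine.

```python
import time
from typing import Iterator


def fake_llm_stream(text: str, chunk_size: int = 5, delay: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for an LLM client: yields `text` in small chunks."""
    for start in range(0, len(text), chunk_size):
        if delay:
            time.sleep(delay)  # simulate the time between incoming tokens
        yield text[start:start + chunk_size]


def speak_stream(chunks: Iterator[str]) -> None:
    """Print each sentence or fragment as soon as stream2sentence yields it."""
    # Imported here so the chunk generator above works without the library installed.
    from stream2sentence import generate_sentences

    for fragment in generate_sentences(chunks, quick_yield_single_sentence_fragment=True):
        # Placeholder: a real application would hand `fragment` to its TTS engine here.
        print(fragment)
```

Calling speak_stream(fake_llm_stream("Hello there. How are you today?", delay=0.05)) should print text while later chunks are still arriving. Exactly where the first fragment is cut depends on the library's detection settings.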

Configuration

The generate_sentences() function has the following parameters:

  • generator: Input generator of characters or text chunks.
    An iterator that emits chunks of text. The chunks can be of any size and are processed one by one to extract sentences from them. This is the source from which the function reads text and generates sentences.

  • quick_yield_single_sentence_fragment: Whether to yield a sentence fragment as early as possible.
    This option is intended for real-time speech synthesis, where you may want to start streaming audio for a minimal chunk of text as soon as possible, even if that means synthesizing mid-sentence. When set to True, the function yields a synthesizable fragment as soon as it identifies a potential sentence delimiter, without waiting for further context.
    Default is False. Set to True for faster but potentially less accurate sentence yields.

  • context_size: Context size for sentence detection.
    The number of characters around a potential delimiter (such as a period) that are considered when detecting sentence boundaries. A larger context size makes boundary detection more reliable but requires buffering more characters before a sentence is emitted.
    Default is 10 characters. Increasing this can help detect sentences more accurately, at the cost of added latency.

  • minimum_sentence_length: Minimum length of a sentence to be detected.
    The minimum number of characters a chunk of text must have before it is considered a potential sentence. This prevents very short character sequences from being mistaken for sentences. Shorter fragments are kept in the buffer rather than emitted.
    Default is 8 characters. Increasing this avoids emitting very short sentence fragments, at the cost of potentially missing some sentences.

  • cleanup_text_links: Option to remove links from the output sentences.
    When set to True, this option enables the function to identify and remove HTTP/HTTPS hyperlinks from the emitted output sentences. This helps clean up the output by avoiding unnecessary links.
    Default is False. Set to True if links are not required in the output.

  • cleanup_text_emojis: Option to remove emojis from the output sentences.
    If True, any Unicode emoji characters are identified and removed from the emitted output sentences. This can help to clean up the output.
    Default is False. Set to True if emojis are not required in the output.

  • log_characters: Option to log characters to the console.
    When enabled, each character processed by the function is printed to the console. This is mainly useful for debugging, to observe the flow of characters through the function. For example, it lets you print LLM output to the console while stream2sentence prepares the input for text-to-speech synthesis.
    Default is False. Set to True to see the characters as they are processed.
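The parameters above can be combined in a single call. The following is a sketch only: it assumes the documented parameter names are accepted as keyword arguments. The values are illustrative rather than recommendations from the project, and TTS_OPTIONS and sentences_for_tts are hypothetical names.

```python
from typing import Iterator

# Keyword arguments named after the parameters documented above.
# Values are illustrative; context_size and minimum_sentence_length
# are left at their documented defaults.
TTS_OPTIONS = {
    "quick_yield_single_sentence_fragment": True,  # favor low latency
    "context_size": 10,
    "minimum_sentence_length": 8,
    "cleanup_text_links": True,    # drop HTTP/HTTPS links from the output
    "cleanup_text_emojis": True,   # drop emoji characters from the output
    "log_characters": False,       # set to True to watch the input while debugging
}


def sentences_for_tts(chunks: Iterator[str]) -> Iterator[str]:
    """Hypothetical helper: run generate_sentences() with TTS_OPTIONS."""
    # Imported lazily so the options can be inspected without the library installed.
    from stream2sentence import generate_sentences

    return generate_sentences(chunks, **TTS_OPTIONS)
```

With the dummy_generator() from the Usage section, the helper would be used as: for sentence in sentences_for_tts(dummy_generator()): print(sentence). Raising context_size or minimum_sentence_length in TTS_OPTIONS trades latency for more reliable boundaries, as described above.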

Contributing

Any contributions you make are welcome and greatly appreciated.

  1. Fork the Project.
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature).
  3. Commit your Changes (git commit -m 'Add some AmazingFeature').
  4. Push to the Branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

License

This project is licensed under the MIT License. For more details, see the LICENSE file.


Project created and maintained by Kolja Beigel.
