Real-Time Sentence Detection

Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.

Hint: If you're interested in state-of-the-art voice solutions you might also want to have a look at Linguflex, the original project from which stream2sentence is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

Features
Installation
Usage
Configuration
Contributing
License

Features

Generates sentences from a stream of text in real-time.
Customizable to balance speed against reliability.
Option to clean the output by removing links and emojis from the detected sentences.
Easy to configure and integrate.

Installation

pip install stream2sentence

Usage

Pass a generator of characters or text chunks to generate_sentences() to get a generator of sentences in return.

Here's a basic example:

from stream2sentence import generate_sentences

# Dummy generator for demonstration
def dummy_generator():
    yield "This is a sentence. And here's another! Yet, "
    yield "there's more. This ends now."

for sentence in generate_sentences(dummy_generator()):
    print(sentence)

This will output:

This is a sentence.
And here's another!
Yet, there's more.
This ends now.

A main use case of this library is fast text-to-speech synthesis from character feeds generated by large language models: the library gives the fastest possible access to a complete sentence or sentence fragment (using the quick_yield_single_sentence_fragment flag), which can then be synthesized in real time. This usage is demonstrated in the test_stream_from_llm.py file in the tests directory.
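As a rough sketch of that use case, the snippet below feeds a simulated token stream into generate_sentences() with quick_yield_single_sentence_fragment enabled. The fake_llm_tokens() and sentences_for_tts() helpers are hypothetical names introduced here for illustration; a real application would replace the fake stream with the streaming output of its LLM client. The import sits inside the function so the token helper can be tried without the package installed.

```python
import re
from typing import Iterator


def fake_llm_tokens(text: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM stream: yields word-sized chunks."""
    for match in re.finditer(r"\S+\s*", text):
        yield match.group(0)


def sentences_for_tts(chunks: Iterator[str]) -> Iterator[str]:
    """Yield sentences (and an early first fragment) ready for synthesis."""
    # Imported lazily; requires `pip install stream2sentence`.
    from stream2sentence import generate_sentences

    yield from generate_sentences(
        chunks,
        quick_yield_single_sentence_fragment=True,
    )
```

Iterating over sentences_for_tts(fake_llm_tokens("Well, that depends. Let me think.")) and handing each item to a speech engine would be one way to wire this up.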

Configuration

The generate_sentences() function offers various parameters to fine-tune its behavior:

Core Parameters

generator: Iterator[str]
- The primary input source, yielding chunks of text to be processed.
- Can be any iterator that emits text chunks of any size.
context_size: int = 12
- Number of characters considered for sentence boundary detection.
- Larger values improve accuracy but may increase latency.
- Default: 12 characters
context_size_look_overhead: int = 12
- Additional characters to examine beyond context_size for sentence splitting.
- Enhances sentence detection accuracy.
- Default: 12 characters
minimum_sentence_length: int = 10
- Minimum character count for a text chunk to be considered a sentence.
- Shorter fragments are buffered until this threshold is met.
- Default: 10 characters
minimum_first_fragment_length: int = 10
- Minimum character count required for the first sentence fragment.
- Ensures the initial output meets a specified length threshold.
- Default: 10 characters
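To make the buffering idea behind minimum_sentence_length concrete, here is a minimal sketch that merges short sentences until a length threshold is reached. It is an illustration only and not the library's implementation; merge_short_sentences() is a hypothetical helper, and the real function also takes context_size and the other parameters into account.

```python
from typing import Iterable, Iterator


def merge_short_sentences(
    sentences: Iterable[str],
    minimum_sentence_length: int = 10,
) -> Iterator[str]:
    """Hold back short pieces until the buffered text is long enough."""
    buffer = ""
    for sentence in sentences:
        # Append the new piece to whatever is still waiting in the buffer.
        buffer = f"{buffer} {sentence}".strip()
        if len(buffer) >= minimum_sentence_length:
            yield buffer
            buffer = ""
    if buffer:
        # Flush the remainder once the input is exhausted.
        yield buffer
```

With the default threshold of 10, the inputs "Hi.", "Ok." and "This is a longer sentence." are emitted together as a single item, because the first two are too short on their own.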

Yield Control

These parameters control how quickly and frequently the generator yields sentence fragments:

quick_yield_single_sentence_fragment: bool = False
- When True, yields the first fragment of the first sentence as quickly as possible.
- Useful for getting immediate output in real-time applications like speech synthesis.
- Default: False
quick_yield_for_all_sentences: bool = False
- When True, yields the first fragment of every sentence as quickly as possible.
- Extends the quick yield behavior to all sentences, not just the first one.
- Automatically sets quick_yield_single_sentence_fragment to True.
- Default: False
quick_yield_every_fragment: bool = False
- When True, yields every fragment of every sentence as quickly as possible.
- Provides the most granular output, yielding fragments as soon as they're detected.
- Automatically sets both quick_yield_for_all_sentences and quick_yield_single_sentence_fragment to True.
- Default: False
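The way these three flags imply one another can be sketched as follows. This only mirrors the cascade described above; QuickYieldFlags and resolve_quick_yield_flags() are hypothetical names, not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuickYieldFlags:
    single_sentence_fragment: bool = False
    for_all_sentences: bool = False
    every_fragment: bool = False


def resolve_quick_yield_flags(
    quick_yield_single_sentence_fragment: bool = False,
    quick_yield_for_all_sentences: bool = False,
    quick_yield_every_fragment: bool = False,
) -> QuickYieldFlags:
    """Return the effective flags after applying the documented cascade."""
    # The most granular flag switches on the one below it...
    if quick_yield_every_fragment:
        quick_yield_for_all_sentences = True
    # ...which in turn switches on the single-fragment flag.
    if quick_yield_for_all_sentences:
        quick_yield_single_sentence_fragment = True
    return QuickYieldFlags(
        single_sentence_fragment=quick_yield_single_sentence_fragment,
        for_all_sentences=quick_yield_for_all_sentences,
        every_fragment=quick_yield_every_fragment,
    )
```

In practice this means that passing only quick_yield_every_fragment=True is enough to enable all three behaviors.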

Text Cleanup

cleanup_text_links: bool = False
- When True, removes hyperlinks from the output sentences.
- Default: False
cleanup_text_emojis: bool = False
- When True, removes emoji characters from the output sentences.
- Default: False
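For readers wondering what this kind of cleanup involves, the sketch below strips links and emojis with regular expressions. The patterns are assumptions chosen for illustration and will likely differ from those the library uses; remove_links() and remove_emojis() are hypothetical helpers.

```python
import re

# Assumed patterns; the library's own rules may be broader or narrower.
LINK_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # emoji variation selector
    "]+"
)


def _collapse_spaces(text: str) -> str:
    """Squeeze runs of whitespace left behind by a removal."""
    return re.sub(r"\s{2,}", " ", text).strip()


def remove_links(text: str) -> str:
    return _collapse_spaces(LINK_PATTERN.sub("", text))


def remove_emojis(text: str) -> str:
    return _collapse_spaces(EMOJI_PATTERN.sub("", text))
```

Removing such tokens before speech synthesis avoids having a TTS engine read out URLs or stumble over symbols it cannot pronounce.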

Tokenization

tokenize_sentences: Callable = None
- Custom function for sentence tokenization.
- If None, uses the default tokenizer specified by tokenizer.
- Default: None
tokenizer: str = "nltk"
- Specifies the tokenizer to use. Options: "nltk" or "stanza"
- Default: "nltk"
language: str = "en"
- Language setting for the tokenizer.
- Use "en" for English, or "multilingual" with the Stanza tokenizer.
- Default: "en"
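A custom tokenizer might look like the sketch below. It assumes that tokenize_sentences receives the buffered text as a single string and returns a list of sentence strings; that contract is not spelled out above, so verify it against the library source before relying on it. simple_tokenizer() is a naive, hypothetical example that ignores abbreviations such as "e.g.".

```python
import re
from typing import Iterator, List


def simple_tokenizer(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def sentences_with_custom_tokenizer(chunks: Iterator[str]) -> Iterator[str]:
    """Run generate_sentences() with the tokenizer above plugged in."""
    # Imported lazily; requires `pip install stream2sentence`.
    from stream2sentence import generate_sentences

    yield from generate_sentences(chunks, tokenize_sentences=simple_tokenizer)
```

A dependency-free tokenizer like this can be handy where downloading NLTK or Stanza models is not an option, at the cost of less accurate boundaries.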

Debugging and Fine-tuning

log_characters: bool = False
- When True, logs each processed character to the console.
- Useful for debugging or monitoring real-time processing.
- Default: False
sentence_fragment_delimiters: str = ".?!;:,\n…)]}。-"
- Characters considered as potential sentence fragment delimiters.
- Used for quick yielding of sentence fragments.
- Default: ".?!;:,\n…)]}。-"
full_sentence_delimiters: str = ".?!\n…。"
- Characters considered as full sentence delimiters.
- Used for more definitive sentence boundary detection.
- Default: ".?!\n…。"
force_first_fragment_after_words: int = 15
- Forces the yield of the first sentence fragment after this many words.
- Ensures timely output even with long opening sentences.
- Default: 15 words
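The role of fragment delimiters in quick yielding can be illustrated with a small sketch: scan the buffer for the first delimiter that leaves a fragment of at least a minimum length, then split there. This is a simplified illustration rather than the library's algorithm, and split_first_fragment() is a hypothetical helper.

```python
from typing import Optional, Tuple

# Default fragment delimiters as listed above.
FRAGMENT_DELIMITERS = ".?!;:,\n…)]}。-"


def split_first_fragment(
    buffer: str,
    delimiters: str = FRAGMENT_DELIMITERS,
    minimum_length: int = 10,
) -> Tuple[Optional[str], str]:
    """Return (fragment, rest), or (None, buffer) if no split point exists yet."""
    for index, char in enumerate(buffer):
        # Accept the first delimiter that yields a long enough fragment.
        if char in delimiters and index + 1 >= minimum_length:
            return buffer[: index + 1], buffer[index + 1 :]
    return None, buffer
```

For the buffer "Well, that depends on the weather, I think." the comma after "Well" is skipped because the fragment would be too short, so the split happens at the second comma.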

Contributing

Any contributions you make are welcome and greatly appreciated.

Fork the Project.
Create your Feature Branch (git checkout -b feature/AmazingFeature).
Commit your Changes (git commit -m 'Add some AmazingFeature').
Push to the Branch (git push origin feature/AmazingFeature).
Open a Pull Request.

License

This project is licensed under the MIT License. For more details, see the LICENSE file.

Project created and maintained by Kolja Beigel.
