Real-Time Sentence Detection
Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
Features
- Generates sentences from a stream of text in real-time.
- Customizable to fine-tune the balance between speed and reliability.
- Option to clean the output by removing links and emojis from the detected sentences.
- Easy to configure and integrate.
Installation
$ pip install stream2sentence
Usage
Pass a generator of characters or text chunks to generate_sentences() to get a generator of sentences in return.
Here's a basic example:
from stream2sentence import generate_sentences

# Dummy generator for demonstration
def dummy_generator():
    yield "This is a sentence. And here's another! Yet, "
    yield "there's more. This ends now."

for sentence in generate_sentences(dummy_generator()):
    print(sentence)
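The input can be any generator that yields strings, whether single characters or larger chunks. As a small sketch, the two helpers below (char_stream and chunk_stream are hypothetical names used only for illustration, not part of stream2sentence) turn a plain string into either kind of stream:

```python
from typing import Iterator


def char_stream(text: str) -> Iterator[str]:
    """Yield the text one character at a time."""
    for char in text:
        yield char


def chunk_stream(text: str, chunk_size: int = 16) -> Iterator[str]:
    """Yield the text in fixed-size chunks (the last chunk may be shorter)."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


text = "This is a sentence. And here's another!"
print(list(chunk_stream(text, 16)))

# Possible usage with the library (requires stream2sentence to be installed):
# from stream2sentence import generate_sentences
# for sentence in generate_sentences(char_stream(text)):
#     print(sentence)
```

Either helper could stand in for dummy_generator() in the basic example.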
Configuration
The generate_sentences() function has the following optional parameters:
- context_size: Context size for sentence detection.
  This controls how many characters around a potential delimiter (such as a period) are considered when detecting sentence boundaries. A larger context size allows more reliable boundary detection, but requires buffering more characters before a sentence is emitted.
  Default is 10 characters. Increasing this can help detect sentences more accurately, at the cost of added latency.
- minimum_sentence_length: Minimum length of a sentence to be detected.
  Specifies the minimum number of characters a chunk of text must have before it is considered a potential sentence. This ensures that very short sequences of characters are not mistakenly identified as sentences. Shorter fragments are not emitted on their own and stay in the buffer.
  Default is 8 characters. Increasing this avoids emitting very short sentence fragments, at the cost of potentially missing some sentences.
- remove_links: Option to remove links from the output sentences.
  When set to True, HTTP/HTTPS hyperlinks are identified and removed from the emitted sentences. This helps clean up the output by dropping unnecessary links.
  Default is False. Set to True if links are not required in the output.
- remove_emojis: Option to remove emojis from the output sentences.
  When set to True, Unicode emoji characters are identified and removed from the emitted sentences. This can help clean up the output.
  Default is False. Set to True if emojis are not required in the output.
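To make the trade-offs behind these parameters more concrete, here is a simplified, self-contained sketch. It is not the library's implementation: toy_sentences, its boundary rules (a delimiter followed by whitespace and not by a lowercase letter) and the link and emoji patterns are all assumptions chosen for illustration. It only mirrors the parameter names described above to show how a larger context delays a decision but can avoid a wrong split, and how short fragments stay in the buffer.

```python
import re
from typing import Iterable, Iterator, Optional

DELIMITERS = ".!?"
# Rough patterns for illustration only (assumptions, not the library's rules).
LINK_RE = re.compile(r"https?://\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def _find_boundary(buffer: str, context_size: int, min_length: int,
                   final: bool) -> Optional[int]:
    """Return the cut index after the first confirmed delimiter, else None."""
    for i, char in enumerate(buffer):
        if char not in DELIMITERS or i + 1 < min_length:
            continue  # not a delimiter, or the fragment is still too short
        lookahead = buffer[i + 1:i + 1 + context_size]
        if len(lookahead) < context_size and not final:
            return None  # not enough context yet: keep buffering
        if lookahead and not lookahead[0].isspace():
            continue  # e.g. the dot in "3.14" or inside a URL
        following = lookahead.lstrip()
        if following and following[0].islower():
            continue  # looks like an abbreviation, e.g. "approx. noon"
        return i + 1
    return None


def _clean(sentence: str, remove_links: bool, remove_emojis: bool) -> str:
    """Optionally strip links and emojis, then normalize whitespace."""
    if remove_links:
        sentence = LINK_RE.sub("", sentence)
    if remove_emojis:
        sentence = EMOJI_RE.sub("", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def toy_sentences(chunks: Iterable[str], context_size: int = 10,
                  minimum_sentence_length: int = 8,
                  remove_links: bool = False,
                  remove_emojis: bool = False) -> Iterator[str]:
    """Toy streaming splitter that mimics the documented parameters."""
    buffer = ""
    for chunk in chunks:
        for char in chunk:
            buffer += char
            cut = _find_boundary(buffer, context_size,
                                 minimum_sentence_length, final=False)
            if cut is not None:
                sentence = _clean(buffer[:cut], remove_links, remove_emojis)
                buffer = buffer[cut:].lstrip()
                if sentence:
                    yield sentence
    # End of stream: no more context will arrive, so flush what is left.
    while buffer.strip():
        cut = _find_boundary(buffer, context_size,
                             minimum_sentence_length, final=True) or len(buffer)
        sentence = _clean(buffer[:cut], remove_links, remove_emojis)
        buffer = buffer[cut:].lstrip()
        if sentence:
            yield sentence


demo = ["This is a sentence. And here's another! Yet, ",
        "there's more. This ends now."]
for sentence in toy_sentences(demo):
    print(sentence)
```

In this toy model, calling toy_sentences(["We met at approx. noon today. Fine."], context_size=1) splits after "approx." because only one character of context is visible, while context_size=10 sees the lowercase "noon" and keeps the sentence together, at the price of waiting for ten more characters before deciding.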
Contributing
Any contributions you make are welcome and greatly appreciated.
- Fork the Project.
- Create your Feature Branch (git checkout -b feature/AmazingFeature).
- Commit your Changes (git commit -m 'Add some AmazingFeature').
- Push to the Branch (git push origin feature/AmazingFeature).
- Open a Pull Request.
License
This project is licensed under the MIT License. For more details, see the LICENSE file.
Project created and maintained by Kolja Beigel.