Sentence Splitter
A Python package for sentence splitting using a pre-trained transformer model.
Description
Sentence Splitter is a Python package that provides accurate sentence segmentation using a transformer-based token classification model. The model is bundled with the package, eliminating the need for additional downloads or configurations. It's designed to handle long texts efficiently and supports GPU acceleration if available.
Features
- Transformer-Based Model: Leverages a pre-trained transformer model for high-accuracy sentence splitting.
- Bundled Model: The model and tokenizer are included with the package—no extra downloads required.
- Easy to Use: Simple, quick integration into your projects.
- Handles Long Texts: Efficiently processes long texts by splitting them into manageable chunks.
- GPU Acceleration: Automatically utilizes CUDA if available for faster processing.
Installation
Install the package via pip without PyTorch (if you want to manage your own PyTorch installation):
pip install iges-sentence-splitter
or install it with GPU-enabled PyTorch:
pip install iges-sentence-splitter[torch]
Requirements
- Python 3.6 or higher
- torch
- transformers
Usage
Basic Example
from sentence_splitter.splitter import SentenceSplitter
# Initialize the splitter
splitter = SentenceSplitter()
# Input text
text = "This is a test. Here is another sentence. And yet another one!"
# Get sentences
sentences = splitter.split(text)
print(sentences)
Output:
['This is a test.', 'Here is another sentence.', 'And yet another one!']
Processing Long Texts
The split method can handle long texts by splitting them into chunks. You can adjust the parameters as needed:
sentences = splitter.split(
text,
max_seq_len=512, # Maximum sequence length for each chunk
stride=100, # Overlap between chunks to preserve context
batch_size=24 # Number of chunks to process at once
)
Reference
SentenceSplitter
A class for splitting text into sentences using a pre-trained transformer model.
Initialization
splitter = SentenceSplitter(device=None, efficient_mode=False)
- Parameters:
  - device (str, optional): The device to run the model on ('cuda' or 'cpu'). Defaults to 'cuda' if available, otherwise 'cpu'.
  - efficient_mode (bool, optional): Whether to run the model in 8-bit precision for faster computation. Defaults to False.
Methods
- split(text, max_seq_len=512, stride=100, batch_size=4)
  Splits the input text into sentences.
  - Parameters:
    - text (str): The text to split.
    - max_seq_len (int, optional): Maximum sequence length for the model. Defaults to 512.
    - stride (int, optional): Number of tokens to overlap between chunks. Defaults to 100.
    - batch_size (int, optional): Number of chunks to process simultaneously. Defaults to 4.
  - Returns:
    - List[str]: A list of sentences.
How It Works
The package uses a token classification model that labels each token as:
- B: Beginning of a sentence.
- E: End of a sentence.
- I: Inside a sentence.
By processing the tokens and their predicted labels, the splitter reconstructs the sentences accurately, even in complex texts.
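As an illustration, reconstructing sentences from B/E/I labels could look like the following. This is a simplified pure-Python sketch of the decoding idea, not the package's actual logic; real model tokens are subword pieces rather than whole words:

```python
def reconstruct(tokens, labels):
    """Group tokens into sentences using B/E/I labels:
    B starts a new sentence, E closes the current one, I continues it."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B" and current:
            # A new sentence begins; flush whatever was accumulated,
            # even if the previous sentence never got an explicit E.
            sentences.append(" ".join(current))
            current = []
        current.append(token)
        if label == "E":
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any unfinished tail
        sentences.append(" ".join(current))
    return sentences

tokens = ["This", "is", "a", "test.", "Here", "is", "another."]
labels = ["B", "I", "I", "E", "B", "I", "E"]
print(reconstruct(tokens, labels))
# → ['This is a test.', 'Here is another.']
```

Flushing on both B and E makes the decoding robust to a missing end label, which matters when a sentence boundary falls near a chunk edge.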
Example: Splitting Complex Text
text = """
Despite the rain, the match continued. Players were determined; fans were cheering.
"Unbelievable!" shouted the commentator. It's a night to remember.
"""
sentences = splitter.split(text)
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
Output:
Sentence 1: Despite the rain, the match continued.
Sentence 2: Players were determined; fans were cheering.
Sentence 3: "Unbelievable!" shouted the commentator.
Sentence 4: It's a night to remember.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
- Kathryn Chapman
- Email: kathryn.chapman@iges.com
Acknowledgments
- Hugging Face Transformers for the transformer models.
- PyTorch for the deep learning framework.
Contact
For any questions or suggestions, feel free to reach out via email.
File details
Details for the file iges_sentence_splitter-0.1.15.tar.gz.
File metadata
- Download URL: iges_sentence_splitter-0.1.15.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.64.1 urllib3/1.26.5 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3e7edba7bfe8c760f335524e44047dba8399f633e26968646bba3d0b26dad4b8 |
| MD5 | 9b8f0000a0928bb1e3c834d859163f3a |
| BLAKE2b-256 | 4c0655932889321e0e03576239f771794e54220c280872ad69f1e02a4ccb0136 |
File details
Details for the file iges_sentence_splitter-0.1.15-py3-none-any.whl.
File metadata
- Download URL: iges_sentence_splitter-0.1.15-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.64.1 urllib3/1.26.5 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7cb2217b1b988556278a6c697b5d64eacbf89a00fb849bbf186d44258cef3bfe |
| MD5 | bc8cb74c8cb32123a23745246087ac1b |
| BLAKE2b-256 | ec7f1ddb6cc18afc2ee0976aa9acc75e4a71fe1e4977ac94166921b525a431f4 |