Sentence Splitter

A Python package for sentence splitting using a pre-trained transformer model.

Description

Sentence Splitter is a Python package that provides accurate sentence segmentation using a transformer-based token classification model. The model is bundled with the package, eliminating the need for additional downloads or configurations. It's designed to handle long texts efficiently and supports GPU acceleration if available.

Features

  • Transformer-Based Model: Leverages a pre-trained transformer model for high-accuracy sentence splitting.
  • Bundled Model: The model and tokenizer are included with the package—no extra downloads required.
  • Easy to Use: Simple, quick integration into your projects.
  • Handles Long Texts: Efficiently processes long texts by splitting them into manageable chunks.
  • GPU Acceleration: Automatically utilizes CUDA if available for faster processing.

Installation

Install the package with pip. To install without PyTorch (for example, if you manage your own PyTorch installation):

pip install iges-sentence-splitter

Or, to install with GPU-enabled PyTorch (the quotes keep shells such as zsh from interpreting the brackets):

pip install "iges-sentence-splitter[torch]"

Requirements

  • Python 3.6 or higher
  • torch
  • transformers

Usage

Basic Example

from sentence_splitter.splitter import SentenceSplitter

# Initialize the splitter
splitter = SentenceSplitter()

# Input text
text = "This is a test. Here is another sentence. And yet another one!"

# Get sentences
sentences = splitter.split(text)

print(sentences)

Output:

['This is a test.', 'Here is another sentence.', 'And yet another one!']

Processing Long Texts

The split method can handle long texts by splitting them into chunks. You can adjust the parameters as needed:

sentences = splitter.split(
    text,
    max_seq_len=512,  # Maximum sequence length for each chunk
    stride=100,       # Overlap between chunks to preserve context
    batch_size=24,    # Number of chunks to process at once
)
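
The chunking described above can be pictured as a sliding window over the token sequence. The following is a minimal sketch, not the package's actual implementation: `chunk_tokens` is a hypothetical helper, and it assumes that `stride` counts the tokens shared by neighbouring chunks, as the parameter comment above describes.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], max_seq_len: int = 512, stride: int = 100
) -> List[List[int]]:
    """Split a token sequence into overlapping windows (hypothetical helper)."""
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    if not 0 <= stride < max_seq_len:
        raise ValueError("stride must satisfy 0 <= stride < max_seq_len")
    if len(tokens) == 0:
        return []

    # Each new window starts `step` tokens after the previous one,
    # so consecutive windows share exactly `stride` tokens.
    step = max_seq_len - stride
    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + max_seq_len]))
        if start + max_seq_len >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks


tokens = list(range(10))
chunks = chunk_tokens(tokens, max_seq_len=4, stride=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With this reading, a larger `stride` gives each chunk more context from its neighbour at the cost of processing more chunks overall.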

Reference

SentenceSplitter

A class for splitting text into sentences using a pre-trained transformer model.

Initialization

splitter = SentenceSplitter(device=None, efficient_mode=False)
  • Parameters:
    • device (str, optional): The device to run the model on ('cuda' or 'cpu'). Defaults to 'cuda' if available, otherwise 'cpu'.
    • efficient_mode (bool, optional): Whether to run the model in 8-bit precision for faster computation. Defaults to False.
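
The default for `device` described above could be resolved along the following lines. This is a sketch under assumptions rather than the package's own code: `resolve_device` is a hypothetical helper, and the fallback to `'cpu'` when PyTorch is not installed is an addition for illustration. `torch.cuda.is_available()` is the standard PyTorch call for detecting a usable CUDA device.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the requested device, or pick one automatically (hypothetical helper)."""
    if device is not None:
        return device  # an explicit choice always wins
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch there is no CUDA backend to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device())       # "cuda" or "cpu", depending on the machine
print(resolve_device("cpu"))  # the explicit value is returned unchanged
```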

Methods

  • split(text, max_seq_len=512, stride=100, batch_size=24)

    Splits the input text into sentences.

    • Parameters:
      • text (str): The text to split.
      • max_seq_len (int, optional): Maximum sequence length for the model. Defaults to 512.
      • stride (int, optional): Number of tokens to overlap between chunks. Defaults to 100.
      • batch_size (int, optional): Number of chunks to process simultaneously. Defaults to 24.
    • Returns:
      • List[str]: A list of sentences.
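
To illustrate what `batch_size` controls, the sketch below groups chunks into batches before they would be passed to a model. `iter_batches` is a hypothetical helper written for this example, not part of the package; it only shows the grouping, not the model call.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(chunks: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `batch_size` chunks (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(chunks), batch_size):
        yield list(chunks[start:start + batch_size])


batches = list(iter_batches(["c1", "c2", "c3", "c4", "c5"], batch_size=2))
print(batches)  # [['c1', 'c2'], ['c3', 'c4'], ['c5']]
```

A larger batch size generally means fewer forward passes but more memory per pass, so the best value depends on the available hardware.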

How It Works

The package uses a token classification model that labels each token as:

  • B: Beginning of a sentence.
  • E: End of a sentence.
  • I: Inside a sentence.

By processing the tokens and their predicted labels, the splitter reconstructs the sentences accurately, even in complex texts.
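
The reconstruction step can be sketched as follows. This is an illustration under assumptions, not the package's implementation: `sentences_from_labels` is a hypothetical function, the labels are written by hand rather than predicted, and whitespace-separated words stand in for the subword tokens a real transformer tokenizer would produce. Each token carries its character offsets, so a sentence is recovered by slicing the original text from the start of a B token to the end of the matching E token.

```python
import re
from typing import List, Optional, Tuple


def sentences_from_labels(text: str, tokens: List[Tuple[int, int, str]]) -> List[str]:
    """Rebuild sentences from (start, end, label) triples with labels B, I, or E."""
    sentences: List[str] = []
    sent_start: Optional[int] = None
    prev_end = 0
    for start, end, label in tokens:
        if label == "B" and sent_start is not None:
            # A new sentence begins before the previous one was closed: flush it.
            sentences.append(text[sent_start:prev_end])
            sent_start = None
        if sent_start is None:
            sent_start = start  # first token of the current sentence
        if label == "E":
            sentences.append(text[sent_start:end])
            sent_start = None
        prev_end = end
    if sent_start is not None:
        # Trailing tokens that never received an "E" label.
        sentences.append(text[sent_start:prev_end])
    return sentences


text = "This is a test. Here is another sentence."
labels = ["B", "I", "I", "E", "B", "I", "I", "E"]  # hand-written for illustration
tokens = [
    (match.start(), match.end(), label)
    for match, label in zip(re.finditer(r"\S+", text), labels)
]
result = sentences_from_labels(text, tokens)
print(result)  # ['This is a test.', 'Here is another sentence.']
```

Slicing the original text by offsets, rather than joining token strings, keeps the spacing and punctuation of each sentence exactly as written.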

Example: Splitting Complex Text

text = """
Despite the rain, the match continued. Players were determined; fans were cheering. 
"Unbelievable!" shouted the commentator. It's a night to remember.
"""

sentences = splitter.split(text)

for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")

Output:

Sentence 1: Despite the rain, the match continued.
Sentence 2: Players were determined; fans were cheering.
Sentence 3: "Unbelievable!" shouted the commentator.
Sentence 4: It's a night to remember.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Acknowledgments

Contact

For any questions or suggestions, feel free to reach out via email.


Download files

Source Distribution

iges_sentence_splitter-0.1.14.tar.gz (7.1 kB view details)

Uploaded Source

Built Distribution

iges_sentence_splitter-0.1.14-py3-none-any.whl (6.8 kB view details)

Uploaded Python 3

File details

Details for the file iges_sentence_splitter-0.1.14.tar.gz.

File metadata

  • Download URL: iges_sentence_splitter-0.1.14.tar.gz
  • Upload date:
  • Size: 7.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.64.1 urllib3/1.26.5 CPython/3.10.6

File hashes

Hashes for iges_sentence_splitter-0.1.14.tar.gz
  • SHA256: c57e56d22fddc89f7ba88f8a1d198dd822c444cca2469cfae6291a0d87f11334
  • MD5: cfe835aa874c430a4ee487d7e11762f4
  • BLAKE2b-256: 890bb3852ea2fb1fcac253b47086efe8e095ff8ae9bbc8141c4b591706fce1ef
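
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is one way to do it; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the archive is assumed to sit in the current working directory.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Assumption: the source distribution was downloaded into the current directory.
archive = Path("iges_sentence_splitter-0.1.14.tar.gz")
expected = "c57e56d22fddc89f7ba88f8a1d198dd822c444cca2469cfae6291a0d87f11334"
if archive.exists():
    print("hash matches:", matches_expected(archive, expected))
```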

File details

Details for the file iges_sentence_splitter-0.1.14-py3-none-any.whl.

File metadata

  • Download URL: iges_sentence_splitter-0.1.14-py3-none-any.whl
  • Upload date:
  • Size: 6.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.64.1 urllib3/1.26.5 CPython/3.10.6

File hashes

Hashes for iges_sentence_splitter-0.1.14-py3-none-any.whl
  • SHA256: 7962d0484e8c4ae1c9effd4b6e338bbf79e6ed3c2d61fa44dd4026495ebfc049
  • MD5: 322c5a746ecefc2fb5df04829ade39e2
  • BLAKE2b-256: 11b577db7ef472492b6f970adbaed2649eba9e7f3c5b00e4e8c32a39bd54e8fd
