A fast and dynamic Python package for converting paragraphs into sentences with customizable thresholds for adaptive sentence segmentation.

sentok

Sentok is a fast and dynamic Python package for converting paragraphs into sentences. It offers customizable thresholds for adaptive sentence segmentation and is built on top of pandas for high performance and easy adjustment. The package allows you to easily convert paragraphs into a list of sentences or a DataFrame with probability columns.

Features

  • High Performance: Efficient handling of large texts.
  • Dynamic Configuration: Customizable parameters and regular expressions.
  • Simple Logic: Easy to understand and extend.
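
To make the idea of a customizable threshold concrete, here is a minimal sketch of threshold-based segmentation. It is not sentok's implementation: the feature names, weights, abbreviation list, and the functions `boundary_score` and `split_sentences` are all hypothetical, chosen only to show how a score per candidate boundary can be compared against a threshold.

```python
import re

# Hypothetical feature weights, for illustration only. sentok's real weights
# are whatever sentok.get_weights() reports.
WEIGHTS = {
    "followed_by_space": 0.25,
    "next_is_upper": 0.35,
    "prev_not_abbrev": 0.40,
}
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "e.g", "i.e", "etc"}
BOUNDARY = re.compile(r"[.!?]")


def boundary_score(text, pos):
    """Return a score in [0, 1] that the terminator at text[pos] ends a sentence."""
    after = text[pos + 1:]
    words = text[:pos].split()
    prev_word = words[-1].lower() if words else ""
    score = 0.0
    # Terminator is at the end of the text or followed by whitespace.
    if not after or after[0].isspace():
        score += WEIGHTS["followed_by_space"]
    # Nothing follows, or the next visible character is uppercase.
    if not after.strip() or after.lstrip()[0].isupper():
        score += WEIGHTS["next_is_upper"]
    # The word before the terminator is not a known abbreviation.
    if prev_word not in ABBREVIATIONS:
        score += WEIGHTS["prev_not_abbrev"]
    return score


def split_sentences(text, threshold=0.65):
    """Split text wherever a candidate boundary scores at or above threshold."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        if boundary_score(text, match.start()) >= threshold:
            sentence = text[start:match.end()].strip()
            if sentence:
                sentences.append(sentence)
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Dr. Smith is here. He paid 3.50 dollars. Is that ok?"
print(split_sentences(sample))                 # three sentences
print(split_sentences(sample, threshold=0.5))  # "Dr." is split off as well
```

In this sketch, lowering the threshold makes the splitter more eager, which is why the abbreviation "Dr." becomes its own sentence at 0.5 but not at 0.65.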

Installation

Via pip

To install the latest release from PyPI, use:

pip install sentok

Or, to install the latest version directly from the GitHub repository:

pip install git+https://github.com/kothiyarajesh/sentok.git

Building from Source

  1. Clone the repository:

    git clone https://github.com/kothiyarajesh/sentok.git
    
  2. Navigate to the project directory:

    cd sentok
    
  3. Install the package:

    python setup.py install
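
After any of the installation routes above, one way to confirm that the package is visible to your interpreter is to query its metadata with the standard library. The helper name `installed_version` is an assumption made for this sketch; `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("sentok")
print("sentok version:", found if found else "not installed")
```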
    

Usage

Python Script

Here’s a simple example of how to use the sentok library in a Python script:

import sentok

# Display current weights used by the tokenizer
# Uncomment the following line to view the current weights in use:
# print(sentok.get_weights())

# Adjust weights only if necessary for specific use cases
# For example, updating the set of start characters:
# sentok.set_weights({'start_chars': list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')})

# Sample text for sentence tokenization
text = """Natural language processing (NLP) is a captivating domain that merges computer science, artificial intelligence, and linguistics. It empowers computers to comprehend, interpret, and produce human language in a manner that is both useful and insightful. NLP finds application in various fields, such as text analysis, speech recognition, and machine translation. For example, advanced language models like GPT-3 have showcased exceptional skills in generating text that resembles human writing and in answering queries. As technology progresses, NLP continues to advance, enhancing its precision and expanding its scope of applications."""

# Tokenize the sample text into sentences. The default threshold is 0.65;
# here a slightly lower value (0.64) is passed explicitly.
# Adjust the threshold as needed based on your text's quality.
sentences = sentok.sent_tokenize(text, 0.64)

# Print each extracted sentence
for sentence in sentences:
    print('->', sentence)

# Print the total number of sentences extracted
print('Total Sentences:', len(sentences))

# Obtain a DataFrame with tokenization features for further analysis or model training:
df = sentok.get_sent_tokenize_df(text)
print(df)
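
When choosing a threshold for your own text, it can help to compare how many sentences different values produce. The sketch below is one way to do that. `sentence_counts` and `naive_tokenize` are hypothetical names: `naive_tokenize` is a stand-in with the same `(text, threshold)` call shape as the `sent_tokenize` call in the example above, so the block runs without sentok installed, and it does not reflect sentok's logic.

```python
import re


def sentence_counts(tokenize, text, thresholds):
    """Map each threshold to the number of sentences `tokenize` returns for it."""
    return {t: len(tokenize(text, t)) for t in thresholds}


def naive_tokenize(text, threshold):
    """Stand-in tokenizer (not sentok's logic).

    At thresholds of 0.5 or below it splits after every period; above that it
    splits only after a period followed by whitespace and an uppercase letter.
    """
    if threshold <= 0.5:
        pattern = r"(?<=\.)\s*"
    else:
        pattern = r"(?<=\.)\s+(?=[A-Z])"
    return [part for part in re.split(pattern, text) if part]


counts = sentence_counts(naive_tokenize, "v1.2 is out. It works.", [0.5, 0.65])
for threshold, count in counts.items():
    print(f"threshold={threshold}: {count} sentences")
```

With sentok installed, passing `sentok.sent_tokenize` as the `tokenize` argument should work the same way, assuming it accepts the text and threshold positionally as in the example above.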

License

This project is licensed under the MIT License. See the LICENSE file for details.
