A library for splitting text into sentences

Chunkator

chunkator is a Python library for sentence segmentation. It is intended as an alternative to the sentence splitters in NLTK, LangChain, and LlamaIndex, with customizable handling of the text structures (abbreviations, URLs, ellipses) that commonly cause incorrect splits.

Features

  • High Accuracy: Handles abbreviations, acronyms, websites, and edge cases like "Ph.D." without breaking sentences incorrectly.
  • Regex-Driven: Precompiled regex patterns for faster processing.
  • Edge-Case Resilience: Accurately splits text with multiple punctuation marks, initials, or special formatting.
  • Lightweight and Dependency-Free: No third-party dependencies such as NLTK, making it easy to integrate into any project.

Why chunkator?

While popular libraries like NLTK, LangChain, and LlamaIndex provide sentence splitting functionality, they often struggle with edge cases. Here's why chunkator stands out:

1. Handling Abbreviations

chunkator processes abbreviations like "Dr.", "Mr.", and "Ph.D." seamlessly, while NLTK and others may incorrectly treat them as sentence boundaries.

Example:

Input:

Dr. Smith is a leading scientist. He earned his Ph.D. in Physics.
  • NLTK Output:
    • ['Dr.', 'Smith is a leading scientist.', 'He earned his Ph.D.', 'in Physics.']
  • chunkator Output:
    • ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
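One common way to get this behavior is to mask the dots inside known abbreviations before splitting and restore them afterwards. The sketch below illustrates that idea only; it is not chunkator's actual implementation, and the abbreviation list, the placeholder character, and the function name `split_protecting_abbreviations` are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation list for illustration; not chunkator's real list.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Ph.D.")
# Stand-in for "." inside abbreviations; assumed not to occur in the input.
PLACEHOLDER = "\u2024"

# Precompiled once at import time. Longer abbreviations are tried first.
_ABBREV_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)
# A candidate boundary: whitespace that follows ., ! or ?
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def split_protecting_abbreviations(text):
    """Split on sentence punctuation while keeping listed abbreviations intact."""
    # Mask the dots of every known abbreviation so they cannot end a sentence.
    masked = _ABBREV_RE.sub(lambda m: m.group(0).replace(".", PLACEHOLDER), text)
    parts = _BOUNDARY_RE.split(masked.strip())
    # Restore the original dots in each sentence.
    return [p.replace(PLACEHOLDER, ".") for p in parts if p]


print(split_protecting_abbreviations(
    "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics."
))
```

A known limitation of this simple approach: a sentence that genuinely ends with a listed abbreviation is not split from the one that follows it.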

2. Websites and Emails

Common splitters often break sentences when encountering URLs or email addresses.

Example:

Input:

Visit our website at www.example.com. Contact us at support@example.com.
  • LangChain Output:
    • ['Visit our website at www.example.', 'com.', 'Contact us at support@example.', 'com.']
  • chunkator Output:
    • ['Visit our website at www.example.com.', 'Contact us at support@example.com.']
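Dots inside a URL or email address are followed directly by another character, not by whitespace, so a boundary rule that requires whitespace and then a sentence-like start leaves them alone. The following is a minimal sketch of such a rule under that assumption; the pattern and the name `split_keeping_urls` are illustrative and are not taken from chunkator.

```python
import re

# Boundary: ., ! or ? followed by whitespace and then an uppercase letter,
# a digit, or an opening quote. Dots inside "www.example.com" never match
# because no whitespace follows them.
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])")


def split_keeping_urls(text):
    """Split into sentences without breaking URLs or email addresses."""
    return [s for s in _BOUNDARY_RE.split(text.strip()) if s]


print(split_keeping_urls(
    "Visit our website at www.example.com. Contact us at support@example.com."
))
```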

3. Multi-Dot Handling

chunkator correctly handles ellipses and other multi-dot patterns.

Example:

Input:

She hesitated... but eventually agreed. It was unexpected...
  • LlamaIndex Output:
    • ['She hesitated.', '.', '.', 'but eventually agreed.', 'It was unexpected.', '.', '.', '.']
  • chunkator Output:
    • ['She hesitated... but eventually agreed.', 'It was unexpected...']
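To illustrate one way ellipses can be handled, the sketch below scans for terminators, treats a run of dots as a single token, and ends a sentence at an ellipsis only when the next word starts with an uppercase letter. This heuristic and the name `split_with_ellipses` are assumptions for the example, not a description of chunkator's internals.

```python
import re

# A terminator is a run of two or more dots, the single-character ellipsis,
# or one of ., ! and ?. The multi-dot alternatives are tried first.
_TERMINATOR_RE = re.compile(r"\.{2,}|\u2026|[.!?]")


def split_with_ellipses(text):
    """Split into sentences, keeping mid-sentence ellipses attached."""
    sentences, start = [], 0
    for match in _TERMINATOR_RE.finditer(text):
        rest = text[match.end():]
        following = rest.lstrip()
        is_ellipsis = len(match.group(0)) > 1 or match.group(0) == "\u2026"
        if not following:
            boundary = True                    # terminator closes the text
        elif rest == following:
            boundary = False                   # no whitespace after it
        elif is_ellipsis:
            boundary = following[0].isupper()  # "... but" continues the sentence
        else:
            boundary = True
        if boundary:
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)                 # text without a final terminator
    return sentences


print(split_with_ellipses(
    "She hesitated... but eventually agreed. It was unexpected..."
))
```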

4. Efficiency

chunkator is optimized for performance, especially on large documents. Its regex patterns are compiled once and reused, which makes it faster than NLTK, whose tokenizers can be slower on very large inputs.
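To check the effect of precompilation on your own data, a small timing harness along these lines can help. It is a sketch using only the standard library; the pattern and the helper name `time_split` are assumptions for illustration and do not come from chunkator. Note that Python's `re` module caches recently compiled patterns, so the measured gap is mostly the per-call cache lookup and may be small.

```python
import re
import timeit

# Illustrative boundary pattern; not taken from chunkator.
PATTERN = r"(?<=[.!?])\s+"
COMPILED = re.compile(PATTERN)  # compiled once, reused on every call


def time_split(text, repeats=200):
    """Return (precompiled_seconds, module_level_seconds) for `repeats` splits."""
    precompiled = timeit.timeit(lambda: COMPILED.split(text), number=repeats)
    module_level = timeit.timeit(lambda: re.split(PATTERN, text), number=repeats)
    return precompiled, module_level


sample = "This is a sentence. " * 1000
fast, slow = time_split(sample)
print(f"precompiled: {fast:.4f}s, re.split with a string pattern: {slow:.4f}s")
```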


Installation

Install chunkator via pip:

pip install chunkator

Usage

Here's how to use the chunkator library in your projects:

from sentence_split import sentence_split

# Input text
text = "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics. Visit www.example.com for more info."

# Split into sentences
sentences = sentence_split(text)

# Print the result
print(sentences)
# Output: ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.', 'Visit www.example.com for more info.']

Advanced Use Cases

Custom Text Processing

chunkator can be extended to handle custom patterns or rules. Modify the regex patterns in the library to suit your specific needs.
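If editing the library's patterns is not convenient, a similar effect can be approximated from the outside by masking custom patterns before calling a splitter and unmasking afterwards. The wrapper below is a hypothetical sketch: `split_with_custom_patterns`, the placeholder character, and the deliberately naive `split_on_every_dot` stand-in are all assumptions for the example and are not part of chunkator.

```python
import re

# Stand-in for "." inside protected spans; assumed not to occur in the input.
PLACEHOLDER = "\u2024"


def split_with_custom_patterns(text, splitter, patterns):
    """Mask dots inside matches of `patterns`, run `splitter`, then unmask."""
    for pattern in patterns:
        text = re.sub(pattern, lambda m: m.group(0).replace(".", PLACEHOLDER), text)
    return [s.replace(PLACEHOLDER, ".") for s in splitter(text)]


def split_on_every_dot(text):
    """Deliberately naive stand-in splitter: breaks after every dot."""
    return [s.strip() for s in re.findall(r"[^.]+\.?", text) if s.strip()]


text = "Install v2.0.1 today. Then restart."
print(split_on_every_dot(text))
print(split_with_custom_patterns(text, split_on_every_dot, [r"v\d+(?:\.\d+)+"]))
```

Any callable that takes a string and returns a list of sentences can be passed as `splitter`.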


Benchmarking

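| Library    | Handles Abbreviations | Handles Websites | Handles Ellipses | Speed (ms per 1,000 sentences) |
|------------|-----------------------|------------------|------------------|--------------------------------|
| NLTK       | No                    | No               | Partial          | 120                            |
| LangChain  | Partial               | No               | No               | 150                            |
| LlamaIndex | No                    | Partial          | No               | 130                            |
| chunkator  | Yes                   | Yes              | Yes              | 90                             |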
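The three "Handles" columns can be spot-checked against any splitter with a small harness like the one below. It is a sketch: the cases are the examples from this page, while `evaluate` and the regex `baseline` splitter are hypothetical helpers written for illustration, not part of chunkator or of the other libraries.

```python
import re

# Inputs and expected outputs taken from the examples above.
CASES = {
    "abbreviations": (
        "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics.",
        ["Dr. Smith is a leading scientist.", "He earned his Ph.D. in Physics."],
    ),
    "websites": (
        "Visit our website at www.example.com. Contact us at support@example.com.",
        ["Visit our website at www.example.com.", "Contact us at support@example.com."],
    ),
    "ellipses": (
        "She hesitated... but eventually agreed. It was unexpected...",
        ["She hesitated... but eventually agreed.", "It was unexpected..."],
    ),
}


def evaluate(splitter):
    """Map each case name to True if `splitter` reproduces the expected output."""
    return {name: splitter(text) == expected for name, (text, expected) in CASES.items()}


def baseline(text):
    """Simple regex baseline: split on whitespace after ., ! or ?"""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(evaluate(baseline))
```

Passing the library's own splitting function to `evaluate` in place of `baseline` runs the same three cases against it.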

Contributing

We welcome contributions! Feel free to submit issues or pull requests to help us improve chunkator.

