A Python class that takes in long text as input and divides it into shorter chunks no longer than a specified length.

TextChunker

TextChunker is a Python class that takes in long text as input and divides it into shorter chunks no longer than a specified length. The purpose of this project is to provide a simple and useful tool for data processing tasks such as natural language processing and information extraction.

Install

pip install text_chunker

Usage

You can use the TextChunker class in your Python code as follows:

from text_chunker import TextChunker

# Create a new TextChunker object with a maximum chunk length of 1000 characters
chunker = TextChunker(maxlen=1000)

# Chunk a long text string into smaller chunks
text = "This is a long text string..."
for chunk in chunker.chunk(text):
    print(chunk)

The chunk method first tries to split the text at paragraph boundaries while keeping each chunk no longer than maxlen. If a paragraph is longer than maxlen, the method splits that paragraph into sentences. If a sentence is still longer than maxlen, it is cut into smaller pieces no longer than maxlen.
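
The fallback order described above (paragraph, then sentence, then a hard cut) can be illustrated with a minimal standard-library sketch. This is not the package's implementation: the names sketch_chunk, split_paragraphs, split_sentences and hard_split are hypothetical, paragraphs are assumed to be separated by blank lines, and a naive regular expression stands in for a real sentence tokenizer. The sketch also does not merge short paragraphs into one chunk, which the real class may or may not do.

```python
import re
from typing import Iterator, List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines (assumed paragraph delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    This is only a stand-in for a proper sentence tokenizer.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def hard_split(piece: str, maxlen: int) -> List[str]:
    """Cut a string into consecutive slices of at most maxlen characters."""
    return [piece[i:i + maxlen] for i in range(0, len(piece), maxlen)]


def sketch_chunk(text: str, maxlen: int) -> Iterator[str]:
    """Yield pieces no longer than maxlen: paragraph, else sentence, else hard cut."""
    for paragraph in split_paragraphs(text):
        if len(paragraph) <= maxlen:
            yield paragraph
            continue
        for sentence in split_sentences(paragraph):
            if len(sentence) <= maxlen:
                yield sentence
            else:
                yield from hard_split(sentence, maxlen)


if __name__ == "__main__":
    sample = "Short one.\n\nFirst sentence here. Second sentence here.\n\n" + "x" * 25
    for piece in sketch_chunk(sample, maxlen=20):
        print(len(piece), repr(piece))
```

Every yielded piece is at most maxlen characters long, at the cost of cutting mid-word when a single sentence exceeds the limit.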

There are also functions called paragraphs and sentences that divide the text into paragraphs and sentences, respectively.

from text_chunker import paragraphs

for p in paragraphs(text):
    print(p)

from text_chunker import sentences

for s in sentences(text):
    print(s)
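
Since both helpers can be iterated over as strings, their output can be regrouped with your own logic. As a hedged illustration, the hypothetical helper pack_pieces below (not part of text_chunker) greedily joins consecutive pieces with a separator for as long as the result stays within maxlen. It is shown with a plain list so that it runs without the package installed.

```python
from typing import Iterable, Iterator


def pack_pieces(pieces: Iterable[str], maxlen: int, sep: str = " ") -> Iterator[str]:
    """Greedily join consecutive pieces while the result stays within maxlen."""
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= maxlen:
            current = candidate
        else:
            if current:
                yield current
            # A piece longer than maxlen is passed through unchanged.
            current = piece
    if current:
        yield current


if __name__ == "__main__":
    parts = ["One.", "Two.", "Three is longer.", "Four."]
    for group in pack_pieces(parts, maxlen=12):
        print(repr(group))
```

With the real package, the list could be replaced by the result of sentences(text). Oversized pieces are deliberately left intact here, so a further split is needed if a strict length limit matters.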

The sentences function utilizes a tokenizer from the nltk library.

License

This project is distributed under the MIT license. See the LICENSE file for details.

Download files

Source Distribution

text_chunker-0.2.2.tar.gz (6.9 kB)

Uploaded Source

Built Distribution

text_chunker-0.2.2-py3-none-any.whl (3.9 kB)

Uploaded Python 3

File details

Details for the file text_chunker-0.2.2.tar.gz.

File metadata

  • Download URL: text_chunker-0.2.2.tar.gz
  • Upload date:
  • Size: 6.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.8

File hashes

Hashes for text_chunker-0.2.2.tar.gz
Algorithm Hash digest
SHA256 d4a15c7210787105d49dbc6b0731b40eb8dda99a60aa37a2d3848da5995b60c5
MD5 e4201e488f3cc3c3c2dbb52f5a8a3b40
BLAKE2b-256 b887345aa794a34028fd6513fac540116fde60daa48be9392736785d2ec4f03c

File details

Details for the file text_chunker-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: text_chunker-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 3.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.8

File hashes

Hashes for text_chunker-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 eb3a77e105fd7805cedb7782247efa8c1107c9d461c470ceecc95f2ea38e44b8
MD5 b336e0f4c22acbe7d952334e7c025fac
BLAKE2b-256 1248f62365014c7d518f8da3495a711ef8be75f26cee40ff6d6cd16326447130
