A Python class that takes in long text as input and divides it into shorter chunks no longer than a specified length.
Project description
TextChunker
TextChunker is a Python class that takes in long text as input and divides it into shorter chunks no longer than a specified length. The purpose of this project is to provide a simple and useful tool for data processing tasks such as natural language processing and information extraction.
Install
pip install text_chunker
Usage
You can use the TextChunker class in your Python code as follows:
from text_chunker import TextChunker
# Create a new TextChunker object with a maximum chunk length of 1000 characters
chunker = TextChunker(maxlen=1000)
# Chunk a long text string into smaller chunks
text = "This is a long text string..."
for chunk in chunker.chunk(text):
    print(chunk)
The chunk method attempts to split paragraphs first while keeping chunk length below maxlen. If a paragraph is longer than maxlen, the method attempts to split the paragraph into sentences. If a sentence is longer than maxlen, it is split into smaller chunks no longer than maxlen.
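The fallback order described above (paragraphs, then sentences, then fixed-width slices) can be illustrated with a minimal sketch. This is not the library's implementation: the function name `sketch_chunk` is hypothetical, paragraphs are assumed to be separated by blank lines, and a naive punctuation regex stands in for the sentence tokenizer. Whether the real method merges short pieces into one chunk is not stated here, so this sketch does not merge.

```python
import re
from typing import Iterator


def sketch_chunk(text: str, maxlen: int) -> Iterator[str]:
    """Yield pieces of text, each at most maxlen characters long."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= maxlen:
            yield paragraph
            continue
        # Assumption: a sentence ends with ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if not sentence:
                continue
            if len(sentence) <= maxlen:
                yield sentence
                continue
            # Last resort: cut the oversized sentence into fixed-width slices.
            for start in range(0, len(sentence), maxlen):
                yield sentence[start:start + maxlen]
```

Every piece yielded by this sketch respects the limit, at the cost of cutting mid-word when a single sentence exceeds maxlen.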
There are also functions called paragraphs and sentences that divide the text into paragraphs and sentences, respectively.
from text_chunker import paragraphs
for p in paragraphs(text):
    print(p)
from text_chunker import sentences
for s in sentences(text):
    print(s)
The sentences function utilizes a tokenizer from the nltk library.
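To make the two splitting steps concrete, here is a rough stand-in written with the standard library only. The names `split_paragraphs` and `split_sentences` are hypothetical and are not part of text_chunker. The blank-line paragraph rule is an assumption, and the punctuation regex is a deliberately naive substitute for the nltk tokenizer.

```python
import re
from typing import Iterator


def split_paragraphs(text: str) -> Iterator[str]:
    """Yield non-empty paragraphs, assuming blank lines separate them."""
    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        if block:
            yield block


def split_sentences(text: str) -> Iterator[str]:
    """Yield sentences using a naive punctuation rule (not the nltk tokenizer)."""
    for piece in re.split(r"(?<=[.!?])\s+", text.strip()):
        if piece:
            yield piece
```

A rule this simple will break after abbreviations such as "Dr.", which is one likely reason to prefer a trained tokenizer for real text.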
License
This project is distributed under the MIT license. See the LICENSE file for details.
Download files
Source Distribution
Built Distribution
File details
Details for the file text_chunker-0.2.2.tar.gz.
File metadata
- Download URL: text_chunker-0.2.2.tar.gz
- Upload date:
- Size: 6.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d4a15c7210787105d49dbc6b0731b40eb8dda99a60aa37a2d3848da5995b60c5 |
| MD5 | e4201e488f3cc3c3c2dbb52f5a8a3b40 |
| BLAKE2b-256 | b887345aa794a34028fd6513fac540116fde60daa48be9392736785d2ec4f03c |
File details
Details for the file text_chunker-0.2.2-py3-none-any.whl.
File metadata
- Download URL: text_chunker-0.2.2-py3-none-any.whl
- Upload date:
- Size: 3.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eb3a77e105fd7805cedb7782247efa8c1107c9d461c470ceecc95f2ea38e44b8 |
| MD5 | b336e0f4c22acbe7d952334e7c025fac |
| BLAKE2b-256 | 1248f62365014c7d518f8da3495a711ef8be75f26cee40ff6d6cd16326447130 |