Skip to main content

Chunkipy is an easy-to-use library for chunking text based on the tokenizer function you provide.

Project description

Chunkipy

Python 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11 PyPI version

chunkipy is an extremely useful tool for segmenting long texts into smaller chunks, based on either a character or token count. With customizable chunk sizes and splitting strategies, chunkipy provides flexibility and control for various text processing tasks.

Motivation and Features

chunkipy was created to address the need within the field of natural language processing (NLP) to chunk text so that it does not exceed the input size of neural networks such as BERT, but it could be used for several other use cases.

The library offers some useful features:

  • Token estimation: unlike other text chunking libraries, chunkipy offers the possibility of providing a token estimator function, in order to build the chunks taking into account the tokenizer that will use those chunks.
  • Split text into meaningful sentences: in its default configuration, chunkipy, in creating the chunks, avoids cutting sentences, and always tries to have a complete and syntactically correct sentence. This is achieved through the use of the stanza library, which utilizes semantic models to cut text into meaningful sentences.
  • Smart Overlapping: chunkipy offers the possibility to define an overlap_percentage and create overlapping chunks to preserve the context along chunks.
  • Flexibility in choosing split strategies: Additionally, chunkipy offers complete flexibility in choosing how to split, allowing users to define their own text splitting function or choose from a list of pre-defined splitting strategies.

By default, chunkipy uses stanza are main text splitting method; however, if stanza produces sentences with a number of tokens greater than the chunk size, other split strategy are used. Here the list of predefined strategies, sorted by priority (the first one is executed first, if the chunk of text is larger than the chunk size, it is further split using a lower priority strategy).

Priority Name Effect
0 split_by_sentences It uses stanza to split the text into meaningful sentences.
1 split_by_semicolon It splits the text using the semicolon and space ; as separator.
2 split_by_colon It splits the text using the colon and space : as separator.
3 split_by_comma It splits the text using the comma and space , as separator.
4 split_by_word It splits the text using the space as separator.

Installation

You can install chunkipy using pip:

pip install chunkipy

Usage

The main class in chunkipy is TextChunker, which takes a text and splits it into chunks of a given size. You can use the default settings or specify custom parameters for the chunk size, whether to split by characters or tokens, the tokenizer function to use (if tokens is set to True`), and the list of split strategies to apply.

Here's an example of using it to chunk text that has a number of tokens greater than the input size of BERT:

from chunkipy import TextChunker, TokenEstimator
from transformers import AutoTokenizer

text = """Napoleon Bonaparte (born Napoleone Buonaparte; 15 August 1769 – 5 May 1821), later known by his regnal name Napoleon I, was a French military commander and political leader who rose to prominence during the French Revolution and led successful campaigns during the Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804, then Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy endures to this day, as a highly celebrated and controversial leader. He initiated many liberal reforms that have persisted in society, and is considered one of the greatest military commanders in history. His wars and campaigns are studied by militaries all over the world. Between three and six million civilians and soldiers perished in what became known as the Napoleonic Wars.
Napoleon was born on the island of Corsica, not long after its annexation by France, to a native family descending from minor Italian nobility. He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799 and became First Consul of the Republic.
Differences with the United Kingdom meant France faced the War of the Third Coalition by 1805. Napoleon shattered this coalition with victories in the Ulm campaign, and at the Battle of Austerlitz, which led to the dissolution of the Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him. Napoleon defeated Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, and defeated the Russians in June 1807 at Friedland, forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. Two years later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after triumphing at the Battle of Wagram.
Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian Peninsula and declared his brother Joseph the King of Spain in 1808. The Spanish and the Portuguese revolted in the Peninsular War aided by a British army, culminating in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France, resulting in a large coalition army defeating Napoleon at the Battle of Leipzig. The coalition invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy. In France, the Bourbons were restored to power."""

class BertTokenEstimator(TokenEstimator):
    def __init__(self):
        self.bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def estimate_tokens(self, text):
        return len(self.bert_tokenizer.encode(text))


bert_token_estimator = BertTokenEstimator()

print(f"Num of chars: {len(text)}")
# --> Num of chars: 3149
print(f"Num of tokens (using BertTokenEstimator): {bert_token_estimator.estimate_tokens(text)}")
# --> Num of tokens (using BertTokenEstimator): 603

# This creates an instance of the TextChunker class with a chunk size of 512, 
# using token-based segmentation and a custom tokenizer function bert_tokenizer.encode to count tokens.


text_chunker = TextChunker(512, tokens=True, token_estimator=BertTokenEstimator())
chunks = text_chunker.chunk(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}, num of tokens: {bert_token_estimator.estimate_tokens(chunk)} -> {chunk}")

This code returns:

Chunk 1, num of tokens 476: -> Napoleon Bonaparte (born Napoleone Buonaparte; 15 August 1769 – 5 May 1821), later known by his regnal name Napoleon I, was a French military commander and political leader who rose to prominence during the French Revolution and led successful campaigns during the Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804, then Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy endures to this day, as a highly celebrated and controversial leader. He initiated many liberal reforms that have persisted in society, and is considered one of the greatest military commanders in history. His wars and campaigns are studied by militaries all over the world. Between three and six million civilians and soldiers perished in what became known as the Napoleonic Wars. Napoleon was born on the island of Corsica, not long after its annexation by France, to a native family descending from minor Italian nobility. He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799 and became First Consul of the Republic. Differences with the United Kingdom meant France faced the War of the Third Coalition by 1805. Napoleon shattered this coalition with victories in the Ulm campaign, and at the Battle of Austerlitz, which led to the dissolution of the Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him. Napoleon defeated Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, and defeated the Russians in June 1807 at Friedland, forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. Two years later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after triumphing at the Battle of Wagram. Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian Peninsula and declared his brother Joseph the King of Spain in 1808.
Chunk 2, num of tokens 129: -> The Spanish and the Portuguese revolted in the Peninsular War aided by a British army, culminating in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France, resulting in a large coalition army defeating Napoleon at the Battle of Leipzig. The coalition invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy. In France, the Bourbons were restored to power.

As you can see, chunkipy created chunks smaller than the given input size, 512, counted the tokens using BERT's tokenizer, and didn't cut sentences in half, producing syntactically correct chunks of text.

You can also use TextChunker's split_text() method to split a text into smaller parts without actually creating the chunks. This can be useful, for example, when you want to apply further processing to the text parts before creating the final chunks.

Overlapping

TextChunker provides you with the overlap functionality. The system aims to ensure that the last text parts of each segment do not exceed the maximum token limit defined for overlap. For example, if the chunk_size is set to 100 and the overlap_percentage is 0.1, the maximum number of overlapping tokens is 10. Consequently, if there are sentences or text parts that fit within this token limit, they are added at the beginning of the next segment. If not, they are skipped to maintain a good ratio between overlap and content.

text_chunker = TextChunker(50, tokens=True, overlap_percent=0.3)

# Set up test input
text = "In this unit test, we are evaluating the overlapping functionality." \
       "This is a feature of the TextChunker class, which is important for a proper context keeping. The " \
       "goal is to ensure that overlapping chunks are generated correctly. For this purpose, we have chosen a " \
       "long text that exceeds 100 tokens. By setting the overlap_percent to 0.3, we expect the " \
       "generated chunks to have an overlap of approximately 30%. This will help us verify the effectiveness " \
       "of the overlapping feature. The TextChunker class should be able to handle this scenario and " \
       "produce the expected results. Let's proceed with running the test and asserting the generated chunks " \
       "for proper overlap. "

# Generate chunks with overlapping
chunks = text_chunker.chunk(text)

# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk}")

This would output:

Chunk 1: In this unit test, we are evaluating the overlapping functionality. This is a feature of the TextChunker class, which is important for a proper context keeping. The goal is to ensure that overlapping chunks are generated correctly. For this purpose, we have chosen a long text that exceeds 100 tokens.
Chunk 2: For this purpose, we have chosen a long text that exceeds 100 tokens. By setting the overlap_percent to 0.3, we expect the generated chunks to have an overlap of approximately 30%. This will help us verify the effectiveness of the overlapping feature.
Chunk 3: This will help us verify the effectiveness of the overlapping feature. The TextChunker class should be able to handle this scenario and produce the expected results. Let's proceed with running the test and asserting the generated chunks for proper overlap.

Provide a custom text split strategy

If you don't want to use stanza or you need to use your custom text splitting logic, you can provide it in the constructor, as shown in this example:

from chunkipy import TextChunker


def split_by_arrow(text):
    return [t for t in text.split("->") if t != '' and ' ']


text = "This is a tokenized text -> with custom split strategy."


# Create a TextChunker object with custom split strategy
text_chunker = TextChunker(chunk_size=8, tokens=True,
                           split_strategies=[split_by_arrow])
chunks = text_chunker.chunk(text)

# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")

This would output:

Chunk 1: This is a tokenized text
Chunk 2: with custom split strategy.

Contributing

If you find a bug or have a feature request, please open an issue on GitHub. Contributions are welcome! Just fork the repository, create a new branch with your changes, and submit a pull request. Please make sure to write tests for your changes and to follow the code style.

License

TextChunker is licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

chunkipy-0.0.3-py3-none-any.whl (9.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page