semchunk

semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

Installation 📦

semchunk may be installed with pip:

pip install semchunk

Usage 👩‍💻

The code snippet below demonstrates how text can be chunked with semchunk:

>>> import semchunk
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> # Count whitespace-separated words as tokens. If using `tiktoken`, you may instead use
>>> # `token_counter = lambda text: len(tiktoken.encoding_for_model(model).encode(text))`, where `model` is the name of your model.
>>> token_counter = lambda text: len(text.split())
>>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']

Chunk

def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]

chunk() splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

text is the text to be chunked.

chunk_size is the maximum number of tokens a chunk may contain.

token_counter is a callable that takes a string and returns the number of tokens in it.
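As a rough illustration of what such a callable can look like, the sketch below defines three standard-library token counters. The function names are hypothetical and are not part of semchunk; any callable that maps a string to an integer should serve the same purpose.

```python
import re


def word_counter(text: str) -> int:
    """Count whitespace-separated words, as in the usage example."""
    return len(text.split())


def char_counter(text: str) -> int:
    """Treat every character as one token."""
    return len(text)


def word_and_punct_counter(text: str) -> int:
    """Count runs of word characters and individual punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


sample = "The quick brown fox jumps over the lazy dog."
for counter in (word_counter, char_counter, word_and_punct_counter):
    print(counter.__name__, counter(sample))
```

Whichever counter is chosen, chunk_size is measured in that counter's units, so a chunk_size of 2 means two words with the first counter but two characters with the second.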

This function returns a list of chunks, each at most chunk_size tokens long, with any whitespace used to split the text removed.
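One way to sanity-check this guarantee is to re-count every returned chunk with the same token counter. The helper below is a hypothetical sketch, not part of semchunk; it is run here against the output shown in the usage example rather than against a live call.

```python
from typing import Callable


def oversized_chunks(
    chunks: list[str],
    chunk_size: int,
    token_counter: Callable[[str], int],
) -> list[str]:
    """Return the chunks whose token count exceeds chunk_size."""
    return [chunk for chunk in chunks if token_counter(chunk) > chunk_size]


# Same counter and output as in the usage example.
token_counter = lambda text: len(text.split())
chunks = ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']

# An empty list means every chunk respects the limit.
print(oversized_chunks(chunks, 2, token_counter))
```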

Methodology 🔬

semchunk works by recursively splitting text until every resulting chunk is no larger than the specified chunk size. In particular, it:

  1. Splits the text using the most semantically meaningful splitter possible;
  2. Recursively splits the resulting chunks until every chunk is no larger than the specified chunk size;
  3. Merges adjacent chunks that are under the chunk size back together, stopping before a merged chunk would exceed the chunk size; and
  4. Reattaches each non-whitespace splitter to the end of its chunk (except for the final chunk) if doing so does not push the chunk over the chunk size; otherwise, adds the splitter as its own chunk.
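The steps above can be approximated with a short recursive function. This is a simplified sketch and not semchunk's actual implementation: sketch_chunk and find_splitter are hypothetical names, only the whitespace splitters and the single-character fallback are handled, and step 4 (reattaching non-whitespace splitters) is left out entirely.

```python
import re
from typing import Callable

# Whitespace splitter tiers only, most meaningful first (an assumed subset).
WHITESPACE_TIERS = (r"[\r\n]+", r"\t+", r"\s+")


def find_splitter(text: str) -> str:
    """Return the longest match of the first tier found in text, or ''."""
    for pattern in WHITESPACE_TIERS:
        matches = re.findall(pattern, text)
        if matches:
            return max(matches, key=len)
    return ""  # no whitespace left: fall back to single characters


def sketch_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    # Stop when the text fits or cannot be split any further.
    if token_counter(text) <= chunk_size or len(text) <= 1:
        return [text]

    # Step 1: split on the most meaningful splitter available.
    splitter = find_splitter(text)
    parts = text.split(splitter) if splitter else list(text)

    chunks: list[str] = []
    current = None  # chunk currently being built by merging parts
    for part in parts:
        if not part:
            continue  # produced by leading, trailing or repeated splitters
        if token_counter(part) > chunk_size:
            # Step 2: recursively split parts that are still too large.
            if current is not None:
                chunks.append(current)
                current = None
            chunks.extend(sketch_chunk(part, chunk_size, token_counter))
            continue
        # Step 3: merge small parts back together while they still fit.
        candidate = part if current is None else current + splitter + part
        if token_counter(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current is not None:
        chunks.append(current)
    return chunks


word_counter = lambda text: len(text.split())
print(sketch_chunk('The quick brown fox jumps over the lazy dog.', 2, word_counter))
```

With the word-based counter from the usage example, the sketch produces the same five chunks shown there, although the real library may behave differently on text containing punctuation splitters.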

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:

  1. The largest sequence of newlines (\n) and/or carriage returns (\r);
  2. The largest sequence of tabs;
  3. The largest sequence of whitespace characters (as defined by regex's \s character class);
  4. Sentence terminators (., ?, ! and *);
  5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
  6. Sentence interrupters (:, — and …);
  7. Word joiners (/, \, –, & and -); and
  8. All other characters.
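The order of precedence above can be read as a lookup that returns the first tier with a match. The sketch below is an assumption about how such a lookup might be written, not semchunk's own code: the tier names and most_meaningful_splitter are hypothetical, and only ASCII characters from the list are included.

```python
import re

# (tier name, regex) pairs in order of precedence; ASCII characters only.
SPLITTER_TIERS = [
    ("newlines and carriage returns", r"[\r\n]+"),
    ("tabs", r"\t+"),
    ("whitespace", r"\s+"),
    ("sentence terminators", r"[.?!*]"),
    ("clause separators", r"[;,()\[\]'\"`]"),
    ("sentence interrupters", r":"),
    ("word joiners", r"[/\\&-]"),
]


def most_meaningful_splitter(text: str) -> tuple[str, str]:
    """Return (tier name, splitter) for the highest-precedence tier present.

    For the sequence-based tiers the longest matching sequence is returned.
    If nothing matches, the text can only be split into single characters.
    """
    for name, pattern in SPLITTER_TIERS:
        matches = re.findall(pattern, text)
        if matches:
            return name, max(matches, key=len)
    return "characters", ""


print(most_meaningful_splitter("First paragraph.\n\nSecond paragraph."))
print(most_meaningful_splitter("one,two;three"))
```

Because the tiers are checked in order, a text containing both a blank line and commas is split on the blank line first, and commas only come into play once no whitespace remains in a piece.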

Licence 📄

This library is licensed under the MIT License.
