semchunk

semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

Owing to its complex yet highly efficient chunking algorithm, semchunk is both more semantically accurate than langchain.text_splitter.RecursiveCharacterTextSplitter (see How It Works 🔍) and over 60% faster than semantic-text-splitter (see the Benchmarks 📊).

Installation 📦

semchunk may be installed with pip:

pip install semchunk

Usage 👩‍💻

The code snippet below demonstrates how text can be chunked with semchunk:

>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
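
Since `token_counter` may be any function that counts tokens, a tokenizer library is not strictly needed. The sketch below is one possible stand-in that uses only the standard library; the name `word_token_counter` and its counting rule (each word and each punctuation mark is one token) are assumptions made for illustration and will not match the counts of a real model tokenizer such as tiktoken.

```python
import re


def word_token_counter(text: str) -> int:
    """Hypothetical token counter: one token per word or punctuation mark."""
    # `\w+` matches a whole word; `[^\w\s]` matches a single punctuation mark.
    return len(re.findall(r"\w+|[^\w\s]", text))


# Nine words plus the final full stop.
print(word_token_counter("The quick brown fox jumps over the lazy dog."))  # 10
```

A function like this could be passed as the `token_counter` argument in place of the tiktoken-based lambda above.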

Chunk

def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]

chunk() splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

text is the text to be chunked.

chunk_size is the maximum number of tokens a chunk may contain.

token_counter is a callable that takes a string and returns the number of tokens in it.

This function returns a list of chunks, each at most chunk_size tokens long, with any whitespace used to split the text removed.
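
The contract described above (no chunk exceeds chunk_size according to token_counter) can be checked after the fact. The helper below is a minimal sketch of such a check; `oversized_chunks` and `count_words` are hypothetical names that are not part of semchunk, and the whitespace-based counter is only a stand-in for a real tokenizer.

```python
from typing import Callable


def oversized_chunks(
    chunks: list[str],
    chunk_size: int,
    token_counter: Callable[[str], int],
) -> list[str]:
    """Return every chunk whose token count exceeds chunk_size."""
    return [chunk for chunk in chunks if token_counter(chunk) > chunk_size]


def count_words(text: str) -> int:
    """Stand-in token counter (an assumption): one token per whitespace-separated word."""
    return len(text.split())


# The chunks shown in the usage example, checked against a limit of 2.
chunks = ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
print(oversized_chunks(chunks, 2, count_words))  # []
```

An empty result indicates that every chunk respects the limit under the counter that was supplied.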

How It Works 🔍

semchunk works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:

  1. Splits text using the most semantically meaningful splitter possible;
  2. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
  3. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
  4. Reattaches any non-whitespace splitters to the ends of chunks (barring the final chunk) if doing so does not bring chunks over the chunk size; otherwise, adds the non-whitespace splitters as their own chunks.
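
The first three steps above can be sketched in a few lines. This is a deliberately simplified illustration, not semchunk's actual implementation: it assumes only two kinds of splitter (the longest run of whitespace, then individual characters), omits step 4 entirely, and the names `sketch_chunk` and `count_words` are hypothetical.

```python
import re
from typing import Callable


def sketch_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    """Hypothetical, simplified split-then-merge chunker (not semchunk's code)."""
    # Step 1: split on the longest run of whitespace, or into single
    # characters when the text contains no whitespace at all.
    runs = re.findall(r"\s+", text)
    splitter = max(runs, key=len) if runs else ""
    parts = [p for p in text.split(splitter) if p] if splitter else list(text)

    chunks: list[str] = []
    i = 0
    while i < len(parts):
        part = parts[i]
        i += 1
        if token_counter(part) > chunk_size and len(part) > 1:
            # Step 2: this part is still too large, so split it again.
            chunks.extend(sketch_chunk(part, chunk_size, token_counter))
            continue
        # Step 3: merge the following parts while staying within the limit.
        # (A single character that is over the limit is kept as-is.)
        merged = part
        while i < len(parts):
            candidate = merged + splitter + parts[i]
            if token_counter(candidate) > chunk_size:
                break
            merged = candidate
            i += 1
        chunks.append(merged)
    return chunks


def count_words(text: str) -> int:
    """Stand-in token counter (an assumption): one token per whitespace-separated word."""
    return len(text.split())


print(sketch_chunk("The quick brown fox jumps over the lazy dog.", 2, count_words))
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

Merging only ever joins neighbouring parts produced by the same splitter, so the text inside each chunk appears exactly as it did in the input.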

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:

  1. The largest sequence of newlines (\n) and/or carriage returns (\r);
  2. The largest sequence of tabs;
  3. The largest sequence of whitespace characters (as defined by regex's \s character class);
  4. Sentence terminators (., ?, ! and *);
  5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
  6. Sentence interrupters (:, — and …);
  7. Word joiners (/, \, –, & and -); and
  8. All other characters.
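
The order of precedence above can be sketched as a function that returns the highest-ranked splitter present in a text. This is an illustration under stated assumptions rather than semchunk's own code: `select_splitter` and `NON_WHITESPACE_SPLITTERS` are hypothetical names, and only the ASCII characters from the list are included.

```python
import re

# Non-whitespace splitters in order of precedence (ASCII characters only).
NON_WHITESPACE_SPLITTERS = (
    ".", "?", "!", "*",                           # sentence terminators
    ";", ",", "(", ")", "[", "]", "'", '"', "`",  # clause separators
    ":",                                          # sentence interrupters
    "/", "\\", "&", "-",                          # word joiners
)


def select_splitter(text: str) -> tuple[str, bool]:
    """Return (splitter, is_whitespace) for the highest-precedence splitter in text."""
    # 1. The longest run of newlines and/or carriage returns.
    if "\n" in text or "\r" in text:
        return max(re.findall(r"[\r\n]+", text), key=len), True
    # 2. The longest run of tabs.
    if "\t" in text:
        return max(re.findall(r"\t+", text), key=len), True
    # 3. The longest run of any whitespace matched by `\s`.
    if re.search(r"\s", text):
        return max(re.findall(r"\s+", text), key=len), True
    # 4-7. The first non-whitespace splitter found, in order of precedence.
    for splitter in NON_WHITESPACE_SPLITTERS:
        if splitter in text:
            return splitter, False
    # 8. No splitter found: an empty string signals splitting into characters.
    return "", False


print(select_splitter("one\n\ntwo\nthree"))  # ('\n\n', True)
print(select_splitter("one,two-three"))      # (',', False)
```

Returning whether the splitter is whitespace alongside the splitter itself lets a caller decide whether it needs to be reattached to the chunks afterwards.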

Benchmarks 📊

On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes semchunk 35.75 seconds to split every sample in NLTK's Gutenberg Corpus into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes semantic-text-splitter 1 minute and 50.5 seconds (110.5 seconds) to chunk the same texts into 512-token-long chunks, so semchunk needs 67.65% less time.
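
The percentage follows directly from the two timings. The short calculation below reproduces it and also derives two figures that are not stated above but follow from the same numbers: the speed-up factor and semchunk's throughput in tokens per second.

```python
semchunk_seconds = 35.75
semantic_text_splitter_seconds = 1 * 60 + 50.5  # 1 minute and 50.5 seconds
total_tokens = 3_001_260

# Share of semantic-text-splitter's runtime that semchunk saves.
reduction = (semantic_text_splitter_seconds - semchunk_seconds) / semantic_text_splitter_seconds
# How many times faster semchunk finishes the same job.
speedup = semantic_text_splitter_seconds / semchunk_seconds
# Tokens processed per second by semchunk on this benchmark.
throughput = total_tokens / semchunk_seconds

print(f"{reduction:.2%}")     # 67.65%
print(f"{speedup:.2f}x")      # 3.09x
print(f"{throughput:,.0f}")   # 83,951
```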

The code used to benchmark semchunk and semantic-text-splitter is available here.

Licence 📄

This library is licensed under the MIT License.
