
semchunk 🧩

semchunk by Isaacus is a fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.

It has built-in support for tokenizers from OpenAI's tiktoken and Hugging Face's transformers and tokenizers libraries, in addition to supporting custom tokenizers and token counters. It can also overlap chunks and return their offsets.

Powered by an efficient yet highly accurate chunking algorithm (How It Works 🔍), semchunk produces chunks that are more semantically meaningful than regular token and recursive character chunkers like langchain's RecursiveCharacterTextSplitter, while also being over 85% faster than its closest alternative, semantic-text-splitter (Benchmarks 📊).

semchunk is production ready, being used every day in the Isaacus API to split extremely long legal documents into more manageable chunks for our Kanon legal AI models.

Installation 📦

semchunk can be installed with pip:

pip install semchunk

semchunk is also available on conda-forge:

conda install conda-forge::semchunk
# or
conda install -c conda-forge semchunk

In addition, @dominictarro maintains a Rust port of semchunk named semchunk-rs.

Quickstart 👩‍💻

The code snippet below demonstrates how to chunk text with semchunk:

import semchunk
import tiktoken                        # `transformers` and `tiktoken` are not required.
from transformers import AutoTokenizer # They're just here for demonstration purposes.

chunk_size = 4 # A low chunk size is used here for demonstration purposes. Keep in mind, `semchunk`
               # does not know how many special tokens, if any, your tokenizer adds to every input,
               # so you may want to deduct the number of special tokens added from your chunk size.
text = 'The quick brown fox jumps over the lazy dog.'

# You can construct a chunker with `semchunk.chunkerify()` by passing the name of an OpenAI model,
# OpenAI `tiktoken` encoding or Hugging Face model, or a custom tokenizer that has an `encode()`
# method (like a `tiktoken`, `transformers` or `tokenizers` tokenizer) or a custom token counting
# function that takes a text and returns the number of tokens in it.
chunker = semchunk.chunkerify('isaacus/kanon-tokenizer', chunk_size) or \
          semchunk.chunkerify('gpt-4', chunk_size) or \
          semchunk.chunkerify('cl100k_base', chunk_size) or \
          semchunk.chunkerify(AutoTokenizer.from_pretrained('isaacus/kanon-tokenizer'), chunk_size) or \
          semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), chunk_size) or \
          semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

# If you give the resulting chunker a single text, it'll return a list of chunks. If you give it a
# list of texts, it'll return a list of lists of chunks.
assert chunker(text) == ['The quick brown fox', 'jumps over the', 'lazy dog.']
assert chunker([text], progress = True) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

# If you have a lot of texts and you want to speed things up, you can enable multiprocessing by
# setting `processes` to a number greater than 1.
assert chunker([text], processes = 2) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

# You can also pass an `offsets` argument to return the offsets of chunks, as well as an `overlap`
# argument to overlap chunks by a ratio (if < 1) or an absolute number of tokens (if >= 1).
chunks, offsets = chunker(text, offsets = True, overlap = 0.5)

Usage 🕹️

chunkerify()

def chunkerify(
    tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer | \
                                tokenizers.Tokenizer | Callable[[str], int],
    chunk_size: int | None = None,
    max_token_chars: int | None = None,
    memoize: bool = True,
    cache_maxsize: int | None = None,
) -> Callable[
    [str | Sequence[str], bool, bool, bool, int | float | None],
    list[str]
    | tuple[list[str], list[tuple[int, int]]]
    | list[list[str]]
    | tuple[list[list[str]], list[list[tuple[int, int]]]],
]:

chunkerify() constructs a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.

tokenizer_or_token_counter is either: the name of a tiktoken or transformers tokenizer (with priority given to the former); a tokenizer that possesses an encode attribute (e.g., a tiktoken, transformers or tokenizers tokenizer); or a token counter that returns the number of tokens in an input.

chunk_size is the maximum number of tokens a chunk may contain. It defaults to None, in which case it will be set to the value of the tokenizer's model_max_length attribute (less the number of tokens returned by tokenizing an empty string) where possible; otherwise, a ValueError will be raised.

max_token_chars is the maximum number of characters a token may contain. It is used to significantly speed up token counting for long inputs. It defaults to None, in which case it will either not be used or, if possible, will be set to the number of characters in the longest token in the tokenizer's vocabulary, as determined by the token_byte_values or get_vocab methods.

memoize flags whether to memoize the token counter. It defaults to True.

cache_maxsize is the maximum number of text-token count pairs that can be stored in the token counter's cache. It defaults to None, which makes the cache unbounded. This argument is only used if memoize is True.
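
For instance, here is a minimal sketch of constructing a chunker from a custom token counter with a bounded memoization cache; the word-based counter is illustrative only, not semchunk's own:

import semchunk

# An illustrative token counter that treats whitespace-delimited words as tokens.
def count_words(text: str) -> int:
    return len(text.split())

# Memoization is enabled by default; `cache_maxsize` bounds the cache.
chunker = semchunk.chunkerify(count_words, chunk_size = 4, memoize = True, cache_maxsize = 10_000)

assert chunker('The quick brown fox jumps over the lazy dog.') == ['The quick brown fox', 'jumps over the', 'lazy dog.']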

This function returns a chunker that takes either a single text or a sequence of texts. Depending on whether multiple texts were provided, the chunker returns a list, or a list of lists, of chunks up to chunk_size tokens long, with any whitespace used to split the text removed. If the chunker's optional offsets argument is True, it also returns a list, or a list of lists, of tuples of the form (start, end), where start is the index of the first character of a chunk in a text and end is the index of the character immediately after the last character of the chunk, such that chunks[i] == text[offsets[i][0]:offsets[i][1]].

The resulting chunker can be passed a processes argument that specifies the number of processes to be used when chunking multiple texts.

It is also possible to pass a progress argument which, if set to True and multiple texts are passed, will display a progress bar.

As described above, the offsets argument, if set to True, will cause the chunker to return the start and end offsets of each chunk.

The chunker accepts an overlap argument that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap. It defaults to None, in which case no overlapping occurs.
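
Putting those arguments together, a brief example (reusing the illustrative word-count chunker constructed above) that checks the documented offsets invariant:

text = 'The quick brown fox jumps over the lazy dog.'

# `offsets = True` returns the chunks alongside their (start, end) character offsets.
chunks, offsets = chunker(text, offsets = True)

for chunk, (start, end) in zip(chunks, offsets):
    assert chunk == text[start:end]

# An `overlap` of 0.5 overlaps chunks by half the chunk size.
overlapping_chunks, overlapping_offsets = chunker(text, offsets = True, overlap = 0.5)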

chunk()

def chunk(
    text: str,
    chunk_size: int,
    token_counter: Callable,
    memoize: bool = True,
    offsets: bool = False,
    overlap: float | int | None = None,
    cache_maxsize: int | None = None,
) -> list[str] | tuple[list[str], list[tuple[int, int]]]

chunk() splits a text into semantically meaningful chunks of a specified size as determined by the provided token counter.

text is the text to be chunked.

chunk_size is the maximum number of tokens a chunk may contain.

token_counter is a callable that takes a string and returns the number of tokens in it.

memoize flags whether to memoize the token counter. It defaults to True.

offsets flags whether to return the start and end offsets of each chunk. It defaults to False.

overlap specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap. It defaults to None, in which case no overlapping occurs.

cache_maxsize is the maximum number of text-token count pairs that can be stored in the token counter's cache. It defaults to None, which makes the cache unbounded. This argument is only used if memoize is True.

This function returns a list of chunks up to chunk_size tokens long, with any whitespace used to split the text removed. If offsets is True, it also returns a list of tuples of the form (start, end), where start is the index of the first character of a chunk in the original text and end is the index of the character immediately after the last character of the chunk, such that chunks[i] == text[offsets[i][0]:offsets[i][1]].
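
For example, a minimal call to chunk() with an illustrative word-based token counter (matching the Quickstart output above):

import semchunk

text = 'The quick brown fox jumps over the lazy dog.'

# A word-based token counter, used purely for illustration.
chunks = semchunk.chunk(text, chunk_size = 4, token_counter = lambda text: len(text.split()))
assert chunks == ['The quick brown fox', 'jumps over the', 'lazy dog.']

# With `offsets = True`, the chunks' (start, end) offsets are also returned.
chunks, offsets = semchunk.chunk(text, chunk_size = 4, token_counter = lambda text: len(text.split()), offsets = True)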

How It Works 🔍

semchunk works by recursively splitting texts until every resulting chunk is no longer than a specified chunk size. In particular, as sketched in code after the list below, it:

  1. Splits text using the most semantically meaningful splitter possible;
  2. Recursively splits the resulting chunks until every chunk is no longer than the specified chunk size;
  3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
  4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
  5. Since version 3.0.0, excludes chunks consisting entirely of whitespace characters.
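
The sketch below illustrates the split-recurse-merge idea in deliberately simplified form: it splits on whitespace only and omits semchunk's splitter hierarchy (described next), splitter reattachment and performance optimizations, so treat it as an illustration of the strategy rather than semchunk's actual implementation:

def naive_chunk(text: str, chunk_size: int, count) -> list[str]:
    # Base case: the text already fits within the chunk size.
    if count(text) <= chunk_size:
        return [text] if text.strip() else []

    # Steps 1 & 2: split on the most meaningful splitter available (here, just
    # whitespace) and recurse into any piece that is still too large.
    pieces = []
    for part in text.split():
        pieces.extend(naive_chunk(part, chunk_size, count))

    # Step 3: merge consecutive pieces back together for as long as they fit.
    chunks, current = [], ''
    for piece in pieces:
        candidate = f'{current} {piece}'.strip()
        if count(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)

    return chunks

assert naive_chunk('The quick brown fox jumps over the lazy dog.', 4, lambda t: len(t.split())) == ['The quick brown fox', 'jumps over the lazy', 'dog.']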

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence (see the example after this list):

  1. The largest sequence of newlines (\n) and/or carriage returns (\r);
  2. The largest sequence of tabs;
  3. The largest sequence of whitespace characters (as defined by regex's \s character class) or, since version 3.2.0, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
  4. Sentence terminators (., ?, ! and *);
  5. Clause separators (;, ,, (, ), [, ], ", ", ', ', ', " and `);
  6. Sentence interrupters (:, — and …);
  7. Word joiners (/, \, –, & and -); and
  8. All other characters.
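
As a rough illustration of this precedence, using an illustrative word-count chunker, a paragraph break should be preferred over the sentence terminators inside each paragraph:

import semchunk

chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size = 8)

# The double newline is the most semantically meaningful splitter here, so the
# text is expected to split at the paragraph break rather than mid-sentence.
text = 'First paragraph. Still the first paragraph.\n\nSecond paragraph here.'
print(chunker(text))
# Expected: ['First paragraph. Still the first paragraph.', 'Second paragraph here.']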

If overlapping chunks have been requested, semchunk also (see the worked example after this list):

  1. Internally reduces the chunk size to min(overlap, chunk_size - overlap) (overlap being computed as floor(chunk_size * overlap) for relative overlaps and min(overlap, chunk_size - 1) for absolute overlaps); and
  2. Merges every floor(original_chunk_size / reduced_chunk_size) chunks starting from the first chunk and then jumping by floor((original_chunk_size - overlap) / reduced_chunk_size) chunks until the last chunk is reached.
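
A worked example of these formulas, assuming a chunk size of 6 and a relative overlap of 0.5:

chunk_size = 6

# A relative overlap of 0.5 works out to floor(6 * 0.5) = 3 tokens.
overlap = int(chunk_size * 0.5)                             # 3

# The chunk size is internally reduced to min(3, 6 - 3) = 3.
reduced_chunk_size = min(overlap, chunk_size - overlap)     # 3

# Every floor(6 / 3) = 2 reduced chunks are merged into one full chunk...
merge_window = chunk_size // reduced_chunk_size             # 2

# ...jumping floor((6 - 3) / 3) = 1 reduced chunk at a time, so consecutive
# full chunks share 3 tokens.
step = (chunk_size - overlap) // reduced_chunk_size         # 1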

Benchmarks 📊

On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes semchunk 3.04 seconds to split every sample in NLTK's Gutenberg Corpus into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes semantic-text-splitter (with multiprocessing) 24.84 seconds to chunk the same texts into 512-token-long chunks — a difference of 87.76%.

The code used to benchmark semchunk and semantic-text-splitter is available here.

Licence 📄

This library is licensed under the MIT License.
