A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

semchunk

semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

Owing to its complex yet highly efficient chunking algorithm, semchunk is both more semantically accurate than langchain.text_splitter.RecursiveCharacterTextSplitter (see How It Works 🔍) and over 60% faster than semantic-text-splitter (see Benchmarks 📊).

Installation 📦

semchunk may be installed with pip:

pip install semchunk

Usage 👩‍💻

The code snippet below demonstrates how text can be chunked with semchunk:

>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']

Chunk

def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]

chunk() splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

text is the text to be chunked.

chunk_size is the maximum number of tokens a chunk may contain.

token_counter is a callable that takes a string and returns the number of tokens in it.

This function returns a list of chunks, each at most chunk_size tokens long, with any whitespace used to split the text removed.
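Because token_counter can be any callable that maps a string to a token count, a counter does not have to depend on tiktoken. The sketch below is an illustration only: whitespace_token_counter and chunks_fit are hypothetical helper names that are not part of semchunk, and counting whitespace-separated words is merely a rough stand-in for a real tokenizer.

```python
from typing import Callable


def whitespace_token_counter(text: str) -> int:
    """Hypothetical counter: treats each whitespace-separated word as one token."""
    return len(text.split())


def chunks_fit(chunks: list[str], chunk_size: int, token_counter: Callable[[str], int]) -> bool:
    """Return True if every chunk is within the chunk size under the given counter."""
    return all(token_counter(chunk) <= chunk_size for chunk in chunks)


# The chunks shown in the usage example, checked against the stand-in counter.
chunks = ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
print(chunks_fit(chunks, chunk_size=2, token_counter=whitespace_token_counter))  # True
```

A check like chunks_fit can be handy when swapping one token counter for another, since the chunk size is only meaningful relative to the counter that was used.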

How It Works 🔍

semchunk works by recursively splitting texts until every resulting chunk is no larger than the specified chunk size. In particular, it:

1. Splits text using the most semantically meaningful splitter possible;
2. Recursively splits the resulting chunks until every chunk is no larger than the specified chunk size;
3. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
4. Reattaches any non-whitespace splitters to the ends of chunks, barring the final chunk, if doing so does not bring chunks over the chunk size; otherwise, it adds non-whitespace splitters as their own chunks.
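The first three steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not semchunk's actual implementation: simple_chunk is a hypothetical name, only newline and whitespace splitters are modelled (with a fall-back to single characters), and the reattachment of non-whitespace splitters is left out entirely.

```python
import re
from typing import Callable


def simple_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    """Simplified split -> recurse -> merge sketch; not semchunk's real code."""
    # Step 1: pick the coarsest separator present. Only two whitespace-based
    # levels are modelled here; anything else falls back to single characters.
    separator = ''
    for pattern in (r'[\r\n]+', r'\s+'):
        matches = re.findall(pattern, text)
        if matches:
            separator = max(matches, key=len)  # the largest run at this level
            break
    parts = text.split(separator) if separator else list(text)

    # Step 2: recursively split any part that is still over the chunk size.
    pieces: list[str] = []
    for part in parts:
        if not part:
            continue  # skip empty strings left behind by adjacent separators
        if len(part) > 1 and token_counter(part) > chunk_size:
            pieces.extend(simple_chunk(part, chunk_size, token_counter))
        else:
            pieces.append(part)

    # Step 3: greedily merge neighbouring pieces while the result still fits.
    chunks: list[str] = []
    current = ''
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if current and token_counter(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def word_counter(text: str) -> int:
    """Stand-in token counter that counts whitespace-separated words."""
    return len(text.split())


print(simple_chunk('The quick brown fox jumps over the lazy dog.', 2, word_counter))
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

The recursion terminates because every part produced by a split is strictly shorter than the text it came from, and single characters are never split further.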

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:

1. The largest sequence of newlines (\n) and/or carriage returns (\r);
2. The largest sequence of tabs;
3. The largest sequence of whitespace characters (as defined by regex's \s character class);
4. Sentence terminators (., ?, ! and *);
5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
6. Sentence interrupters (:, — and …);
7. Word joiners (/, \, –, & and -); and
8. All other characters.
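One way to picture this order of precedence is as an ordered list of patterns that is scanned from top to bottom. The sketch below is an assumption-laden illustration rather than the library's code: SPLITTER_PATTERNS, pick_splitter and the exact regular expressions are hypothetical and were written from the list above.

```python
import re

# Ordered from most to least semantically meaningful. The level names and
# regexes are assumptions derived from the list above, not semchunk's source.
SPLITTER_PATTERNS = [
    ('newlines', r'[\r\n]+'),
    ('tabs', r'\t+'),
    ('whitespace', r'\s+'),
    ('sentence terminators', r'[.?!*]'),
    ('clause separators', r'[;,()\[\]“”‘’\'"`]'),
    ('sentence interrupters', r'[:—…]'),
    ('word joiners', r'[/\\–&-]'),
]


def pick_splitter(text: str) -> tuple[str, str]:
    """Return (level name, splitter) for the highest-precedence splitter found.

    For the run-based levels, the longest run in the text is returned.
    If no splitter is present, ('characters', '') signals a per-character split.
    """
    for name, pattern in SPLITTER_PATTERNS:
        matches = re.findall(pattern, text)
        if matches:
            return name, max(matches, key=len)
    return 'characters', ''


print(pick_splitter('First paragraph.\n\nSecond paragraph.\nThird.'))  # ('newlines', '\n\n')
print(pick_splitter('semantic-text-splitter'))  # ('word joiners', '-')
```

Scanning in a fixed order means that a text is only split on punctuation once no newline, tab or other whitespace remains to split on.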

Benchmarks 📊

On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes semchunk 35.75 seconds to split every sample in NLTK's Gutenberg Corpus into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes semantic-text-splitter 1 minute and 50.5 seconds to chunk the same texts into 512-token-long chunks, meaning that semchunk finishes in 67.65% less time.
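The 67.65% figure can be reproduced from the two timings quoted above. The short calculation below uses only numbers from this section; the variable names are illustrative.

```python
semchunk_seconds = 35.75
semantic_text_splitter_seconds = 60 + 50.5  # 1 minute and 50.5 seconds
corpus_tokens = 3_001_260

# Share of semantic-text-splitter's runtime that semchunk saves.
time_saved_percent = (
    (semantic_text_splitter_seconds - semchunk_seconds)
    / semantic_text_splitter_seconds
    * 100
)
# The same comparison expressed as a ratio of runtimes.
speedup = semantic_text_splitter_seconds / semchunk_seconds
# Average throughput of semchunk over the whole corpus.
tokens_per_second = corpus_tokens / semchunk_seconds

print(f'{time_saved_percent:.2f}% less time')  # 67.65% less time
print(f'{speedup:.2f}x as fast')  # 3.09x as fast
print(f'{tokens_per_second:,.0f} tokens per second')
```

Note that the percentage is measured relative to semantic-text-splitter's runtime; expressed as a ratio of runtimes, semchunk is roughly three times as fast on this benchmark.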

The code used to benchmark semchunk and semantic-text-splitter is available here.

Licence 📄

This library is licensed under the MIT License.

Release history

Version	Release date	Notes
3.2.5	Oct 28, 2025
3.2.4	Oct 26, 2025
3.2.3	Aug 13, 2025
3.2.2	Jun 9, 2025
3.2.1	Mar 27, 2025
3.2.0	Mar 20, 2025
3.1.3	Mar 10, 2025
3.1.2	Mar 6, 2025
3.1.1	Feb 18, 2025
3.1.0	Feb 16, 2025
3.0.4	Feb 13, 2025
3.0.3	Feb 13, 2025
3.0.2	Feb 13, 2025	Yanked. Reason: Typo in README and tests.
3.0.1	Jan 10, 2025
3.0.0	Dec 31, 2024
2.2.2	Dec 17, 2024
2.2.1	Dec 17, 2024	Yanked
2.2.0	Jul 12, 2024
2.1.0	Jun 20, 2024
2.0.0	Jun 19, 2024
1.0.1	Jun 2, 2024
1.0.0	Jun 2, 2024
0.3.2	Jun 1, 2024
0.3.1	May 18, 2024
0.3.0	May 18, 2024
0.2.4	May 13, 2024
0.2.3	Mar 11, 2024
0.2.2	Feb 6, 2024
0.2.1	Nov 9, 2023
0.2.0	Nov 7, 2023
0.1.2	Nov 7, 2023
0.1.1	Nov 6, 2023	This version
0.1.0	Nov 5, 2023

Download files

Source Distribution

semchunk-0.1.1.tar.gz (7.0 kB)

Uploaded Nov 6, 2023 Source

Built Distribution

semchunk-0.1.1-py3-none-any.whl (6.3 kB)

Uploaded Nov 6, 2023 Python 3

File details

Details for the file semchunk-0.1.1.tar.gz.

File metadata

Download URL: semchunk-0.1.1.tar.gz
Upload date: Nov 6, 2023
Size: 7.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for semchunk-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`08817d37fc498d553317ba25b4b0c13cd755b36e116aacb17b02e3393ed4abbd`
MD5	`d70c8bd6ea1a5e0e26b23f4722d13872`
BLAKE2b-256	`8a78e431c69f83656fbd6da22f4a5c9aa3397b282e699a0d325b674d8483aeb1`

File details

Details for the file semchunk-0.1.1-py3-none-any.whl.

File metadata

Download URL: semchunk-0.1.1-py3-none-any.whl
Upload date: Nov 6, 2023
Size: 6.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for semchunk-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`571f06a33eca29f778de3f99bec7c01511a4136cbdb0b89a5bc4fd1f01f3e68d`
MD5	`ecd50a0e2dde2ee604c67426b50a360f`
BLAKE2b-256	`344b52fe230972fa718c565a7b5bf7d2c3a1738535cec9779eea11bd209facdb`
