A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
semchunk
semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Installation 📦
semchunk may be installed with pip:
pip install semchunk
Usage 👩‍💻
The code snippet below demonstrates how text can be chunked with semchunk:
>>> import semchunk
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> token_counter = lambda text: len(text.split()) # If using `tiktoken`, you may replace this with `token_counter = lambda text: len(tiktoken.encoding_for_model(model).encode(text))`.
>>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
Chunk
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]
chunk() splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.
text is the text to be chunked.
chunk_size is the maximum number of tokens a chunk may contain.
token_counter is a callable that takes a string and returns the number of tokens in it.
This function returns a list of chunks, each up to chunk_size tokens long, with any whitespace used to split the text removed.
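The token_counter contract described above (a string in, a token count out) can be illustrated with a few stand-in counters. This is only a sketch: the names word_counter, char_counter, word_and_punct_counter and fits are hypothetical helpers, not part of semchunk, and the chunk list passed to fits is copied from the usage example above rather than produced by calling the library.

```python
import re

def word_counter(text: str) -> int:
    """Count whitespace-delimited words (same idea as the usage example)."""
    return len(text.split())

def char_counter(text: str) -> int:
    """Treat every character as one token."""
    return len(text)

def word_and_punct_counter(text: str) -> int:
    """Count words and individual punctuation marks as separate tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))

def fits(chunks: list[str], chunk_size: int, token_counter) -> bool:
    """Hypothetical helper: check that no chunk exceeds chunk_size tokens."""
    return all(token_counter(chunk) <= chunk_size for chunk in chunks)

# Chunks copied from the usage example (chunk_size=2, word-based counter).
example_chunks = ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
print(fits(example_chunks, 2, word_counter))
```

Any of these counters could be passed as token_counter, since each takes a string and returns an integer. The choice of counter changes where chunk boundaries fall, so it should match whatever consumes the chunks downstream.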
Methodology 🔬
semchunk works by recursively splitting texts until all resulting chunks are no larger than a specified chunk size. In particular, it:
- Splits text using the most semantically meaningful splitter possible;
- Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
- Merges any chunks that are under the chunk size back together until the chunk size is reached; and
- Reattaches any non-whitespace splitters to the ends of chunks (barring the final chunk) if doing so does not bring chunks over the chunk size; otherwise, adds non-whitespace splitters as their own chunks.
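The split, recurse and merge steps above can be sketched in a few lines. This is a simplified illustration, not semchunk's actual implementation: the function name simple_chunk is hypothetical, the splitter choice is reduced to "longest run of whitespace, else individual characters", and the reattachment of non-whitespace splitters is left out entirely.

```python
import re

def simple_chunk(text, chunk_size, token_counter):
    """Simplified recursive chunker (assumption: whitespace splitters only)."""
    # Base case: the text already fits, or cannot be split any further.
    if len(text) <= 1 or token_counter(text) <= chunk_size:
        return [text] if text else []

    # Step 1: split on the longest whitespace run, else into characters.
    runs = re.findall(r"\s+", text)
    if runs:
        splitter = max(runs, key=len)
        parts = [part for part in text.split(splitter) if part]
    else:
        splitter = ""
        parts = list(text)

    chunks = []
    current = None  # run of small parts currently being merged
    for part in parts:
        if token_counter(part) > chunk_size:
            # Step 2: the part is too large, so flush and recurse into it.
            if current is not None:
                chunks.append(current)
                current = None
            chunks.extend(simple_chunk(part, chunk_size, token_counter))
        else:
            # Step 3: merge small parts until the chunk size is reached.
            candidate = part if current is None else current + splitter + part
            if current is None or token_counter(candidate) <= chunk_size:
                current = candidate
            else:
                chunks.append(current)
                current = part
    if current is not None:
        chunks.append(current)
    return chunks

word_counter = lambda text: len(text.split())
print(simple_chunk('The quick brown fox jumps over the lazy dog.', 2, word_counter))
```

On the sentence from the usage example, this sketch yields the same five chunks as shown there. Merging only happens between sibling parts at the same recursion level, so merged chunks are always rejoined with the splitter that originally separated them.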
To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:
- The largest sequence of newlines (\n) and/or carriage returns (\r);
- The largest sequence of tabs;
- The largest sequence of whitespace characters (as defined by regex's \s character class);
- Sentence terminators (., ?, ! and *);
- Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
- Sentence interrupters (:, — and …);
- Word joiners (/, \, –, & and -); and
- All other characters.
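The precedence order above can be expressed as a small lookup routine. The following is a sketch under assumptions, not semchunk's own code: pick_splitter and NON_WHITESPACE_SPLITTERS are hypothetical names, and returning an empty string to mean "split into individual characters" is a convention chosen for this example.

```python
import re

# Non-whitespace splitters in the order listed above.
NON_WHITESPACE_SPLITTERS = (
    ".", "?", "!", "*",                                   # sentence terminators
    ";", ",", "(", ")", "[", "]",                         # clause separators
    "“", "”", "‘", "’", "'", '"', "`",                    # clause separators (quotes)
    ":", "—", "…",                                        # sentence interrupters
    "/", "\\", "–", "&", "-",                             # word joiners
)

def pick_splitter(text: str) -> tuple[str, bool]:
    """Return (splitter, is_whitespace) for the highest-precedence splitter found."""
    # Whitespace levels: newlines/carriage returns, then tabs, then any whitespace.
    for pattern in (r"[\r\n]+", r"\t+", r"\s+"):
        runs = re.findall(pattern, text)
        if runs:
            # Use the largest sequence found at this level.
            return max(runs, key=len), True
    # Non-whitespace levels, checked in order of precedence.
    for splitter in NON_WHITESPACE_SPLITTERS:
        if splitter in text:
            return splitter, False
    # Fallback: an empty splitter stands for "all other characters".
    return "", False

print(pick_splitter("First paragraph.\n\nSecond paragraph, with a clause."))
```

The is_whitespace flag matters because whitespace splitters are dropped from the output, whereas non-whitespace splitters are reattached to chunks or emitted as chunks of their own.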
Licence 📄
This library is licensed under the MIT License.