Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models).

Project description

semantic-text-splitter

Large language models (LLMs) can be used for many tasks, but they often have a limited context size that may be smaller than the documents you want to use. To work with longer documents, you often have to split your text into chunks that fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to fill each chunk as close to the desired size as possible while still splitting at semantically sensible boundaries.

Get Started

By Number of Characters

from semantic_text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, tell the splitter not to trim whitespace from chunks
splitter = CharacterTextSplitter(trim_chunks=False)

chunks = splitter.chunks("your document text", max_characters)

With Huggingface Tokenizer

from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally, tell the splitter not to trim whitespace from chunks
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

With Tiktoken Tokenizer

from semantic_text_splitter import TiktokenTextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
# Optionally, tell the splitter not to trim whitespace from chunks
splitter = TiktokenTextSplitter("gpt-3.5-turbo", trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.

Once a chunk has reached a length that falls within the range, it will be returned.

A chunk may still be returned that is shorter than the start of the range, because adding the next piece of text would have pushed it past the end of the range.

from semantic_text_splitter import CharacterTextSplitter

# By default, the splitter trims whitespace from chunks
splitter = CharacterTextSplitter()

# Maximum number of characters in a chunk. Will fill up the
# chunk until it is somewhere in this range.
chunks = splitter.chunks("your document text", chunk_capacity=(200, 1000))
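The range semantics above can be sketched as a greedy fill. This is a toy illustration, not the crate's actual implementation, and `fill_chunks` is a hypothetical helper operating on pre-split sections:

```python
def fill_chunks(sections, start, end):
    """Toy sketch of range chunk capacity: accumulate sections and
    emit a chunk as soon as its length falls within [start, end]."""
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > end:
            # Adding the next section would exceed the end of the range,
            # so the current chunk is emitted even if shorter than `start`.
            chunks.append(current)
            current = section
        else:
            current += section
            if len(current) >= start:
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks
```

For example, with sections of lengths 3, 4, and 2 and a range of (4, 6), the first chunk is emitted at length 3 (below the start) because adding the next section would overflow the range.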

Method

To preserve as much semantic meaning within a chunk as possible, a recursive approach is used, starting at larger semantic units and, if that is too large, breaking it up into the next largest unit. Here is an example of the steps used:

  1. Split the text by a given level
  2. For each section, does it fit within the chunk size?
    • Yes. Merge as many of these neighboring sections into a chunk as possible to maximize chunk length.
    • No. Split by the next level and repeat.

The boundaries used to split the text when using the top-level chunks method, from the coarsest semantic level to the finest:

  1. Descending sequence length of newlines. (A newline is \r\n, \n, or \r.) Each unique length of consecutive newline sequences is treated as its own semantic level.
  2. Unicode Sentence Boundaries
  3. Unicode Word Boundaries
  4. Unicode Grapheme Cluster Boundaries
  5. Characters

Splitting doesn't occur below the character level; otherwise you could end up with partial bytes of a character, which would not be a valid Unicode string.
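The recursive split-and-merge described above can be sketched in Python. This is a toy illustration: the regexes are crude stand-ins for the newline and Unicode segmentation rules the crate actually uses, and it does not trim whitespace.

```python
import re

def chunk_text(text, max_chars):
    """Toy sketch of the recursive approach: try paragraphs, then
    crude sentence breaks, then words, then characters."""
    levels = [
        lambda t: re.split(r"(?<=\n\n)", t),      # paragraph breaks
        lambda t: re.split(r"(?<=[.!?]) ", t),    # crude sentence breaks
        lambda t: re.split(r"(?<=\s)", t),        # word-ish breaks
        lambda t: list(t),                        # characters
    ]

    def split(t, level):
        if len(t) <= max_chars:
            return [t]
        if level >= len(levels):
            # Below the character level, just hard-slice.
            return [t[i:i + max_chars] for i in range(0, len(t), max_chars)]
        out = []
        for section in levels[level](t):
            if len(section) > max_chars:
                # Section too large: split by the next level and repeat.
                out.extend(split(section, level + 1))
            elif out and len(out[-1]) + len(section) <= max_chars:
                # Merge neighboring sections to maximize chunk length.
                out[-1] += section
            else:
                out.append(section)
        return out

    return split(text, 0)
```

Every chunk stays within the size limit, and because the word-level split keeps trailing whitespace attached, concatenating the chunks reproduces the original text.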

Note on sentences: there are many methods of determining sentence breaks, all with varying degrees of accuracy, and many requiring ML models. Rather than trying to find the perfect sentence breaks, we rely on the Unicode sentence-boundary rules, which in most cases are good enough for finding a decent semantic breaking point when a paragraph is too large, and which avoid the performance penalties of many other methods.

Inspiration

This crate was inspired by LangChain's TextSplitter, but a look into that implementation suggested there was potential for both better performance and better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic_text_splitter-0.2.2.tar.gz (226.5 kB)

Uploaded: Source

Built Distributions

semantic_text_splitter-0.2.2-cp37-abi3-win_amd64.whl (3.0 MB)

Uploaded: CPython 3.7+, Windows x86-64

semantic_text_splitter-0.2.2-cp37-abi3-win32.whl (2.9 MB)

Uploaded: CPython 3.7+, Windows x86

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), x86-64

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (4.8 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), s390x

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (4.8 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), ppc64le

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (4.3 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), ARMv7l

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (4.4 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), ARM64

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_12_i686.manylinux2010_i686.whl (4.5 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.12+), i686

semantic_text_splitter-0.2.2-cp37-abi3-macosx_11_0_arm64.whl (3.3 MB)

Uploaded: CPython 3.7+, macOS 11.0+, ARM64

semantic_text_splitter-0.2.2-cp37-abi3-macosx_10_7_x86_64.whl (3.4 MB)

Uploaded: CPython 3.7+, macOS 10.7+, x86-64

File details

Details for the file semantic_text_splitter-0.2.2.tar.gz.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2.tar.gz
Algorithm Hash digest
SHA256 efe1e7b9638315729254e700cf5c174342a689ec3bab667ea250128631357a7c
MD5 fa6ccf423c1103582e1981f708bbffd8
BLAKE2b-256 a75a6eeb0369f2488d69b0bbb1dcfca116a405f6dccb89b2dea98e1947c1ef83

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 c79bac326cd7db47a1c69dadd3c6354ec721d9191c0d98c997a70177e39727ea
MD5 70f5db89d546356673eb88643dd1c3d9
BLAKE2b-256 3a82db35cc70ea3a9dd69841de1fa5e48d2108141e5abf112afacd905422ee1e

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-win32.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-win32.whl
Algorithm Hash digest
SHA256 13cbefba3b33bcdbf7bd7b89121727c70bd7800793f13e16a819ed8f52587fec
MD5 35a11d7c6b3cf2e4c7925535e885f535
BLAKE2b-256 26c4529359b3bd9bb5edcdf9f6a49fccc80be4dca2051dd0cc7a74e4145fde1e

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 85efe783416bcdd426db049bfc99820ed0f760cc1f3a97c40692750b005b1aca
MD5 917dfacb0d8b971d1e519301e47ad9d1
BLAKE2b-256 0bae91e5e9b12be80f5c8adc8d922d3e585ffa55f36e748762f2d33ce0711efe

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 d3ac9843dee51a56ee178f437486194a2d74c0b83b82596fe2fe0057030adc99
MD5 6acd552596810e8566a13d03733a141f
BLAKE2b-256 83270dfe80e09cbb72a0cc5ae3f05bf0011718bc94d1f1a0cd540a475a2e2836

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 923ebc9fb3a74fd3dd8cd188018662b8cef269179f3ee4a40c17d77a40db0eaa
MD5 38b64dab48915a35c0bb14d8853712cc
BLAKE2b-256 19521cd9e3910f791fd6c26a6382e7c3027228a1a409cf30584ab1f6a4f7e140

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 2b640e6b10da83cf8394d23d3ee7294e05e801147eb0f5b65b770b335d71269a
MD5 2af537264452d59b7abd570bcb565769
BLAKE2b-256 e09154f8f70d816e95abdd01a6a658a642ed645e37297bfd701c2441abc8dec5

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 527f6c214d0f0f714bbe0422368ecf1b7cadbbe96856d64b79e4102057d29af8
MD5 a84151032f123c86ea78af34a8a8dee8
BLAKE2b-256 80d8ef690fd17f9e5440bd07383a6ea9e61d0db5dcdac393d48f8d1d89cff5ef

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_12_i686.manylinux2010_i686.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_12_i686.manylinux2010_i686.whl
Algorithm Hash digest
SHA256 adede2e002934dab5004c65dc35479eb48c082b4a7e4446eb7c4ed7533c579fc
MD5 f96163439da2647d95b7864e2520e605
BLAKE2b-256 49360456e28718f23c7fe1549910582330498157cf2b69ca18be29428af8d26f

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4d73cdef3a7880c2051597c062d6b90c45b1a113b59aa62b4aabfb60e717b507
MD5 6b1cb4a9c8b1ce1b452e0f33f6bd0c5b
BLAKE2b-256 d1c6fd17feb7aca4a4b2fbe8f33eae85568ef8e85ff1ed5fdd9b500f4ee5669e

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 452b993ee4cc756ba68059dad8115063252badf88fc37bd92240cc216d7ce843
MD5 84877f4093b7ef147e9e076b297f7974
BLAKE2b-256 180f148f0105ca0ff6d8c9048194454e1907a7f5887ce9b4edb7d87b656051f3
