Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models).

Project description

semantic-text-splitter

Large language models (LLMs) can be used for many tasks, but they often have a limited context size that may be smaller than the documents you want to use. To work with longer documents, you often have to split your text into chunks that fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to fill each chunk as close to the desired size as possible while still splitting at semantically sensible boundaries.

Get Started

By Number of Characters

from semantic_text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, tell the splitter not to trim whitespace from chunks
splitter = CharacterTextSplitter(trim_chunks=False)

chunks = splitter.chunks("your document text", max_characters)

With Huggingface Tokenizer

from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally, tell the splitter not to trim whitespace from chunks
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

With Tiktoken Tokenizer

from semantic_text_splitter import TiktokenTextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
# Optionally, tell the splitter not to trim whitespace from chunks
splitter = TiktokenTextSplitter("gpt-3.5-turbo", trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.

Once a chunk has reached a length that falls within the range, it will be returned.

A chunk may still be returned that is shorter than the start of the range, because adding the next piece of text would have pushed it past the end of the range.

from semantic_text_splitter import CharacterTextSplitter

# By default, the splitter trims whitespace from chunks
splitter = CharacterTextSplitter()

# Maximum number of characters in a chunk. Will fill up the
# chunk until it is somewhere in this range.
chunks = splitter.chunks("your document text", chunk_capacity=(200, 1000))
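The range semantics above can be sketched as a greedy fill. This is a toy illustration, not the crate's actual implementation, and `fill_chunks` is a hypothetical helper operating on pre-split sections:

```python
def fill_chunks(sections, start, end):
    """Toy sketch of range chunk capacity: accumulate sections and
    emit a chunk as soon as its length falls within [start, end]."""
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > end:
            # Adding the next section would exceed the end of the range,
            # so the current chunk is emitted even if shorter than `start`.
            chunks.append(current)
            current = section
        else:
            current += section
            if len(current) >= start:
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks
```

For example, with sections of lengths 3, 4, and 2 and a range of (4, 6), the first chunk is emitted at length 3 (below the start) because adding the next section would overflow the range.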

Method

To preserve as much semantic meaning within a chunk as possible, a recursive approach is used, starting at larger semantic units and, if that is too large, breaking it up into the next largest unit. Here is an example of the steps used:

  1. Split the text by a given level
  2. For each section, does it fit within the chunk size?
    • Yes. Merge as many of these neighboring sections into a chunk as possible to maximize chunk length.
    • No. Split by the next level and repeat.

The boundaries used to split the text when using the top-level chunks method, from the coarsest semantic level to the finest:

  1. Descending sequence length of newlines. (A newline is \r\n, \n, or \r.) Each unique length of consecutive newline sequences is treated as its own semantic level.
  2. Unicode Sentence Boundaries
  3. Unicode Word Boundaries
  4. Unicode Grapheme Cluster Boundaries
  5. Characters

Splitting doesn't occur below the character level; otherwise you could end up with partial bytes of a character, which would not be a valid Unicode string.
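The recursive split-and-merge described above can be sketched in Python. This is a toy illustration: the regexes are crude stand-ins for the newline and Unicode segmentation rules the crate actually uses, and it does not trim whitespace.

```python
import re

def chunk_text(text, max_chars):
    """Toy sketch of the recursive approach: try paragraphs, then
    crude sentence breaks, then words, then characters."""
    levels = [
        lambda t: re.split(r"(?<=\n\n)", t),      # paragraph breaks
        lambda t: re.split(r"(?<=[.!?]) ", t),    # crude sentence breaks
        lambda t: re.split(r"(?<=\s)", t),        # word-ish breaks
        lambda t: list(t),                        # characters
    ]

    def split(t, level):
        if len(t) <= max_chars:
            return [t]
        if level >= len(levels):
            # Below the character level, just hard-slice.
            return [t[i:i + max_chars] for i in range(0, len(t), max_chars)]
        out = []
        for section in levels[level](t):
            if len(section) > max_chars:
                # Section too large: split by the next level and repeat.
                out.extend(split(section, level + 1))
            elif out and len(out[-1]) + len(section) <= max_chars:
                # Merge neighboring sections to maximize chunk length.
                out[-1] += section
            else:
                out.append(section)
        return out

    return split(text, 0)
```

Every chunk stays within the size limit, and because the word-level split keeps trailing whitespace attached, concatenating the chunks reproduces the original text.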

Note on sentences: there are many methods of determining sentence breaks, all with varying degrees of accuracy, and many requiring ML models. Rather than trying to find the perfect sentence breaks, we rely on the Unicode sentence-boundary rules, which in most cases are good enough for finding a decent semantic breaking point when a paragraph is too large, and which avoid the performance penalties of many other methods.

Inspiration

This crate was inspired by LangChain's TextSplitter, but a look into that implementation suggested there was potential for both better performance and better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic_text_splitter-0.2.2.tar.gz (226.5 kB)

Uploaded: Source

Built Distributions

semantic_text_splitter-0.2.2-cp37-abi3-win_amd64.whl (3.0 MB)

Uploaded: CPython 3.7+, Windows x86-64

semantic_text_splitter-0.2.2-cp37-abi3-win32.whl (2.9 MB)

Uploaded: CPython 3.7+, Windows x86

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), x86-64

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (4.8 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), s390x

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (4.8 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), ppc64le

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (4.3 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), ARMv7l

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (4.4 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.17+), ARM64

semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_12_i686.manylinux2010_i686.whl (4.5 MB)

Uploaded: CPython 3.7+, manylinux (glibc 2.12+), i686

semantic_text_splitter-0.2.2-cp37-abi3-macosx_11_0_arm64.whl (3.3 MB)

Uploaded: CPython 3.7+, macOS 11.0+, ARM64

semantic_text_splitter-0.2.2-cp37-abi3-macosx_10_7_x86_64.whl (3.4 MB)

Uploaded: CPython 3.7+, macOS 10.7+, x86-64

File details

Details for the file semantic_text_splitter-0.2.2.tar.gz.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2.tar.gz
Algorithm Hash digest
SHA256 efe1e7b9638315729254e700cf5c174342a689ec3bab667ea250128631357a7c
MD5 fa6ccf423c1103582e1981f708bbffd8
BLAKE2b-256 a75a6eeb0369f2488d69b0bbb1dcfca116a405f6dccb89b2dea98e1947c1ef83

See more details on using hashes here.

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 c79bac326cd7db47a1c69dadd3c6354ec721d9191c0d98c997a70177e39727ea
MD5 70f5db89d546356673eb88643dd1c3d9
BLAKE2b-256 3a82db35cc70ea3a9dd69841de1fa5e48d2108141e5abf112afacd905422ee1e

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-win32.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-win32.whl
Algorithm Hash digest
SHA256 13cbefba3b33bcdbf7bd7b89121727c70bd7800793f13e16a819ed8f52587fec
MD5 35a11d7c6b3cf2e4c7925535e885f535
BLAKE2b-256 26c4529359b3bd9bb5edcdf9f6a49fccc80be4dca2051dd0cc7a74e4145fde1e

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 85efe783416bcdd426db049bfc99820ed0f760cc1f3a97c40692750b005b1aca
MD5 917dfacb0d8b971d1e519301e47ad9d1
BLAKE2b-256 0bae91e5e9b12be80f5c8adc8d922d3e585ffa55f36e748762f2d33ce0711efe

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 d3ac9843dee51a56ee178f437486194a2d74c0b83b82596fe2fe0057030adc99
MD5 6acd552596810e8566a13d03733a141f
BLAKE2b-256 83270dfe80e09cbb72a0cc5ae3f05bf0011718bc94d1f1a0cd540a475a2e2836

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 923ebc9fb3a74fd3dd8cd188018662b8cef269179f3ee4a40c17d77a40db0eaa
MD5 38b64dab48915a35c0bb14d8853712cc
BLAKE2b-256 19521cd9e3910f791fd6c26a6382e7c3027228a1a409cf30584ab1f6a4f7e140

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 2b640e6b10da83cf8394d23d3ee7294e05e801147eb0f5b65b770b335d71269a
MD5 2af537264452d59b7abd570bcb565769
BLAKE2b-256 e09154f8f70d816e95abdd01a6a658a642ed645e37297bfd701c2441abc8dec5

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 527f6c214d0f0f714bbe0422368ecf1b7cadbbe96856d64b79e4102057d29af8
MD5 a84151032f123c86ea78af34a8a8dee8
BLAKE2b-256 80d8ef690fd17f9e5440bd07383a6ea9e61d0db5dcdac393d48f8d1d89cff5ef

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_12_i686.manylinux2010_i686.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-manylinux_2_12_i686.manylinux2010_i686.whl
Algorithm Hash digest
SHA256 adede2e002934dab5004c65dc35479eb48c082b4a7e4446eb7c4ed7533c579fc
MD5 f96163439da2647d95b7864e2520e605
BLAKE2b-256 49360456e28718f23c7fe1549910582330498157cf2b69ca18be29428af8d26f

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4d73cdef3a7880c2051597c062d6b90c45b1a113b59aa62b4aabfb60e717b507
MD5 6b1cb4a9c8b1ce1b452e0f33f6bd0c5b
BLAKE2b-256 d1c6fd17feb7aca4a4b2fbe8f33eae85568ef8e85ff1ed5fdd9b500f4ee5669e

File details

Details for the file semantic_text_splitter-0.2.2-cp37-abi3-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for semantic_text_splitter-0.2.2-cp37-abi3-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 452b993ee4cc756ba68059dad8115063252badf88fc37bd92240cc216d7ce843
MD5 84877f4093b7ef147e9e076b297f7974
BLAKE2b-256 180f148f0105ca0ff6d8c9048194454e1907a7f5887ce9b4edb7d87b656051f3
