
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

Project description

semantic-text-splitter


Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

from semantic_text_splitter import TextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally can also have the splitter not trim whitespace for you
splitter = TextSplitter()
# splitter = TextSplitter(trim_chunks=False)

chunks = splitter.chunks("your document text", max_characters)

Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.

Once a chunk has reached a length that falls within the range, it will be returned.

It is always possible that a chunk may be returned that is less than the start value, as adding the next piece of text may have made it larger than the end capacity.

from semantic_text_splitter import TextSplitter

splitter = TextSplitter()

# Target range of characters per chunk. The chunk will be filled
# until its length falls somewhere in this range.
chunks = splitter.chunks("your document text", chunk_capacity=(200,1000))
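The fill behavior described above can be sketched in plain Python (a minimal illustration, not the library's actual implementation; `pieces` is a hypothetical list of pre-split semantic sections, measured in characters):

```python
def build_chunk(pieces, start, end):
    # Greedily append sections until the chunk's length lands in [start, end].
    chunk = ""
    for piece in pieces:
        if len(chunk + piece) > end:
            # The next piece would exceed the end capacity, so the chunk is
            # returned as-is, even if it is still shorter than `start`.
            break
        chunk += piece
        if len(chunk) >= start:
            break
    return chunk
```

This shows why a returned chunk can be shorter than the start of the range: adding the next section would have pushed it past the end capacity.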

Using a Hugging Face Tokenizer

from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer)

chunks = splitter.chunks("your document text", max_tokens)

Using a Tiktoken Tokenizer

from semantic_text_splitter import TextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo")

chunks = splitter.chunks("your document text", max_tokens)

Using a Custom Callback

from semantic_text_splitter import TextSplitter

# Optionally can also have the splitter not trim whitespace for you
splitter = TextSplitter.from_callback(lambda text: len(text))

# Target range of tokens (as measured by the callback) per chunk.
# The chunk will be filled until its length falls somewhere in this range.
chunks = splitter.chunks("your document text", chunk_capacity=(200,1000))
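Any callable taking a string and returning an integer works as the length callback. For instance, a hypothetical callback could measure length in whitespace-separated words rather than characters:

```python
def word_count(text):
    # Hypothetical length callback: counts whitespace-separated words
    # instead of characters.
    return len(text.split())

# splitter = TextSplitter.from_callback(word_count)
```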

Markdown

All of the above examples can also work with Markdown text. You can use the MarkdownSplitter in the same ways as the TextSplitter.

from semantic_text_splitter import MarkdownSplitter
# Default implementation uses character count for chunk size.
# Can also use all of the same tokenizer implementations as `TextSplitter`.
splitter = MarkdownSplitter()

splitter.chunks("# Header\n\nyour document text", 1000)

Method

To preserve as much semantic meaning within a chunk as possible, each chunk is composed of the largest semantic units that can fit in the next given chunk. For each splitter type, there is a defined set of semantic levels. Here is an example of the steps used:

  1. Split the text at increasing semantic levels.
  2. Check the first item for each level and select the highest level whose first item still fits within the chunk size.
  3. Merge as many of these neighboring sections of this level or above into a chunk to maximize chunk length. Boundaries of higher semantic levels are always included when merging, so that the chunk doesn't inadvertently cross semantic boundaries.
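The level-selection step (step 2 above) can be sketched as follows. This is a minimal illustration, not the crate's actual implementation; `splits_by_level` is a hypothetical list giving the text pre-split at each ascending semantic level:

```python
def pick_level(splits_by_level, capacity, length=len):
    # Check the first item at each ascending semantic level and keep the
    # highest level whose first item still fits within the chunk capacity.
    best = 0
    for level, splits in enumerate(splits_by_level):
        if splits and length(splits[0]) <= capacity:
            best = level
    return best
```

The real splitter then merges neighboring sections at or above the chosen level (step 3) until the chunk capacity is reached.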

The boundaries used to split the text when using the chunks method, in ascending order:

TextSplitter Semantic Levels

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Ascending sequence length of newlines (a newline is \r\n, \n, or \r). Each unique length of consecutive newline sequences is treated as its own semantic level, so a sequence of 2 newlines is a higher level than a sequence of 1 newline, and so on.

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a char, which may not be a valid Unicode str.
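The newline-based levels can be illustrated with a small sketch (an assumption for illustration: the levels present in a text are the distinct run lengths of consecutive newlines it contains):

```python
import re

def newline_levels(text):
    # Find every run of consecutive newlines; each distinct run length
    # (1 newline, 2 newlines, ...) forms its own ascending semantic level.
    runs = re.findall(r"(?:\r\n|\r|\n)+", text)
    lengths = {len(re.findall(r"\r\n|\r|\n", run)) for run in runs}
    return sorted(lengths)
```

A text with single line breaks and blank-line paragraph breaks would yield two levels, with the paragraph breaks being the higher one.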

MarkdownSplitter Semantic Levels

Markdown is parsed according to the CommonMark spec, along with some optional features such as GitHub Flavored Markdown.

  1. Characters
  2. Unicode Grapheme Cluster Boundaries
  3. Unicode Word Boundaries
  4. Unicode Sentence Boundaries
  5. Soft line breaks (single newlines), which aren't necessarily a new element in Markdown.
  6. Inline elements such as: text nodes, emphasis, strong, strikethrough, link, image, table cells, inline code, footnote references, task list markers, and inline html.
  7. Block elements such as: paragraphs, code blocks, footnote definitions, and metadata. Also, a block quote or a row/item within a table or list that can contain other "block" type elements, and a list or table that contains items.
  8. Thematic breaks or horizontal rules.
  9. Headings by level

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a char, which may not be a valid Unicode str.
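The highest level, headings, can be sketched with a regex over ATX headings (an illustration only; the crate itself parses Markdown according to the CommonMark spec rather than with regexes, and `max_level` is a hypothetical parameter):

```python
import re

def split_at_headings(md, max_level):
    # Split before each ATX heading of `max_level` or above
    # (an h1 is a higher semantic boundary than an h2, and so on).
    pattern = rf"(?m)^(?=#{{1,{max_level}}}\s)"
    return [part for part in re.split(pattern, md) if part]
```

Splitting at level 1 keeps each top-level section, including its own sub-headings, together in one piece.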

Note on sentences

There are many methods for determining sentence breaks, all with varying degrees of accuracy, and many require ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on Unicode sentence boundary rules, which in most cases are good enough for finding a decent semantic breaking point if a paragraph is too large, and which avoid the performance penalties of many other methods.

Inspiration

This crate was inspired by LangChain's TextSplitter, but looking into the implementation, there was potential for better performance as well as better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.

Download files

Download the file for your platform.

Source Distribution

semantic_text_splitter-0.10.0.tar.gz (261.4 kB): Source

Built Distributions

semantic_text_splitter-0.10.0-cp38-abi3-win_amd64.whl (3.3 MB): CPython 3.8+, Windows x86-64
semantic_text_splitter-0.10.0-cp38-abi3-win32.whl (3.2 MB): CPython 3.8+, Windows x86
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB): CPython 3.8+, manylinux: glibc 2.17+ x86-64
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (5.0 MB): CPython 3.8+, manylinux: glibc 2.17+ s390x
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (5.0 MB): CPython 3.8+, manylinux: glibc 2.17+ ppc64le
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (4.5 MB): CPython 3.8+, manylinux: glibc 2.17+ ARMv7l
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (4.6 MB): CPython 3.8+, manylinux: glibc 2.17+ ARM64
semantic_text_splitter-0.10.0-cp38-abi3-manylinux_2_12_i686.manylinux2010_i686.whl (4.7 MB): CPython 3.8+, manylinux: glibc 2.12+ i686
semantic_text_splitter-0.10.0-cp38-abi3-macosx_11_0_arm64.whl (3.5 MB): CPython 3.8+, macOS 11.0+ ARM64
semantic_text_splitter-0.10.0-cp38-abi3-macosx_10_12_x86_64.whl (3.6 MB): CPython 3.8+, macOS 10.12+ x86-64
