A better way to split (chunk/group) your text before inserting it into an LLM or Vector DB.

Project description

Semantic-Split

A Python library to chunk/group your text based on semantic similarity - ideal for pre-processing data for Language Models or Vector Databases. Leverages SentenceTransformers and spaCy.

Why?

  1. Better Context: Providing more relevant context to your prompts enhances the LLM's performance (arXiv:2005.14165 [cs.CL]). Semantic-Split groups related sentences together, ensuring your prompts have relevant context.

  2. Improved Results: Short, precise prompts often yield the best results from LLMs (arXiv:2004.04906 [cs.CL]). By grouping semantically similar sentences, Semantic-Split helps you craft such efficient prompts.

  3. Cost Savings: LLMs like GPT-4 charge per token and have a context limit (e.g., 8K tokens). With Semantic-Split, you can make your prompts shorter and more meaningful, leading to potential cost savings.

Real-world example:

Imagine you're building an application where users ask questions about articles:

  • A. We want to include only the parts of the article that are relevant to the query (for better results).
  • B. We want to be able to query the article quickly (pre-processing).
  1. We pre-process the article so each query is fast (point B): we split it into semantic chunks using semantic-split and store them in a Vector DB as embeddings.
  2. Each time the user asks a question, we compute the embedding of their question and find the top 3 similar chunks in our Vector DB (point A).
  3. We add those 3 chunks to our prompt to get better answers to our user's questions (see the sketch below).

As you can see, step 1, the semantic splitting (grouping) of sentences, is crucial. If we don't split or group the sentences semantically, we risk losing essential information, which makes the Vector DB worse at surfacing the most suitable chunks. That in turn leaves us with poorer context for our prompts, hurting the quality of our responses.
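
A minimal sketch of this pipeline, assuming the semantic-split API shown in the example below, sentence-transformers for the embeddings, and a plain in-memory cosine-similarity search standing in for a real Vector DB (the model name and sample texts are illustrative assumptions):

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from semantic_split import SimilarSentenceSplitter, SentenceTransformersSimilarity, SpacySentenceSplitter

    article_text = """
      Dogs are amazing. Cats must be the easiest pets around.
      Robots are advanced now with AI.
      Flying in space can only be done by Artificial intelligence."""

    # 1. Pre-process: split the article into semantic chunks.
    splitter = SimilarSentenceSplitter(SentenceTransformersSimilarity(), SpacySentenceSplitter())
    chunks = [" ".join(group) for group in splitter.split(article_text)]

    # Embed each chunk; a real application would store these in a Vector DB.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)

    # 2. At query time, embed the question and find the top 3 similar chunks.
    question = "Which pets are easy to keep?"
    query_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec  # cosine similarity, since vectors are normalized
    top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:3]]

    # 3. Add those chunks to the prompt as context.
    prompt = "Context:\n" + "\n".join(top_chunks) + "\n\nQuestion: " + question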

Install

  1. To use most of the functionality you will need to install some prerequisites
  2. spaCy en_core_web_sm model: python -m spacy download en_core_web_sm
  3. poetry install
  4. See the examples below
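
Once these steps complete, you can sanity-check the spaCy prerequisite with a minimal sketch (the sample text is illustrative):

    import spacy

    # Raises OSError if en_core_web_sm was not downloaded.
    nlp = spacy.load("en_core_web_sm")
    print(list(nlp("Dogs are amazing. Robots are advanced now.").sents))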

Examples

Sentence Split by Semantic Similarity

    from semantic_split import SimilarSentenceSplitter, SentenceTransformersSimilarity, SpacySentenceSplitter

    text = """
      Dogs are amazing.
      Cats must be the easiest pets around.
      Robots are advanced now with AI.
      Flying in space can only be done by Artificial intelligence."""

    model = SentenceTransformersSimilarity()     # scores semantic similarity between sentences
    sentence_splitter = SpacySentenceSplitter()  # splits the raw text into sentences
    splitter = SimilarSentenceSplitter(model, sentence_splitter)
    res = splitter.split(text)

Result:

[["I dogs are amazing.", "Cats must be the easiest pets around."],
["Robots are advanced now with AI.", "Flying in space can only be done by Artificial intelligence."]]

Tests

poetry run pytest

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic_split-0.1.0.tar.gz (3.3 kB)

Uploaded Source

Built Distributions

semantic_split-0.1.0-py3-none-any.whl (4.6 kB)

Uploaded Python 3

semantic_split-0.1.0-py2.py3-none-any.whl (5.0 kB)

Uploaded Python 2 Python 3

File details

Details for the file semantic_split-0.1.0.tar.gz.

File metadata

  • Download URL: semantic_split-0.1.0.tar.gz
  • Upload date:
  • Size: 3.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.2 CPython/3.10.6 Linux/5.15.90.1-microsoft-standard-WSL2

File hashes

Hashes for semantic_split-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0ce362864b34d4642f88e2bf45b05e17fb7ebfb9555cb6e7d0703e2cb3512c43
MD5 371af636b5d866ec38dfe015e090966f
BLAKE2b-256 def9abaf967304741740c217e23d3a6adcee8b2353527db82ce0bb1e25d9b231

See more details on using hashes here.
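
If you want to verify a downloaded file against the digests above yourself, a minimal sketch using Python's standard hashlib (the local file path is an assumption):

    import hashlib

    # Published SHA256 for semantic_split-0.1.0.tar.gz (from the table above).
    expected = "0ce362864b34d4642f88e2bf45b05e17fb7ebfb9555cb6e7d0703e2cb3512c43"

    with open("semantic_split-0.1.0.tar.gz", "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    assert digest == expected, "hash mismatch: do not install this file"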

File details

Details for the file semantic_split-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: semantic_split-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.2 CPython/3.10.6 Linux/5.15.90.1-microsoft-standard-WSL2

File hashes

Hashes for semantic_split-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c02eedbec9da7c510b9e0ad42f6aa117bb4077e7d4ffe491e3d8940d534774e1
MD5 33171dbd676dc7429c6fa7bf584d8607
BLAKE2b-256 ae0a61798449f783fa2c6d0c05a39150b3957146a2999b5fab8130ca6de4c6e1

See more details on using hashes here.

File details

Details for the file semantic_split-0.1.0-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for semantic_split-0.1.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 6e6bd8168ac6574805f5e7aac37274cd73c94a568107a07e9d7ac129948978b7
MD5 46767a4cb7174cc94324d5b6e9292329
BLAKE2b-256 e7d7db87e57d4fbc42586a35739da07444ca76c480337fa096fe9ec50781259d

See more details on using hashes here.
