Smart-Chunker is a semantic chunker that prepares long documents for RAG

Project description

Smart-Chunker is a semantic chunker that prepares long documents for retrieval-augmented generation (RAG).

Unlike a conventional chunker, it does not split the text into fixed-size groups of N tokens. Instead, it uses a cross-encoder to score the similarity between each pair of neighboring sentences and splits the text at the most significant semantic transitions, i.e. at the minima of that similarity function.
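The idea of splitting at similarity minima can be sketched in a few lines. This is a minimal illustration and not the package's actual implementation: the function names `local_minima` and `split_at_minima`, the `max_boundaries` parameter, and the similarity values in the usage example are all hypothetical, and the scores are assumed to have been computed already (one score per pair of adjacent sentences).

```python
from typing import List


def local_minima(scores: List[float]) -> List[int]:
    """Return indices whose score is strictly lower than both neighbors.

    A missing neighbor at either end is treated as +infinity, so the first
    and last scores can also qualify as minima.
    """
    minima = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i < len(scores) - 1 else float("inf")
        if score < left and score < right:
            minima.append(i)
    return minima


def split_at_minima(
    sentences: List[str], scores: List[float], max_boundaries: int
) -> List[List[str]]:
    """Split sentences at the deepest local minima of the similarity scores.

    scores[i] is the similarity between sentences[i] and sentences[i + 1],
    so a boundary at index i falls between those two sentences.
    """
    if len(scores) != len(sentences) - 1:
        raise ValueError("expected exactly one score per pair of adjacent sentences")
    # Rank minima by depth (lowest similarity first), keep the deepest ones,
    # then restore document order before cutting.
    deepest = sorted(local_minima(scores), key=lambda i: scores[i])[:max_boundaries]
    chunks, start = [], 0
    for boundary in sorted(deepest):
        chunks.append(sentences[start:boundary + 1])
        start = boundary + 1
    chunks.append(sentences[start:])
    return chunks


# Illustrative values only: six sentences and five made-up similarity scores.
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
scores = [0.9, 0.2, 0.8, 0.7, 0.1]
chunks = split_at_minima(sentences, scores, max_boundaries=2)
print(chunks)
```

Here the two local minima are at indices 1 and 4, so the text is cut after `s1` and after `s4`. Limiting the number of boundaries is only one possible way to decide which transitions count as "most significant"; a threshold on the score or a target chunk length would be equally plausible choices.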

Use BAAI/bge-reranker-v2-m3, or any other model that supports the AutoModelForSequenceClassification interface, as the cross-encoder.
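For orientation, the sketch below shows one way such a cross-encoder could be loaded and applied to neighboring sentences with the Hugging Face `transformers` library. It is an assumption about how the scoring step might look, not Smart-Chunker's own API: the helpers `neighbor_pairs` and `score_neighbors` are hypothetical, and the code assumes `torch` and `transformers` are installed and that the model can be downloaded on first use.

```python
from typing import List


def neighbor_pairs(sentences: List[str]) -> List[List[str]]:
    """Build [sentence_i, sentence_i+1] pairs for every adjacent pair."""
    return [[a, b] for a, b in zip(sentences, sentences[1:])]


def score_neighbors(
    sentences: List[str], model_name: str = "BAAI/bge-reranker-v2-m3"
) -> List[float]:
    """Score each adjacent sentence pair with a cross-encoder (hypothetical helper).

    The heavy imports are kept inside the function so that importing this
    module does not require torch or transformers.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    pairs = neighbor_pairs(sentences)
    if not pairs:
        return []

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    with torch.no_grad():
        inputs = tokenizer(
            pairs, padding=True, truncation=True, max_length=512, return_tensors="pt"
        )
        # One relevance logit per pair; a higher value means the sentences are more related.
        logits = model(**inputs, return_dict=True).logits.view(-1).float()
    return logits.tolist()


# Pair construction can be checked without loading any model.
pairs = neighbor_pairs(["First sentence.", "Second sentence.", "Third sentence."])
print(pairs)
```

Calling `score_neighbors(sentences)` would return one raw score per adjacent pair, which could then serve as the similarity function whose minima mark the chunk boundaries.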

The smart chunker supports Russian and English.

Download files

Source Distribution

smart_chunker-0.0.4.tar.gz (17.2 kB view details)

Uploaded Source

Built Distribution

smart_chunker-0.0.4-py3-none-any.whl (9.7 kB view details)

Uploaded Python 3

File details

Details for the file smart_chunker-0.0.4.tar.gz.

File metadata

  • Download URL: smart_chunker-0.0.4.tar.gz
  • Upload date:
  • Size: 17.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for smart_chunker-0.0.4.tar.gz
Algorithm Hash digest
SHA256 e09b5045ae2eb857ebde94b6ad006ebed8a0fa5baf491225227a8eed3754043e
MD5 715cf414ae18ca5cae2b5139e541a501
BLAKE2b-256 a0d99e6540934a6c54cd9e61d3a02ef3e53cf86f80bb9fceb6f0a0a7756e7ba0

File details

Details for the file smart_chunker-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: smart_chunker-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for smart_chunker-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 7e3dca1134bf5dfce82f70f912c1bcd4936837afc3bcac8a3f17372dcedf17b9
MD5 3c627ddb9c32b5803045be50f1877d93
BLAKE2b-256 0e259d66467859748af5764318040e8a3129b11c0cbf7954ccd5ddf21bad0870