Smart-Chunker is a semantic chunker to prepare a long document for RAG

Project description

Smart-Chunker is a semantic chunker that prepares long documents for retrieval-augmented generation (RAG).

Unlike a conventional chunker, it does not split the text into fixed-size groups of N tokens. Instead, it uses a cross-encoder to compute a similarity score for each pair of neighboring sentences and splits the text at the most significant semantic transitions, i.e. at the minima of that similarity function.
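The idea of cutting at similarity minima can be illustrated with a small sketch. This is not the package's own code or API: the function names (`split_at_similarity_minima`, `word_overlap`) and the `max_chunks` parameter are hypothetical, and a toy word-overlap score stands in for the cross-encoder so the example runs without any model.

```python
from typing import Callable, List, Sequence


def word_overlap(a: str, b: str) -> float:
    """Toy stand-in for a cross-encoder: Jaccard overlap of lowercased words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def split_at_similarity_minima(
    sentences: Sequence[str],
    similarity: Callable[[str, str], float],
    max_chunks: int,
) -> List[List[str]]:
    """Cut a sentence sequence at its deepest local similarity minima.

    Hypothetical helper: gap i lies between sentences[i] and sentences[i + 1].
    """
    if max_chunks < 1:
        raise ValueError("max_chunks must be at least 1")
    sentences = list(sentences)
    if not sentences:
        return []

    # One similarity score per gap between neighboring sentences.
    scores = [similarity(a, b) for a, b in zip(sentences, sentences[1:])]

    # A gap is a local minimum if it scores lower than the gap on its left
    # and no higher than the gap on its right (missing neighbors count as +inf).
    inf = float("inf")
    minima = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else inf
        right = scores[i + 1] if i + 1 < len(scores) else inf
        if score < left and score <= right:
            minima.append(i)

    # Keep only the deepest minima, then restore document order.
    cuts = sorted(sorted(minima, key=lambda i: scores[i])[: max_chunks - 1])

    chunks, start = [], 0
    for i in cuts:
        chunks.append(sentences[start : i + 1])
        start = i + 1
    chunks.append(sentences[start:])
    return chunks


sentences = [
    "cats purr softly",
    "cats sleep a lot",
    "stocks fell today",
    "stocks may rebound",
]
chunks = split_at_similarity_minima(sentences, word_overlap, max_chunks=2)
print(chunks)
```

In this toy input the only local minimum is the gap between the second and third sentences, so the text is split into a "cats" chunk and a "stocks" chunk. A real setup would pass a cross-encoder score as `similarity` and would likely also bound the chunk length in tokens, which this sketch leaves out.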

BAAI/bge-reranker-v2-m3, or any other model that supports the AutoModelForSequenceClassification interface, should be used as the cross-encoder.
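As a rough illustration of what "supports the AutoModelForSequenceClassification interface" means in practice, the sketch below scores neighboring sentence pairs with the Hugging Face `transformers` library directly. It does not use Smart-Chunker's own API; the helper names `neighboring_pairs` and `score_neighbors` are hypothetical, and the code assumes the model emits a single logit per pair, as BAAI/bge-reranker-v2-m3 does. Applying a sigmoid to map the logits into (0, 1) is a choice made here, not something the project description specifies.

```python
from typing import List, Sequence, Tuple


def neighboring_pairs(sentences: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (sentence i, sentence i + 1) for every gap in the sequence."""
    return list(zip(sentences, sentences[1:]))


def score_neighbors(
    sentences: Sequence[str],
    model_name: str = "BAAI/bge-reranker-v2-m3",
) -> List[float]:
    """Score each neighboring pair with a cross-encoder (hypothetical helper).

    Heavy imports are kept inside the function so the module can be imported
    without torch/transformers installed. The model is downloaded on first use.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    pairs = neighboring_pairs(sentences)
    if not pairs:
        return []

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Encode each pair as (text, text_pair) so the model sees both sentences.
    inputs = tokenizer(
        [a for a, _ in pairs],
        [b for _, b in pairs],
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        # Assumes one logit per pair (num_labels == 1).
        logits = model(**inputs).logits.view(-1)

    # Squash raw logits into (0, 1); higher means more closely related.
    return torch.sigmoid(logits).tolist()
```

Calling `score_neighbors(["First sentence.", "Second sentence.", "Third sentence."])` would return two scores, one per gap; the lowest-scoring gaps are the candidate chunk boundaries.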

The smart chunker supports Russian and English.

Download files

Source Distribution

smart_chunker-0.0.5.tar.gz (17.3 kB)

Uploaded Source

Built Distribution

smart_chunker-0.0.5-py3-none-any.whl (9.8 kB)

Uploaded Python 3

File details

Details for the file smart_chunker-0.0.5.tar.gz.

File metadata

  • Download URL: smart_chunker-0.0.5.tar.gz
  • Size: 17.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for smart_chunker-0.0.5.tar.gz

  • SHA256: 935a62daa60bafd31c3c7496d751f723fbfb289c42ee7aa85355589d14726804
  • MD5: f1dab3ed43fa6f6eef99f32f76d16864
  • BLAKE2b-256: 02c34c1307e29c9ab0d17df8d1e9600381147c138014a5415787f7d26ba0b943

File details

Details for the file smart_chunker-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: smart_chunker-0.0.5-py3-none-any.whl
  • Size: 9.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for smart_chunker-0.0.5-py3-none-any.whl

  • SHA256: 18dc6cff5f1da0fa13990883ed823a40395faced826db21f9cd2327318d36e1a
  • MD5: b6d33118cbf4885123bf11754d07c417
  • BLAKE2b-256: a1029e9b63d8c4c20a8f39aa2c5ddaf2721f4242f95c7de7d1ef7e07bde18f41
