Smart-Chunker is a semantic chunker to prepare a long document for RAG

Project description

Smart-Chunker is a semantic chunker that prepares long documents for retrieval-augmented generation (RAG).

Unlike a conventional chunker, it does not split the text into fixed-size groups of N tokens. Instead, it uses a cross-encoder to score the similarity between neighboring sentences and splits the text at the most significant semantic transitions, i.e. at the minima of that similarity score.
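The boundary selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the package's actual implementation: the function name split_at_minima, the max_chunks parameter, and the rule for ranking minima (lowest score first) are all assumptions, and the similarity scores are passed in directly instead of being computed by a cross-encoder.

```python
from typing import List, Sequence


def split_at_minima(
    sentences: Sequence[str],
    similarities: Sequence[float],
    max_chunks: int,
) -> List[str]:
    """Split sentences into chunks at the deepest local minima of similarity.

    similarities[i] is the score between sentences[i] and sentences[i + 1].
    Hypothetical helper for illustration only.
    """
    if max_chunks < 1:
        raise ValueError("max_chunks must be at least 1")
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("expected one similarity per pair of neighbors")

    # A gap is a local minimum if its score is not above either neighbor's.
    minima = []
    for i, score in enumerate(similarities):
        left = similarities[i - 1] if i > 0 else float("inf")
        right = similarities[i + 1] if i + 1 < len(similarities) else float("inf")
        if score <= left and score <= right:
            minima.append(i)

    # Keep only the most significant boundaries (the lowest-scoring minima),
    # then restore document order.
    chosen = sorted(sorted(minima, key=lambda i: similarities[i])[: max_chunks - 1])

    chunks, start = [], 0
    for gap in chosen:
        chunks.append(" ".join(sentences[start : gap + 1]))
        start = gap + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

With five sentences and the toy scores [0.9, 0.2, 0.8, 0.1], the local minima fall after the second and fourth sentences, so asking for three chunks splits the text there, while asking for two chunks keeps only the deeper minimum (0.1).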

Use BAAI/bge-reranker-v2-m3, or any other model that supports the AutoModelForSequenceClassification interface, as the cross-encoder.
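For orientation, the following sketch shows one way to score neighboring sentences with such a model through the Hugging Face transformers library. It is not Smart-Chunker's own API: the helper names neighbor_pairs and score_neighbors are hypothetical, the max_length of 512 is an arbitrary choice, and the code assumes the model emits a single relevance logit per pair, as BAAI/bge-reranker-v2-m3 does.

```python
from typing import List, Sequence, Tuple


def neighbor_pairs(sentences: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (sentence, next sentence) pairs in document order."""
    return list(zip(sentences, sentences[1:]))


def score_neighbors(
    sentences: Sequence[str],
    model_name: str = "BAAI/bge-reranker-v2-m3",
) -> List[float]:
    """Score each pair of neighboring sentences with a cross-encoder.

    Hypothetical helper; returns raw logits, where higher means more similar.
    """
    # Imported lazily so that neighbor_pairs works without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    pairs = neighbor_pairs(sentences)
    if not pairs:
        return []

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Encode each pair jointly, as a cross-encoder expects.
    inputs = tokenizer(
        [first for first, _ in pairs],
        [second for _, second in pairs],
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        # Assumes one logit per pair (num_labels == 1).
        logits = model(**inputs).logits.view(-1).float()
    return logits.tolist()
```

Raw logits are sufficient for locating minima, since only the relative ordering of the scores matters. Calling score_neighbors downloads the model weights on first use.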

The smart chunker supports Russian and English.

Download files

Source Distribution

smart_chunker-0.0.3.tar.gz (3.9 MB)

Uploaded Source

Built Distribution

smart_chunker-0.0.3-py3-none-any.whl (9.6 kB)

Uploaded Python 3

File details

Details for the file smart_chunker-0.0.3.tar.gz.

File metadata

  • Download URL: smart_chunker-0.0.3.tar.gz
  • Upload date:
  • Size: 3.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for smart_chunker-0.0.3.tar.gz
Algorithm Hash digest
SHA256 bf9a709d5319551830908a577628c706bba7b4e865e10f403a8d93970d6e8710
MD5 197d78040d4b2bd5bf3e95bfc079b138
BLAKE2b-256 2f6f0582c3c6040a07d75193c4b75eb52e816d156ae5a661ec27a88319c83689

File details

Details for the file smart_chunker-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: smart_chunker-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 9.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for smart_chunker-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 bcfc5d1b430abb74eb60de329fd5e62498a045350b5f6d5a02d2f6f1d590e17d
MD5 7781833e1128d91f33dad4204cc5d3f6
BLAKE2b-256 d9c5497b87616638e1705ae10e82efd997c16bb9cc5327b5dbe2401c86b80eb1
