
KARA - Efficient RAG Knowledge Base Updates


KARA stands for Knowledge-Aware Re-embedding Algorithm. The word "Kara" (کارآ) also means "efficient" in Persian.

KARA is a Python library that efficiently updates knowledge bases by reducing unnecessary embedding operations. When documents change, KARA automatically identifies and reuses existing chunks, minimizing the need for new embeddings.

Installation

pip install kara-toolkit

# With LangChain integration
pip install kara-toolkit[langchain]

Key Parameters

imperfect_chunk_tolerance (int, default: 9)
    Controls the trade-off between reusing existing chunks and creating new, perfectly sized ones.
    - 0: No tolerance; disables chunk reuse.
    - 1: Prefers a new chunk over two imperfect ones.
    - 9: Balanced default.
    - 99+: Maximizes reuse, at the cost of less uniform chunk sizes.

chunk_size (int, default: 500)
    Target size (in characters) for each text chunk.

separators (List[str], default: ["\n\n", "\n", " "])
    Strings used to split the text. If not provided, the default separators from RecursiveCharacterChunker are used.
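
The effect of imperfect_chunk_tolerance is easiest to see by running the same update at several settings. The snippet below is an illustrative sketch, not a benchmark; it uses only the classes and result fields shown in the Quick Start section, and the exact numbers will depend on your documents.

from kara import KARAUpdater, RecursiveCharacterChunker

chunker = RecursiveCharacterChunker(chunk_size=500)

original = ["First section...\n\nSecond section...\n\nThird section..."]
edited = ["First section...\n\nSecond section (revised)...\n\nThird section..."]

for tolerance in (0, 1, 9, 99):
    updater = KARAUpdater(chunker=chunker, imperfect_chunk_tolerance=tolerance)
    base = updater.create_knowledge_base(original)
    update = updater.update_knowledge_base(base.new_chunked_doc, edited)
    # Higher tolerance generally means more reused chunks (higher efficiency_ratio)
    # at the cost of less uniform chunk sizes.
    print(tolerance, f"{update.efficiency_ratio:.1%}", update.num_reused)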

Quick Start

from kara import KARAUpdater, RecursiveCharacterChunker

# Initialize
chunker = RecursiveCharacterChunker(chunk_size=500)
updater = KARAUpdater(chunker=chunker, imperfect_chunk_tolerance=9)

# Process initial documents
result = updater.create_knowledge_base(["Your document content..."])

# Update with new content - reuses existing chunks automatically
update_result = updater.update_knowledge_base(
    result.new_chunked_doc,
    ["Updated document content..."]
)

print(f"Efficiency: {update_result.efficiency_ratio:.1%}")
print(f"Chunks reused: {update_result.num_reused}")

LangChain Integration

from kara.integrations.langchain import KARATextSplitter
from langchain_core.documents import Document

# Use as a drop-in replacement for LangChain text splitters
splitter = KARATextSplitter(chunk_size=300, imperfect_chunk_tolerance=2)

docs = [Document(page_content="Your content...", metadata={"source": "file.pdf"})]
chunks = splitter.split_documents(docs)

Examples

See examples/ for complete usage examples.

How It Works

KARA formulates chunking as a graph optimization problem:

  1. Creates a DAG where nodes are split positions and edges are potential chunks
  2. Uses Dijkstra's algorithm to find optimal chunking paths
  3. Automatically reuses existing chunks to minimize embedding costs
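
As a rough mental model, the shortest-path formulation looks something like the toy below. This is a simplified illustration, not the library's actual implementation, and the cost function is invented for the example: reusing an already-embedded chunk is free, while a new chunk costs one embedding plus a penalty for deviating from the target size.

import heapq

def choose_chunks(text, split_positions, existing_chunks, target=500):
    """Toy shortest-path chunker: nodes are split positions, edges are
    candidate chunks; chunks that were already embedded are free to reuse."""
    positions = [0] + sorted(split_positions) + [len(text)]
    n = len(positions)
    dist = [float("inf")] * n
    prev = [None] * n
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue
        for j in range(i + 1, n):
            chunk = text[positions[i]:positions[j]]
            if len(chunk) > 2 * target:  # skip implausibly long chunks
                break
            # Reusing an existing chunk costs nothing; a new chunk costs one
            # embedding plus a penalty for deviating from the target size.
            cost = 0.0 if chunk in existing_chunks else 1.0 + abs(len(chunk) - target) / target
            if d + cost < dist[j]:
                dist[j] = d + cost
                prev[j] = i
                heapq.heappush(heap, (d + cost, j))
    # Walk backwards from the final position to recover the chosen chunks.
    chunks, j = [], n - 1
    while prev[j] is not None:
        chunks.append(text[positions[prev[j]]:positions[j]])
        j = prev[j]
    return list(reversed(chunks))

Because reused chunks cost nothing, the cheapest path through the graph is the chunking that re-embeds the least; the tolerance parameter effectively controls how the cost of a new chunk trades off against chunk-size uniformity.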

Limitations

While KARA provides significant efficiency improvements for knowledge base updates, there are some current limitations to be aware of:

  • Document Version Dependency: The biggest limitation is that you need to keep the previous version of each document in order to identify reusable chunks. You can reduce this storage overhead by reconstructing the previous version from the chunks already saved in your vector store (see the sketch after this list). Even so, this compares favorably with LangChain's indexing solution (documented here), which maintains a separate SQL database of chunk hashes while remaining far less efficient.

  • Chunking Configuration Changes: Changing the splitting configuration (chunk size, separator characters) between updates is not recommended, as it can invalidate the previously computed optimal chunking. We have not yet tested how much such changes affect performance.

  • No Chunk Overlap Support: We currently do not support overlapping chunks, but we are investigating whether this feature can be added in future versions.
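
If your vector store keeps chunks with their source and position, the previous document version mentioned above can be rebuilt by concatenation. The sketch below is hypothetical: the "source" and "chunk_index" metadata keys and the dictionary layout are placeholders, not part of KARA's API.

def reconstruct_document(stored_chunks, source_id):
    # Hypothetical helper: the metadata keys used here are placeholders.
    relevant = [c for c in stored_chunks if c["metadata"]["source"] == source_id]
    relevant.sort(key=lambda c: c["metadata"]["chunk_index"])
    # KARA chunks are non-overlapping, so concatenating them in order
    # recovers the previous version of the document.
    return "".join(c["text"] for c in relevant)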

Roadmap to 1.0.0

  • 100% Test Coverage - Complete test suite with full coverage
  • Performance Benchmarks - Real-world efficiency testing
  • Framework Support - LlamaIndex, Haystack, and others
  • Complete Documentation - API reference, guides, and examples
  • Token-Based Optimal Chunking - Extend algorithm to support token-based chunking strategies

License

CC BY 4.0 License - see LICENSE file for details.

