Knowledge-Aware Re-embedding Algorithm - Efficient RAG knowledge base updates

KARA - Knowledge-Aware Re-embedding Algorithm

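KARA is a Python library for updating documents efficiently in retrieval-augmented generation (RAG) systems. When a document changes, KARA re-chunks the new text so that as many existing chunks as possible are reused, which keeps the number of embedding operations to a minimum.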
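
To see why chunk reuse depends on where the chunk boundaries fall, here is a small stand-alone sketch. It does not use KARA; the helper names (`fixed_size_chunks`, `sentence_chunks`, `reused_fraction`) are made up for this illustration. It compares fixed-size windows with sentence-aligned chunks after one sentence is prepended to a document.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str) -> list[str]:
    """Split text on '. ' so every chunk is one sentence (without the period)."""
    return [s for s in text.split(". ") if s]


def reused_fraction(old_chunks: list[str], new_chunks: list[str]) -> float:
    """Fraction of new chunks whose exact text already exists in old_chunks."""
    if not new_chunks:
        return 1.0
    known = set(old_chunks)
    return sum(chunk in known for chunk in new_chunks) / len(new_chunks)


# Twenty sentences of 20 characters each, then the same text with a new intro.
original = "".join(f"Sentence number {i:02d}. " for i in range(20))
updated = "New intro. " + original

# Fixed-size windows: the 11-character insertion shifts every boundary.
naive_reuse = reused_fraction(
    fixed_size_chunks(original, 40), fixed_size_chunks(updated, 40)
)

# Sentence-aligned chunks: only the new sentence is unseen.
aligned_reuse = reused_fraction(
    sentence_chunks(original), sentence_chunks(updated)
)

print(f"fixed-size reuse: {naive_reuse:.1%}")
print(f"sentence-aligned reuse: {aligned_reuse:.1%}")
```

With fixed-size windows none of the new chunks match an old one, while sentence-aligned chunks match in 20 of 21 cases. This only motivates the problem; it is not how KARA chooses boundaries.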

How It Works

KARA formulates chunking as a graph optimization problem:

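  1. Builds a directed acyclic graph (DAG) in which nodes are candidate split positions and edges are candidate chunks
  2. Runs Dijkstra's algorithm to find the lowest-cost path through the graph, which corresponds to the optimal chunking
  3. Automatically reuses existing chunks along that path to minimize embedding costs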
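
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not KARA's actual implementation: the function name `optimal_chunking`, the `reuse_cost` parameter, and the cost model (1.0 for a chunk that needs a new embedding, a small constant for a chunk that already exists) are all hypothetical.

```python
import heapq
import re


def optimal_chunking(text, split_positions, existing_chunks,
                     max_chunk_size, reuse_cost=0.1):
    """Return the cheapest chunking of `text` (assumed cost model, see above)."""
    # Nodes: candidate split positions, plus the start and end of the text.
    inner = {p for p in split_positions if 0 < p < len(text)}
    positions = sorted(inner | {0, len(text)})
    known = set(existing_chunks)
    target = len(positions) - 1

    dist = {0: 0.0}      # cheapest known cost to reach each node
    prev = {}            # predecessor on the cheapest path
    heap = [(0.0, 0)]
    while heap:
        cost, i = heapq.heappop(heap)
        if cost > dist.get(i, float("inf")):
            continue     # stale heap entry
        if i == target:
            break
        # Edges: every chunk starting at node i that fits in max_chunk_size.
        for j in range(i + 1, len(positions)):
            if positions[j] - positions[i] > max_chunk_size:
                break
            chunk = text[positions[i]:positions[j]]
            new_cost = cost + (reuse_cost if chunk in known else 1.0)
            if new_cost < dist.get(j, float("inf")):
                dist[j] = new_cost
                prev[j] = i
                heapq.heappush(heap, (new_cost, j))

    if target not in dist:
        raise ValueError("no chunking satisfies max_chunk_size")

    # Walk the predecessor links back from the end to recover the chunks.
    chunks, j = [], target
    while j != 0:
        i = prev[j]
        chunks.append(text[positions[i]:positions[j]])
        j = i
    chunks.reverse()
    return chunks


old_chunks = ["Alpha beta. Gamma delta. ", "Epsilon zeta. Eta theta. "]
new_text = "Intro line. " + "".join(old_chunks)
splits = [m.end() for m in re.finditer(r"\. ", new_text)]

chunks = optimal_chunking(new_text, splits, old_chunks, max_chunk_size=30)
reused = [c for c in chunks if c in old_chunks]
print(chunks)
print(f"{len(reused)} of {len(chunks)} chunks reused")
```

In this example the cheapest path keeps both old chunks intact and creates one new chunk for the inserted sentence, so only that chunk would need an embedding. Because the graph is a DAG with nodes in positional order, a simple left-to-right dynamic program would give the same result; Dijkstra is used here to match the description above.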

Typical efficiency gains: 70-90% fewer embeddings for document updates.

Installation

pip install kara-toolkit

# With LangChain integration (quoted so that shells such as zsh do not expand the brackets)
pip install "kara-toolkit[langchain]"

Quick Start

from kara import KARAUpdater, RecursiveCharacterChunker

# Initialize
chunker = RecursiveCharacterChunker(chunk_size=500)
updater = KARAUpdater(chunker=chunker, epsilon=0.1)

# Process initial documents
result = updater.create_knowledge_base(["Your document content..."])

# Update with new content - reuses existing chunks automatically
update_result = updater.update_knowledge_base(
    result.new_chunked_doc,
    ["Updated document content..."]
)

print(f"Efficiency: {update_result.efficiency_ratio:.1%}")
print(f"Chunks reused: {update_result.num_reused}")
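
Reused chunks pay off downstream because their embeddings can be kept. The following sketch shows the idea with a text-keyed cache. It is independent of KARA's API: `embed_only_new` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for a real embedding model call.

```python
from typing import Callable

Vector = list[float]


def embed_only_new(chunks: list[str],
                   cache: dict[str, Vector],
                   embed: Callable[[str], Vector]) -> tuple[list[Vector], int]:
    """Return one vector per chunk, calling `embed` only for unseen chunk text."""
    calls = 0
    vectors = []
    for chunk in chunks:
        if chunk not in cache:
            cache[chunk] = embed(chunk)   # the expensive call
            calls += 1
        vectors.append(cache[chunk])
    return vectors, calls


def fake_embed(text: str) -> Vector:
    """Stand-in for a real embedding model (hypothetical, for illustration)."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache: dict[str, Vector] = {}
initial = ["First section. ", "Second section. ", "Third section. "]
updated = ["First section. ", "Second section, revised. ", "Third section. "]

_, first_calls = embed_only_new(initial, cache, fake_embed)
_, update_calls = embed_only_new(updated, cache, fake_embed)
saved = 1 - update_calls / len(updated)

print(f"Embedding calls on create: {first_calls}")
print(f"Embedding calls on update: {update_calls}")
print(f"Calls avoided on update: {saved:.1%}")
```

The `saved` ratio here is only an illustration; how KARA itself computes `efficiency_ratio` is not specified in this README.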

LangChain Integration

from kara.integrations.langchain import KARATextSplitter
from langchain_core.documents import Document

# Use as a drop-in replacement for LangChain text splitters
splitter = KARATextSplitter(chunk_size=300, epsilon=0.1)

docs = [Document(page_content="Your content...", metadata={"source": "file.pdf"})]
chunks = splitter.split_documents(docs)

Examples

See examples/ for complete usage examples.

License

CC BY 4.0 License - see LICENSE file for details.

Download files

Source Distribution

kara_toolkit-0.1.1.tar.gz (29.0 kB view details)

Uploaded Source

Built Distribution

kara_toolkit-0.1.1-py3-none-any.whl (11.5 kB view details)

Uploaded Python 3

File details

Details for the file kara_toolkit-0.1.1.tar.gz.

File metadata

  • Download URL: kara_toolkit-0.1.1.tar.gz
  • Upload date:
  • Size: 29.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kara_toolkit-0.1.1.tar.gz
Algorithm Hash digest
SHA256 669bfef3303bc33fbdb1ccd6aef22eb80249592878769dbae044f6b6b4c16d9b
MD5 9c53ef4b1bd87b5568f1052a371dc8df
BLAKE2b-256 d167f8e983486a1c96503a9200a177ca0b21c3e6056ea475cd9edb0e2cba1f32

Provenance

The following attestation bundles were made for kara_toolkit-0.1.1.tar.gz:

Publisher: publish.yml on mzakizadeh/kara

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kara_toolkit-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: kara_toolkit-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 11.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kara_toolkit-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a1c99ffef2d3ec22c67103402e91482cec4a8fc3521fa39fc52c248c4578a89c
MD5 fc091fbc01a80d3d0e4a114160fb020e
BLAKE2b-256 e32bb7717d2f3cf01ef75e1b4a265893c2e949370487b0e2f478a132f12d10d9

Provenance

The following attestation bundles were made for kara_toolkit-0.1.1-py3-none-any.whl:

Publisher: publish.yml on mzakizadeh/kara
