Knowledge-Aware Re-embedding Algorithm - Efficient RAG knowledge base updates
KARA stands for Knowledge-Aware Re-embedding Algorithm. The word "kara" (کارآ) also means "efficient" in Persian.
KARA is a Python library that efficiently updates knowledge bases by reducing unnecessary embedding operations. When documents change, KARA automatically identifies and reuses existing chunks, minimizing the need for new embeddings.
Installation
pip install kara-toolkit
# With LangChain integration
pip install kara-toolkit[langchain]
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| imperfect_chunk_tolerance | int | 9 | Controls the trade-off between reusing existing chunks and creating new, perfectly sized ones. 0: no tolerance; disables chunk reuse. 1: prefers one new chunk over two imperfect ones. 9: balanced default. 99+: maximizes reuse at the cost of less uniform chunk sizes. |
| chunk_size | int | 500 | Target size (in characters) for each text chunk. |
| separators | List[str] | ["\n\n", "\n", " "] | Strings used to split the text. If not provided, uses the default separators from RecursiveCharacterChunker. |
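To make the tolerance setting concrete, here is a minimal sketch of how such a trade-off could be scored. The `chunk_cost` function and its weighting are illustrative assumptions, not KARA's actual internals:

```python
def chunk_cost(length, target, is_reused, tolerance):
    """Illustrative cost model: embedding a new chunk costs 1.0;
    a reused chunk is free except for a size-deviation penalty
    that shrinks as the tolerance grows."""
    deviation = abs(length - target) / target
    if is_reused:
        # tolerance=0 makes reuse infinitely costly, i.e. disabled.
        return float("inf") if tolerance == 0 else deviation / tolerance
    return 1.0 + deviation

# A slightly undersized reused chunk beats embedding a perfect new one:
print(chunk_cost(450, 500, True, 9) < chunk_cost(500, 500, False, 9))  # True
```

Under this toy model, raising the tolerance makes oddly sized reused chunks progressively cheaper relative to fresh embeddings, matching the behavior described in the table.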
Quick Start
from kara import KARAUpdater, RecursiveCharacterChunker
# Initialize
chunker = RecursiveCharacterChunker(chunk_size=500)
updater = KARAUpdater(chunker=chunker, imperfect_chunk_tolerance=9)
# Process initial documents
result = updater.create_knowledge_base(["Your document content..."])
# Update with new content - reuses existing chunks automatically
update_result = updater.update_knowledge_base(
result.new_chunked_doc,
["Updated document content..."]
)
print(f"Efficiency: {update_result.efficiency_ratio:.1%}")
print(f"Chunks reused: {update_result.num_reused}")
LangChain Integration
from kara.integrations.langchain import KARATextSplitter
from langchain_core.documents import Document
# Use as a drop-in replacement for LangChain text splitters
splitter = KARATextSplitter(chunk_size=300, imperfect_chunk_tolerance=2)
docs = [Document(page_content="Your content...", metadata={"source": "file.pdf"})]
chunks = splitter.split_documents(docs)
Examples
See the examples/ directory for complete usage examples.
How It Works
KARA formulates chunking as a graph optimization problem:
- Creates a DAG where nodes are split positions and edges are potential chunks
- Uses Dijkstra's algorithm to find optimal chunking paths
- Automatically reuses existing chunks to minimize embedding costs
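The steps above can be sketched as a toy shortest-path problem. This is an illustrative, standard-library-only reimplementation, not KARA's actual code; the cost model (a fixed embedding cost of 1.0 plus a size-deviation penalty) is an assumption:

```python
import heapq

def optimal_chunking(positions, existing_chunks, text, target=500):
    """Nodes are split positions; an edge (i, j) is the candidate chunk
    text[positions[i]:positions[j]]. A reused chunk pays only a size
    penalty; a new chunk also pays a fixed embedding cost of 1.0."""
    n = len(positions)
    dist = [float("inf")] * n
    prev = [None] * n
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue
        for j in range(i + 1, n):
            chunk = text[positions[i]:positions[j]]
            penalty = abs(len(chunk) - target) / target
            cost = penalty if chunk in existing_chunks else 1.0 + penalty
            if d + cost < dist[j]:
                dist[j] = d + cost
                prev[j] = i
                heapq.heappush(heap, (d + cost, j))
    # Walk back from the final split position to recover the chunks.
    chunks, j = [], n - 1
    while prev[j] is not None:
        chunks.append(text[positions[prev[j]]:positions[j]])
        j = prev[j]
    return list(reversed(chunks))

# With "abcd" already embedded, the cheapest path reuses it and
# embeds only one new chunk:
print(optimal_chunking([0, 2, 4, 6, 8], {"abcd"}, "abcdefgh", target=4))
# -> ['abcd', 'efgh']
```

Because reused chunks are nearly free while new chunks pay the embedding cost, the shortest path naturally routes through as many existing chunks as the size penalty allows.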
Limitations
While KARA provides significant efficiency improvements for knowledge base updates, there are some current limitations to be aware of:
- Document Version Dependency: The biggest limitation is that you need to keep the previous version of each document to identify reusable chunks. You may, however, be able to reconstruct document content from the chunks already saved in your vector store, reducing storage overhead. Compared to LangChain's indexing solution (documented here), which maintains a separate SQL database of chunk hashes while remaining extremely inefficient, our approach is still superior.
- Chunking Configuration Changes: Splitting configuration (chunk size, separator characters) should likely not change between updates, as this may invalidate the algorithm's optimal solution. We have not yet tested how strongly configuration changes impact performance.
- No Chunk Overlap Support: Overlapping chunks are not currently supported; we are investigating whether this feature can be added in future versions.
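To mitigate the document-version limitation above, document text could be rebuilt from stored chunks. A minimal sketch, assuming each chunk is stored with a sequence number (the `(seq, text)` layout is hypothetical, not KARA's schema):

```python
def reconstruct(chunks):
    """chunks: (seq, text) pairs retrieved from the vector store for
    one document. Assumes non-overlapping chunks that cover the full
    document (KARA does not support overlap), so sorted concatenation
    restores the original text."""
    return "".join(text for _, text in sorted(chunks))

# Chunks may come back from the store in any order:
print(reconstruct([(1, "world"), (0, "hello "), (2, "!")]))  # hello world!
```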
Roadmap to 1.0.0
- 100% Test Coverage - Complete test suite with full coverage
- Performance Benchmarks - Real-world efficiency testing
- Framework Support - LlamaIndex, Haystack, and others
- Complete Documentation - API reference, guides, and examples
- Token-Based Optimal Chunking - Extend algorithm to support token-based chunking strategies
License
CC BY 4.0 License - see LICENSE file for details.
File details
Details for the file kara_toolkit-0.2.1.tar.gz.
File metadata
- Download URL: kara_toolkit-0.2.1.tar.gz
- Upload date:
- Size: 34.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0bc083b26d35e545b0eee4f8308c0fbd9b64e625adb052dad856bdfbedfc6794 |
| MD5 | 107598c1658f33368450a28290f5695f |
| BLAKE2b-256 | 8429b33b91f61aebef3d3a3ba19416e4c3b9a0555b75af91a7233456d59d57fa |
Provenance
The following attestation bundles were made for kara_toolkit-0.2.1.tar.gz:
Publisher: publish.yml on mzakizadeh/kara
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kara_toolkit-0.2.1.tar.gz
- Subject digest: 0bc083b26d35e545b0eee4f8308c0fbd9b64e625adb052dad856bdfbedfc6794
- Sigstore transparency entry: 272992470
- Sigstore integration time:
- Permalink: mzakizadeh/kara@b73445db08d99465e94725cc502e91e6231a191e
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/mzakizadeh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b73445db08d99465e94725cc502e91e6231a191e
- Trigger Event: release
File details
Details for the file kara_toolkit-0.2.1-py3-none-any.whl.
File metadata
- Download URL: kara_toolkit-0.2.1-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | faa681936588c8526c7ed73bb57d02a4aba1f6d3faffb5dc06224414d4569080 |
| MD5 | 905a81c42dbb7595bcf127d7d2673c59 |
| BLAKE2b-256 | c280b7ec634cc816d55fa8c61285f23a8dbc6aaead5e69cbb517bc273ff624ac |
Provenance
The following attestation bundles were made for kara_toolkit-0.2.1-py3-none-any.whl:
Publisher: publish.yml on mzakizadeh/kara
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kara_toolkit-0.2.1-py3-none-any.whl
- Subject digest: faa681936588c8526c7ed73bb57d02a4aba1f6d3faffb5dc06224414d4569080
- Sigstore transparency entry: 272992472
- Sigstore integration time:
- Permalink: mzakizadeh/kara@b73445db08d99465e94725cc502e91e6231a191e
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/mzakizadeh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b73445db08d99465e94725cc502e91e6231a191e
- Trigger Event: release