Toolkit for the retrieval half of RAG: ingest, chunk, embed, store, and hybrid-search a document corpus for LLM skills and MCP services
libkit
libkit is a toolkit for the retrieval half of RAG. It ingests documents (PDF, Markdown, Office), chunks and embeds them, stores everything in a single DuckDB file, and answers queries with hybrid search (vector + full-text, fused with RRF) plus optional reranking and attribute weighting.
There's no generation here — libkit gives you the building blocks to stand up a knowledge base for an LLM skill or an MCP service, with sensible defaults and an "it just works" entry point.
# Run inside an async function (or a REPL that supports top-level await).
from libkit import Library

lib = await Library.open("corpus.duckdb")  # smart defaults
await lib.ingest("paper.pdf")              # → chunk → embed → store
hits = await lib.query("how does cache eviction work?", limit=5)
for h in hits:
    print(h.score, h.chunk.text[:80])
Why libkit
- Async-first, batteries-included. `Library.open()` wires up a recommended embedder, the standard loader map, persistent caching, and adaptive request coalescing; every piece is overridable.
- Hybrid retrieval. Dense vector search and DuckDB full-text BM25 run in parallel and fuse with Reciprocal Rank Fusion; an optional cross-encoder reranker and per-query attribute weighting refine the ranking.
- One file, no services. Documents, chunks, vectors, and the FTS index all live in a single DuckDB database. No external vector DB to run.
- Generic metadata. Four auto-filled top-level fields (`source_url`, `content_type`, `title`, `date`) plus a free-form `metadata` JSON column; filters and weights work over both.
- Pluggable backends. Loaders, embedders, and rerankers are injected as protocol-conforming instances: bring your own, or use the bundled adapters (OpenAI, DeepInfra, vLLM, local MLX/torch, Cohere, ZeroEntropy, Datalab, pdfmux, LibreOffice).
- Strictly typed. Ships `py.typed`; `pyright`-checked.
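The hybrid-retrieval bullet above names Reciprocal Rank Fusion. As a rough illustration of that technique (not libkit's actual implementation), the sketch below fuses two ranked lists of chunk IDs. The function name `rrf_fuse`, the sample IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical dense-search order
fts_hits = ["c1", "c9", "c3"]     # hypothetical BM25 order
fused = rrf_fuse([vector_hits, fts_hits])
print([doc_id for doc_id, _ in fused])
```

Because RRF only looks at ranks, it needs no score normalization between the vector and full-text result lists, which is one reason it is a popular choice for hybrid search.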
Install
pip install libkit # or: uv add libkit
libkit's core is pure-Python with a small dependency set. Heavier or service-specific backends are opt-in extras:
| Extra | Pulls in | For |
|---|---|---|
| `pdf` | `pdfmux` | Local PDF extraction |
| `cohere` | `cohere` | Cohere reranker |
| `zeroentropy` | `httpx` | ZeroEntropy hosted reranker |
| `local-rerank` | `sentence-transformers`, `accelerate` | In-process cross-encoder rerank |
| `mcp` | `mcp` | Serve a Library over MCP |
| `fancychunk-torch` / `fancychunk-mlx` / `fancychunk-cuda` | `fancychunk` | Local embedding/chunking |
pip install "libkit[pdf,cohere,mcp]"
Some embedders/loaders call hosted APIs (OpenAI, DeepInfra, Cohere, ZeroEntropy, Datalab) and read their keys from the environment (`OPENAI_API_KEY`, `DEEPINFRA_API_KEY`, `DATALAB_API_KEY`, …).
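Since hosted backends read their keys from the environment, it can help to fail early when one is missing. The following is a minimal sketch using only the standard library; `missing_keys` is a hypothetical helper, not part of libkit, and the list of required names depends on which backends you actually configure.

```python
import os
from collections.abc import Iterable, Mapping
from typing import Optional


def missing_keys(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Check only the keys for the backends you actually use.
needed = ["OPENAI_API_KEY", "DATALAB_API_KEY"]
absent = missing_keys(needed)
if absent:
    print("set these before using hosted backends:", ", ".join(absent))
```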
Quickstart
import asyncio

from libkit import Library, QueryWeights


async def main():
    # Smart defaults: remote bulk-ingest embeddings, local interactive query
    # embeddings, caching, and coalescing. db_path is the only requirement.
    lib = await Library.open(
        "corpus.duckdb",
        embedding="auto",  # "auto" | "local" | "remote"
        model="qwen3_600m",
    )

    # Ingest. Idempotent on content hash; the loader is chosen by extension.
    # The four top-level fields are auto-filled; override any via metadata=,
    # and add arbitrary keys (stored in the metadata JSON).
    await lib.ingest("paper.pdf", metadata={"doc_type": "paper", "author": "Smith"})
    await lib.ingest("notes.md")

    # Batch ingest yields a result per document as it finishes.
    async for r in lib.ingest_batch(["a.pdf", "b.pdf", "c.md"]):
        if r.error:
            print("failed:", r.path, r.error)

    # Hybrid query with optional recency/attribute weighting and filters.
    results = await lib.query(
        "how does the cache eviction work?",
        limit=8,
        weights=QueryWeights(recency=0.2, attributes={"doc_type": {"paper": 1.5}}),
        filters={"author": "Smith"},
    )
    for r in results:
        print(f"{r.score:.3f} {r.chunk.source_url}\n {r.chunk.text[:100]}")

    await lib.close()


asyncio.run(main())
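The Quickstart notes that ingestion is idempotent on content hash. The README does not say which hash function or storage libkit uses, so the following is only a sketch of the general idea: hash the document bytes and skip anything already seen. `SeenHashes` is a hypothetical in-memory stand-in, and SHA-256 is an assumed choice.

```python
import hashlib


class SeenHashes:
    """In-memory stand-in for a persistent table of ingested content hashes."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, content: bytes) -> bool:
        """Record `content`; return True if it was new, False if a duplicate."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


seen = SeenHashes()
first = seen.add_if_new(b"# Notes\ncache eviction uses LRU\n")
again = seen.add_if_new(b"# Notes\ncache eviction uses LRU\n")
edited = seen.add_if_new(b"# Notes\ncache eviction uses LFU\n")
print(first, again, edited)
```

Keying on content rather than on the file path means that re-ingesting an unchanged file is a no-op, while an edited file is treated as new content.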
Full control
`Library.open()` is a convenience over an explicit, frozen `LibraryConfig`:
from libkit import Library, LibraryConfig
from libkit.embedders import default_embedder
from libkit.loaders import MarkdownLoader

lib = Library(
    LibraryConfig(
        db_path="corpus.duckdb",
        embedder=default_embedder(embedding="remote"),
        loaders={".md": MarkdownLoader()},
        chunk_size_tokens=512,
        chunk_overlap_tokens=64,
    )
)
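The `chunk_size_tokens` and `chunk_overlap_tokens` settings suggest a sliding window over the token stream. As a sketch of how such a window behaves (libkit's real chunker may also respect headings or sentence boundaries, which the README does not describe), the hypothetical `chunk_tokens` below splits a pre-tokenized list; plain strings stand in for model tokens.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, size=4, overlap=1):
    print(chunk)
```

With a size of 512 and an overlap of 64, each window advances by 448 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.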
Serve over MCP
from libkit import Library
from libkit.mcp import serve_mcp # requires the `mcp` extra
lib = await Library.open("corpus.duckdb")
await serve_mcp(lib) # exposes ingest/query/get/list/delete tools
How it works
ingest → load (PDF/MD/Office → Markdown) → chunk → embed → DuckDB
query → embed query → [vector top-k ‖ FTS top-k] → RRF fuse
→ optional rerank → attribute weighting → results
See docs/DESIGN.md for the full design: schema, the adaptive-concurrency pipeline, caching, and the correctness invariants.
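The final stage of the query pipeline is attribute weighting. The README does not give the formula, so the sketch below shows just one plausible reading of a weight map such as `{"doc_type": {"paper": 1.5}}`: a multiplicative boost for hits whose metadata matches. `apply_attribute_weights` and the sample scores are hypothetical and are not libkit's scoring code.

```python
def apply_attribute_weights(
    hits: list[tuple[float, dict[str, str]]],
    attributes: dict[str, dict[str, float]],
) -> list[tuple[float, dict[str, str]]]:
    """Multiply each hit's score by the weight of every matching attribute value."""
    reweighted = []
    for score, metadata in hits:
        for field, value_weights in attributes.items():
            # Values without a configured weight keep a neutral factor of 1.0.
            score *= value_weights.get(metadata.get(field, ""), 1.0)
        reweighted.append((score, metadata))
    return sorted(reweighted, key=lambda hit: hit[0], reverse=True)


hits = [
    (0.030, {"doc_type": "note"}),
    (0.025, {"doc_type": "paper"}),
]
ranked = apply_attribute_weights(hits, {"doc_type": {"paper": 1.5}})
print([meta["doc_type"] for _, meta in ranked])
```

In this sketch the paper overtakes the note because its boosted score (0.025 × 1.5) exceeds the note's unboosted 0.030.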
Status
libkit is at 0.2. The API is usable and tested, but may still shift before 1.0. Issues and PRs are welcome; see CONTRIBUTING.md.
License
MIT © Sam Quigley