Toolkit for the retrieval half of RAG: ingest, chunk, embed, store, and hybrid-search a document corpus for LLM skills and MCP services

libkit

libkit is a toolkit for the retrieval half of RAG. It ingests documents (PDF, Markdown, Office), chunks and embeds them, stores everything in a single DuckDB file, and answers queries with hybrid search (vector + full-text, fused with RRF) plus optional reranking and attribute weighting.

There's no generation here — libkit gives you the building blocks to stand up a knowledge base for an LLM skill or an MCP service, with sensible defaults and an "it just works" entry point.

# Uses top-level await: run inside an async function or the `python -m asyncio` REPL.
from libkit import Library

lib = await Library.open("corpus.duckdb")        # smart defaults
await lib.ingest("paper.pdf")                      # → chunk → embed → store
hits = await lib.query("how does cache eviction work?", limit=5)
for h in hits:
    print(h.score, h.chunk.text[:80])

Why libkit

  • Async-first, batteries-included. Library.open() wires up a recommended embedder, the standard loader map, persistent caching, and adaptive request coalescing — every piece overridable.
  • Hybrid retrieval. Dense vector search and DuckDB full-text BM25 run in parallel and fuse with Reciprocal Rank Fusion; an optional cross-encoder reranker and per-query attribute weighting refine the ranking.
  • One file, no services. Documents, chunks, vectors, and the FTS index all live in a single DuckDB database. No external vector DB to run.
  • Generic metadata. Four auto-filled top-level fields (source_url, content_type, title, date) plus a free-form metadata JSON column; filters and weights work over both.
  • Pluggable backends. Loaders, embedders, and rerankers are injected as protocol-conforming instances — bring your own, or use the bundled adapters (OpenAI, DeepInfra, vLLM, local MLX/torch, Cohere, ZeroEntropy, Datalab, pdfmux, LibreOffice).
  • Strictly typed. Ships py.typed; pyright-checked.
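
To make "protocol-conforming instances" concrete, here is a minimal sketch of structural typing with `typing.Protocol`. The `Embedder` protocol, its `embed` signature, and the toy `HashEmbedder` class are all assumptions made for illustration; libkit's real interfaces may look different, so check its documentation before writing an adapter.

```python
import asyncio
import hashlib
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Hypothetical protocol; libkit's real embedder interface may differ."""

    async def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Toy embedder: derives a deterministic vector from a SHA-256 digest."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim  # at most 32, the length of a SHA-256 digest

    async def embed(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale the first `dim` bytes into the range [0, 1].
            vectors.append([b / 255 for b in digest[: self.dim]])
        return vectors


embedder = HashEmbedder(dim=4)
vectors = asyncio.run(embedder.embed(["hello", "world"]))
```

`HashEmbedder` never inherits from `Embedder`; it satisfies the protocol only by having a matching method, which is what lets a caller swap in its own backend.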

Install

pip install libkit            # or: uv add libkit

libkit's core is pure-Python with a small dependency set. Heavier or service-specific backends are opt-in extras:

Extra                                                Pulls in                           For
pdf                                                  pdfmux                             Local PDF extraction
cohere                                               cohere                             Cohere reranker
zeroentropy                                          httpx                              ZeroEntropy hosted reranker
local-rerank                                         sentence-transformers, accelerate  In-process cross-encoder rerank
mcp                                                  mcp                                Serve a Library over MCP
fancychunk-torch / fancychunk-mlx / fancychunk-cuda  fancychunk                         Local embedding/chunking

pip install "libkit[pdf,cohere,mcp]"

Some embedders/loaders call hosted APIs (OpenAI, DeepInfra, Cohere, ZeroEntropy, Datalab) and read their keys from the environment (OPENAI_API_KEY, DEEPINFRA_API_KEY, DATALAB_API_KEY, …).
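
Since hosted backends read their keys from the environment, it can help to fail fast before opening a library. The sketch below is plain Python and not part of libkit; `missing_keys` is a hypothetical helper, and which keys you actually need depends on the backends you configure.

```python
import os
from typing import Mapping


def missing_keys(required: list[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
fake_env = {"OPENAI_API_KEY": "sk-example", "DEEPINFRA_API_KEY": ""}
missing = missing_keys(
    ["OPENAI_API_KEY", "DEEPINFRA_API_KEY", "DATALAB_API_KEY"], fake_env
)
print(missing)  # ['DEEPINFRA_API_KEY', 'DATALAB_API_KEY']
```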

Quickstart

import asyncio
from libkit import Library, QueryWeights


async def main():
    # Smart defaults: remote bulk-ingest embeddings, local interactive query
    # embeddings, caching, and coalescing. db_path is the only requirement.
    lib = await Library.open(
        "corpus.duckdb",
        embedding="auto",        # "auto" | "local" | "remote"
        model="qwen3_600m",
    )

    # Ingest. Idempotent on content hash; the loader is chosen by extension.
    # The four top-level fields are auto-filled; override any via metadata=,
    # and add arbitrary keys (stored in the metadata JSON).
    await lib.ingest("paper.pdf", metadata={"doc_type": "paper", "author": "Smith"})
    await lib.ingest("notes.md")

    # Batch ingest yields a result per document as it finishes.
    async for r in lib.ingest_batch(["a.pdf", "b.pdf", "c.md"]):
        if r.error:
            print("failed:", r.path, r.error)

    # Hybrid query with optional recency/attribute weighting and filters.
    results = await lib.query(
        "how does the cache eviction work?",
        limit=8,
        weights=QueryWeights(recency=0.2, attributes={"doc_type": {"paper": 1.5}}),
        filters={"author": "Smith"},
    )
    for r in results:
        print(f"{r.score:.3f}  {r.chunk.source_url}\n      {r.chunk.text[:100]}")

    await lib.close()


asyncio.run(main())
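
For intuition about what `attributes={"doc_type": {"paper": 1.5}}` could mean, the sketch below multiplies a hit's score by the boost for each matching attribute value and re-sorts. This is an assumed formula, not libkit's documented behavior; the `Hit` class and `apply_attribute_weights` function are hypothetical, and recency weighting is left out because the source does not describe how it is computed.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical stand-in for a search result."""

    score: float
    attrs: dict[str, str] = field(default_factory=dict)


def apply_attribute_weights(
    hits: list[Hit], attributes: dict[str, dict[str, float]]
) -> list[Hit]:
    """Multiply each score by the boost of every matching attribute value."""
    reweighted = []
    for hit in hits:
        factor = 1.0
        for key, boosts in attributes.items():
            # Values without a configured boost keep a neutral factor of 1.0.
            factor *= boosts.get(hit.attrs.get(key), 1.0)
        reweighted.append(Hit(hit.score * factor, hit.attrs))
    return sorted(reweighted, key=lambda h: h.score, reverse=True)


hits = [
    Hit(0.50, {"doc_type": "blog"}),
    Hit(0.40, {"doc_type": "paper"}),
]
ranked = apply_attribute_weights(hits, {"doc_type": {"paper": 1.5}})
```

Under this assumption the paper's score rises from 0.40 to about 0.60, so it overtakes the blog post at 0.50.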

Full control

Library.open() is a convenience over an explicit, frozen LibraryConfig:

from libkit import Library, LibraryConfig
from libkit.embedders import default_embedder
from libkit.loaders import MarkdownLoader

lib = Library(
    LibraryConfig(
        db_path="corpus.duckdb",
        embedder=default_embedder(embedding="remote"),
        loaders={".md": MarkdownLoader()},
        chunk_size_tokens=512,
        chunk_overlap_tokens=64,
    )
)
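
The `chunk_size_tokens` and `chunk_overlap_tokens` settings suggest a sliding window over tokens. The sketch below shows that general technique on a plain list of tokens; it is not libkit's chunker, which may use a different tokenizer and respect document structure. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

With ten tokens, a size of 4, and an overlap of 1, this yields three chunks, each starting with the last token of the previous one.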

Serve over MCP

import asyncio

from libkit import Library
from libkit.mcp import serve_mcp        # requires the `mcp` extra

async def main():
    lib = await Library.open("corpus.duckdb")
    await serve_mcp(lib)                 # exposes ingest/query/get/list/delete tools

asyncio.run(main())

How it works

ingest → load (PDF/MD/Office → Markdown) → chunk → embed → DuckDB
query  → embed query → [vector top-k ‖ FTS top-k] → RRF fuse
         → optional rerank → attribute weighting → results

See docs/DESIGN.md for the full design — schema, the adaptive-concurrency pipeline, caching, and the correctness invariants.
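
The RRF fuse step in the query pipeline can be sketched in a few lines. This is the standard Reciprocal Rank Fusion formula, in which each list contributes 1 / (k + rank) per document, and not libkit's own code. The function name `rrf_fuse` is hypothetical, and `k = 60` is the commonly used constant; the source does not say which value libkit uses.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["c1", "c2", "c3"]  # hypothetical chunk IDs from vector search
fts_hits = ["c3", "c1", "c4"]     # hypothetical chunk IDs from full-text search
fused = rrf_fuse([vector_hits, fts_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Chunks that appear in both lists (`c1`, `c3`) end up ahead of chunks found by only one retriever, which is the point of fusing by rank rather than by raw score.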

Status

libkit is still pre-1.0 (this release is 0.2.1): the API is usable and tested, but may still shift before 1.0. Issues and PRs are welcome; see CONTRIBUTING.md.

License

MIT © Sam Quigley

Download files

Source distribution: libkit-0.2.1.tar.gz (534.9 kB)
Built distribution: libkit-0.2.1-py3-none-any.whl (101.2 kB, Python 3)

Both files were uploaded via twine/6.1.0 CPython/3.13.12 using Trusted Publishing. The attestation bundles for both name release.yml on emerose/libkit as the publisher. Attestation values reflect the state when the release was signed and may no longer be current.

Hashes for libkit-0.2.1.tar.gz

Algorithm    Hash digest
SHA256       b7ebebaab8c29f32d8e6a29fc88db157ece72a95abd8fb61ef266141a92555f3
MD5          4028fc9553c0edf545713e8ebdbcb171
BLAKE2b-256  a6e37d9efead6ee04944682f5cef19a776c8c6f8baee639f7aae8f0b96a64af8

Hashes for libkit-0.2.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       30142701c296356b48720122199069e1cd154d63e0d4aadc7bbee048872d4dd7
MD5          583f7c55d86c4d5e4f1f4cea7d341b54
BLAKE2b-256  22fb3054fb73b2ea19fc6fec53ee847e694327a56cb705b61915d311bb882123