Toolkit for the retrieval half of RAG: ingest, chunk, embed, store, and hybrid-search a document corpus for LLM skills and MCP services

libkit

libkit is a toolkit for the retrieval half of RAG. It ingests documents (PDF, Markdown, Office), chunks and embeds them, stores everything in a single DuckDB file, and answers queries with hybrid search (vector + full-text, fused with RRF) plus optional reranking and attribute weighting.

There's no generation here — libkit gives you the building blocks to stand up a knowledge base for an LLM skill or an MCP service, with sensible defaults and an "it just works" entry point.

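# Top-level `await` needs an async context, e.g. the `python -m asyncio` REPL
# or a notebook; in a script, wrap these lines in `async def main()`.
from libkit import Library

lib = await Library.open("corpus.duckdb")        # smart defaults
await lib.ingest("paper.pdf")                      # → chunk → embed → store
hits = await lib.query("how does cache eviction work?", limit=5)
for h in hits:
    print(h.score, h.chunk.text[:80])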

Why libkit

  • Async-first, batteries-included. Library.open() wires up a recommended embedder, the standard loader map, persistent caching, and adaptive request coalescing — every piece overridable.
  • Hybrid retrieval. Dense vector search and DuckDB full-text BM25 run in parallel and fuse with Reciprocal Rank Fusion; an optional cross-encoder reranker and per-query attribute weighting refine the ranking.
  • One file, no services. Documents, chunks, vectors, and the FTS index all live in a single DuckDB database. No external vector DB to run.
  • Generic metadata. Four auto-filled top-level fields (source_url, content_type, title, date) plus a free-form metadata JSON column; filters and weights work over both.
  • Pluggable backends. Loaders, embedders, and rerankers are injected as protocol-conforming instances — bring your own, or use the bundled adapters (OpenAI, DeepInfra, vLLM, local MLX/torch, Cohere, ZeroEntropy, Datalab, pdfmux, LibreOffice).
  • Strictly typed. Ships py.typed; pyright-checked.
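To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. It is an illustration of the general technique, not libkit's code: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the chunk IDs are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]   # hypothetical chunk IDs from vector search
fts_hits = ["c1", "c9", "c3"]      # hypothetical chunk IDs from full-text search
fused = rrf_fuse([vector_hits, fts_hits])
for chunk_id, score in fused:
    print(f"{chunk_id}  {score:.5f}")
```

Because RRF uses only ranks, the vector and BM25 scores never need to be put on a common scale, which is the usual reason to prefer it over a weighted score sum.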

Install

pip install libkit            # or: uv add libkit

libkit's core is pure-Python with a small dependency set. Heavier or service-specific backends are opt-in extras:

| Extra | Pulls in | For |
| --- | --- | --- |
| pdf | pdfmux | Local PDF extraction |
| cohere | cohere | Cohere reranker |
| zeroentropy | httpx | ZeroEntropy hosted reranker |
| local-rerank | sentence-transformers, accelerate | In-process cross-encoder rerank |
| mcp | mcp | Serve a Library over MCP |
| fancychunk-torch / fancychunk-mlx / fancychunk-cuda | fancychunk | Local embedding/chunking |

pip install "libkit[pdf,cohere,mcp]"

Some embedders/loaders call hosted APIs (OpenAI, DeepInfra, Cohere, ZeroEntropy, Datalab) and read their keys from the environment (OPENAI_API_KEY, DEEPINFRA_API_KEY, DATALAB_API_KEY, …).
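A small pre-flight check can catch a missing key before an ingest run starts. This is a sketch using only the standard library; `missing_keys` is a hypothetical helper, not part of libkit, and the list holds only the variable names spelled out above.

```python
import os
from collections.abc import Mapping


def missing_keys(required: list[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Only the key names that appear in the text above; trim to the backends you use.
needed = ["OPENAI_API_KEY", "DEEPINFRA_API_KEY", "DATALAB_API_KEY"]
absent = missing_keys(needed)
if absent:
    print("missing API keys:", ", ".join(absent))
```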

Quickstart

import asyncio
from libkit import Library, QueryWeights


async def main():
    # Smart defaults: remote bulk-ingest embeddings, local interactive query
    # embeddings, caching, and coalescing. db_path is the only requirement.
    lib = await Library.open(
        "corpus.duckdb",
        embedding="auto",        # "auto" | "local" | "remote"
        model="qwen3_600m",
    )

    # Ingest. Idempotent on content hash; the loader is chosen by extension.
    # The four top-level fields are auto-filled; override any via metadata=,
    # and add arbitrary keys (stored in the metadata JSON).
    await lib.ingest("paper.pdf", metadata={"doc_type": "paper", "author": "Smith"})
    await lib.ingest("notes.md")

    # Batch ingest yields a result per document as it finishes.
    async for r in lib.ingest_batch(["a.pdf", "b.pdf", "c.md"]):
        if r.error:
            print("failed:", r.path, r.error)

    # Hybrid query with optional recency/attribute weighting and filters.
    results = await lib.query(
        "how does cache eviction work?",
        limit=8,
        weights=QueryWeights(recency=0.2, attributes={"doc_type": {"paper": 1.5}}),
        filters={"author": "Smith"},
    )
    for r in results:
        print(f"{r.score:.3f}  {r.chunk.source_url}\n      {r.chunk.text[:100]}")

    await lib.close()


asyncio.run(main())
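The "idempotent on content hash" comment in the quickstart can be pictured with a short sketch. It assumes SHA-256 over the raw file bytes and an in-memory dictionary; which hash libkit uses, and whether it hashes raw bytes or extracted text, is not stated here, and `SeenStore` is a hypothetical stand-in for the real DuckDB-backed store.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex SHA-256 digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class SeenStore:
    """In-memory stand-in for a store keyed by content hash."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def ingest(self, name: str, data: bytes) -> bool:
        """Store the document; return False if identical content already exists."""
        digest = content_hash(data)
        if digest in self._by_hash:
            return False
        self._by_hash[digest] = name
        return True


store = SeenStore()
first = store.ingest("notes.md", b"# Notes\n")
again = store.ingest("notes-copy.md", b"# Notes\n")   # same bytes, different name
print(first, again)
```

Keying on content rather than path means re-running an ingest over the same files does no duplicate work, while an edited file gets a new hash and is processed again.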

Full control

Library.open() is a convenience over an explicit, frozen LibraryConfig:

from libkit import Library, LibraryConfig
from libkit.embedders import default_embedder
from libkit.loaders import MarkdownLoader

lib = Library(
    LibraryConfig(
        db_path="corpus.duckdb",
        embedder=default_embedder(embedding="remote"),
        loaders={".md": MarkdownLoader()},
        chunk_size_tokens=512,
        chunk_overlap_tokens=64,
    )
)
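To show how `chunk_size_tokens=512` and `chunk_overlap_tokens=64` interact, here is a sliding-window sketch over an already tokenized document. It is a simplified illustration: `chunk_tokens` is a hypothetical function, and a real chunker may also respect sentence or heading boundaries.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(1200)]      # stand-in for real tokenizer output
chunks = chunk_tokens(tokens, size=512, overlap=64)
print([len(c) for c in chunks])
```

With these settings each window starts 448 tokens after the previous one, so the last 64 tokens of a chunk reappear at the start of the next.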

Serve over MCP

# Top-level `await` needs an async context; in a script, wrap in `async def main()`.
from libkit import Library
from libkit.mcp import serve_mcp        # requires the `mcp` extra

lib = await Library.open("corpus.duckdb")
await serve_mcp(lib)                     # exposes ingest/query/get/list/delete tools

How it works

ingest → load (PDF/MD/Office → Markdown) → chunk → embed → DuckDB
query  → embed query → [vector top-k ‖ FTS top-k] → RRF fuse
         → optional rerank → attribute weighting → results

See docs/DESIGN.md for the full design — schema, the adaptive-concurrency pipeline, caching, and the correctness invariants.
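The attribute-weighting stage at the end of the query path can be sketched as a rescoring pass. This assumes boosts are multiplicative and applied to the fused score, which is one plausible reading of `attributes={"doc_type": {"paper": 1.5}}`; the actual formula is in the design document, and `apply_attribute_weights` is a hypothetical name.

```python
def apply_attribute_weights(
    hits: list[tuple[float, dict[str, str]]],
    weights: dict[str, dict[str, float]],
) -> list[tuple[float, dict[str, str]]]:
    """Multiply each score by the boost for every matching attribute value."""
    rescored = []
    for score, meta in hits:
        for field, boosts in weights.items():
            # Values without a configured boost keep a neutral factor of 1.0.
            score *= boosts.get(meta.get(field, ""), 1.0)
        rescored.append((score, meta))
    return sorted(rescored, key=lambda hit: hit[0], reverse=True)


# Hypothetical fused scores with their metadata.
hits = [(0.030, {"doc_type": "note"}), (0.025, {"doc_type": "paper"})]
reranked = apply_attribute_weights(hits, {"doc_type": {"paper": 1.5}})
for score, meta in reranked:
    print(f"{score:.4f}  {meta['doc_type']}")
```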

Status

libkit is still 0.x: the API is usable and tested, but may change before 1.0. Issues and PRs welcome; see CONTRIBUTING.md.

License

MIT © Sam Quigley
