Toolkit for the retrieval half of RAG: ingest, chunk, embed, store, and hybrid-search a document corpus for LLM skills and MCP services

libkit

libkit is a toolkit for the retrieval half of RAG. It ingests documents (PDF, Markdown, Office), chunks and embeds them, stores everything in a single DuckDB file, and answers queries with hybrid search (vector + full-text, fused with RRF) plus optional reranking and attribute weighting.

There's no generation here — libkit gives you the building blocks to stand up a knowledge base for an LLM skill or an MCP service, with sensible defaults and an "it just works" entry point.

import asyncio
from libkit import Library

async def main():
    lib = await Library.open("corpus.duckdb")        # smart defaults
    await lib.ingest("paper.pdf")                      # → chunk → embed → store
    hits = await lib.query("how does cache eviction work?", limit=5)
    for h in hits:
        print(h.score, h.chunk.text[:80])

asyncio.run(main())

Why libkit

  • Async-first, batteries-included. Library.open() wires up a recommended embedder, the standard loader map, persistent caching, and adaptive request coalescing — every piece overridable.
  • Hybrid retrieval. Dense vector search and DuckDB full-text BM25 run in parallel and fuse with Reciprocal Rank Fusion; an optional cross-encoder reranker and per-query attribute weighting refine the ranking.
  • One file, no services. Documents, chunks, vectors, and the FTS index all live in a single DuckDB database. No external vector DB to run.
  • Generic metadata. Four auto-filled top-level fields (source_url, content_type, title, date) plus a free-form metadata JSON column; filters and weights work over both.
  • Pluggable backends. Loaders, embedders, and rerankers are injected as protocol-conforming instances — bring your own, or use the bundled adapters (OpenAI, DeepInfra, vLLM, local MLX/torch, Cohere, ZeroEntropy, Datalab, pdfmux, LibreOffice).
  • Strictly typed. Ships py.typed; pyright-checked.
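
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk ids. It is not libkit's implementation: the function name `rrf_fuse`, the chunk ids, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from each retriever (best hit first).
vector_hits = ["c3", "c1", "c7"]
fts_hits = ["c1", "c9", "c3"]

fused = rrf_fuse([vector_hits, fts_hits])
for chunk_id, score in fused:
    print(f"{chunk_id}  {score:.5f}")
```

Chunks that appear in both lists ("c1", "c3") outrank chunks found by only one retriever, which is the point of fusing on ranks rather than on raw scores that live on different scales.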

Install

pip install libkit            # or: uv add libkit

libkit's core is pure-Python with a small dependency set. Heavier or service-specific backends are opt-in extras:

| Extra | Pulls in | For |
|---|---|---|
| pdf | pdfmux | Local PDF extraction |
| cohere | cohere | Cohere reranker |
| zeroentropy | httpx | ZeroEntropy hosted reranker |
| local-rerank | sentence-transformers, accelerate | In-process cross-encoder rerank |
| mcp | mcp | Serve a Library over MCP |
| fancychunk-torch / fancychunk-mlx / fancychunk-cuda | fancychunk | Local embedding/chunking |

pip install "libkit[pdf,cohere,mcp]"

Some embedders/loaders call hosted APIs (OpenAI, DeepInfra, Cohere, ZeroEntropy, Datalab) and read their keys from the environment (OPENAI_API_KEY, DEEPINFRA_API_KEY, DATALAB_API_KEY, …).
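
Since hosted backends read their keys from the environment, a small preflight check can fail fast before a long ingest starts. This is a sketch under assumptions: `missing_keys` is a hypothetical helper, not part of libkit, and which variables you need depends on the backends you actually configure.

```python
import os
from collections.abc import Mapping
from typing import Optional


def missing_keys(required: list[str], env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping so the result does not depend on your shell.
example_env = {"OPENAI_API_KEY": "sk-example", "DEEPINFRA_API_KEY": ""}
missing = missing_keys(["OPENAI_API_KEY", "DEEPINFRA_API_KEY"], env=example_env)
if missing:
    print("missing API keys:", ", ".join(missing))
```

Calling `missing_keys([...])` without `env` checks the real process environment.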

Quickstart

import asyncio
from libkit import Library, QueryWeights


async def main():
    # Smart defaults: remote bulk-ingest embeddings, local interactive query
    # embeddings, caching, and coalescing. db_path is the only requirement.
    lib = await Library.open(
        "corpus.duckdb",
        embedding="auto",        # "auto" | "local" | "remote"
        model="qwen3_600m",
    )

    # Ingest. Idempotent on content hash; the loader is chosen by extension.
    # The four top-level fields are auto-filled; override any via metadata=,
    # and add arbitrary keys (stored in the metadata JSON).
    await lib.ingest("paper.pdf", metadata={"doc_type": "paper", "author": "Smith"})
    await lib.ingest("notes.md")

    # Batch ingest yields a result per document as it finishes.
    async for r in lib.ingest_batch(["a.pdf", "b.pdf", "c.md"]):
        if r.error:
            print("failed:", r.path, r.error)

    # Hybrid query with optional recency/attribute weighting and filters.
    results = await lib.query(
        "how does the cache eviction work?",
        limit=8,
        weights=QueryWeights(recency=0.2, attributes={"doc_type": {"paper": 1.5}}),
        filters={"author": "Smith"},
    )
    for r in results:
        print(f"{r.score:.3f}  {r.chunk.source_url}\n      {r.chunk.text[:100]}")

    await lib.close()


asyncio.run(main())
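
The "idempotent on content hash" behavior mentioned in the Quickstart comments can be pictured with a toy in-memory store. This is only an illustration of the idea: the class `HashKeyedStore` is hypothetical, SHA-256 is an assumed choice of hash, and libkit's actual storage lives in DuckDB.

```python
import hashlib


class HashKeyedStore:
    """Toy stand-in for a document store keyed on the hash of the content."""

    def __init__(self) -> None:
        self._sources: dict[str, str] = {}

    def ingest(self, content: bytes, source: str) -> bool:
        """Store the content once; return False if identical bytes were seen before."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._sources:
            return False  # same bytes already ingested, nothing to do
        self._sources[digest] = source
        return True

    def __len__(self) -> int:
        return len(self._sources)


store = HashKeyedStore()
first = store.ingest(b"# Notes\ncache eviction", "notes.md")
again = store.ingest(b"# Notes\ncache eviction", "copy-of-notes.md")
other = store.ingest(b"# Notes\ncache warming", "notes-v2.md")
print(first, again, other, len(store))
```

Keying on the content rather than the path means re-running an ingest over the same files is cheap, while an edited file hashes differently and is treated as new.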

Full control

Library.open() is a convenience over an explicit, frozen LibraryConfig:

from libkit import Library, LibraryConfig
from libkit.embedders import default_embedder
from libkit.loaders import MarkdownLoader

lib = Library(
    LibraryConfig(
        db_path="corpus.duckdb",
        embedder=default_embedder(embedding="remote"),
        loaders={".md": MarkdownLoader()},
        chunk_size_tokens=512,
        chunk_overlap_tokens=64,
    )
)
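
To show what chunk_size_tokens and chunk_overlap_tokens control, here is a minimal sliding-window sketch. It is not libkit's chunker: `chunk_tokens` is a hypothetical helper, and plain string tokens stand in for whatever tokenizer the configured embedder uses.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# 1000 placeholder tokens with the sizes used in the config above.
tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens, size=512, overlap=64)
print([len(c) for c in chunks])
```

Each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear at the start of the next and a sentence that straddles a boundary stays retrievable from at least one chunk.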

Serve over MCP

import asyncio
from libkit import Library
from libkit.mcp import serve_mcp        # requires the `mcp` extra

async def main():
    lib = await Library.open("corpus.duckdb")
    await serve_mcp(lib)                 # exposes ingest/query/get/list/delete tools

asyncio.run(main())

How it works

ingest → load (PDF/MD/Office → Markdown) → chunk → embed → DuckDB
query  → embed query → [vector top-k ‖ FTS top-k] → RRF fuse
         → optional rerank → attribute weighting → results
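
The final attribute-weighting stage can be sketched as a multiplicative boost applied to already-fused scores. This is an assumption about the general technique, not libkit's formula: `apply_attribute_weights` is a hypothetical helper, and the weight mapping simply mirrors the shape passed to QueryWeights in the Quickstart.

```python
def apply_attribute_weights(
    hits: list[tuple[float, dict[str, str]]],
    attribute_weights: dict[str, dict[str, float]],
) -> list[tuple[float, dict[str, str]]]:
    """Multiply each hit's score by the weight of every matching attribute value."""
    reweighted = []
    for score, metadata in hits:
        factor = 1.0
        for key, value_weights in attribute_weights.items():
            # Values without an explicit weight keep a neutral factor of 1.0.
            factor *= value_weights.get(metadata.get(key), 1.0)
        reweighted.append((score * factor, metadata))
    return sorted(reweighted, key=lambda hit: hit[0], reverse=True)


# Hypothetical fused scores paired with each chunk's metadata.
hits = [
    (0.030, {"doc_type": "note", "author": "Smith"}),
    (0.025, {"doc_type": "paper", "author": "Smith"}),
]
ranked = apply_attribute_weights(hits, {"doc_type": {"paper": 1.5}})
for score, metadata in ranked:
    print(f"{score:.4f}  {metadata['doc_type']}")
```

With a 1.5x boost the paper moves ahead of the note even though its fused score started lower, which is the kind of per-query nudge the weighting step is meant to provide.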

See docs/DESIGN.md for the full design — schema, the adaptive-concurrency pipeline, caching, and the correctness invariants.

Status

libkit is at 0.1 — the API is usable and tested, but may still shift before 1.0. Issues and PRs welcome; see CONTRIBUTING.md.

License

MIT © Sam Quigley


