
Toolkit for the retrieval half of RAG: ingest, chunk, embed, store, and hybrid-search a document corpus for LLM skills and MCP services


libkit


libkit is a toolkit for the retrieval half of RAG. It ingests documents (PDF, Markdown, Office), chunks and embeds them, stores everything in a single DuckDB file, and answers queries with hybrid search (vector + full-text, fused with RRF) plus optional reranking and attribute weighting.

There's no generation here — libkit gives you the building blocks to stand up a knowledge base for an LLM skill or an MCP service, with sensible defaults and an "it just works" entry point.

import asyncio
from libkit import Library

async def main():
    lib = await Library.open("corpus.duckdb")        # smart defaults
    await lib.ingest("paper.pdf")                      # → chunk → embed → store
    hits = await lib.query("how does cache eviction work?", limit=5)
    for h in hits:
        print(h.score, h.chunk.text[:80])
    await lib.close()

asyncio.run(main())                                    # await needs a running event loop

Why libkit

  • Async-first, batteries-included. Library.open() wires up a recommended embedder, the standard loader map, persistent caching, and adaptive request coalescing — every piece overridable.
  • Hybrid retrieval. Dense vector search and DuckDB full-text BM25 run in parallel and fuse with Reciprocal Rank Fusion; an optional cross-encoder reranker and per-query attribute weighting refine the ranking.
  • One file, no services. Documents, chunks, vectors, and the FTS index all live in a single DuckDB database. No external vector DB to run.
  • Generic metadata. Four auto-filled top-level fields (source_url, content_type, title, date) plus a free-form metadata JSON column; filters and weights work over both.
  • Pluggable backends. Loaders, embedders, and rerankers are injected as protocol-conforming instances — bring your own, or use the bundled adapters (OpenAI, DeepInfra, vLLM, local MLX/torch, Cohere, ZeroEntropy, Datalab, pdfmux, LibreOffice).
  • Strictly typed. Ships py.typed; pyright-checked.
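
The fusion step named in the hybrid-retrieval bullet can be sketched in a few lines. This is a minimal, standalone illustration of Reciprocal Rank Fusion, not libkit's internal code: the function name `rrf_fuse`, the chunk IDs, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; an ID absent from a list contributes nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical chunk IDs from vector search
fts_hits = ["c1", "c9", "c3"]     # hypothetical chunk IDs from full-text BM25
fused = rrf_fuse([vector_hits, fts_hits])
print([doc_id for doc_id, _ in fused])  # ['c1', 'c3', 'c9', 'c7']
```

Because RRF uses only ranks, the vector distances and BM25 scores never need to be put on a common scale, which is why it suits mixing two very different retrievers.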

Install

pip install libkit            # or: uv add libkit

libkit's core is pure-Python with a small dependency set. Heavier or service-specific backends are opt-in extras:

| Extra | Pulls in | For |
|---|---|---|
| pdf | pdfmux | Local PDF extraction |
| cohere | cohere | Cohere reranker |
| zeroentropy | httpx | ZeroEntropy hosted reranker |
| local-rerank | sentence-transformers, accelerate | In-process cross-encoder rerank |
| mcp | mcp | Serve a Library over MCP |
| fancychunk-torch / fancychunk-mlx / fancychunk-cuda | fancychunk | Local embedding/chunking |

pip install "libkit[pdf,cohere,mcp]"

Some embedders/loaders call hosted APIs (OpenAI, DeepInfra, Cohere, ZeroEntropy, Datalab) and read their keys from the environment (OPENAI_API_KEY, DEEPINFRA_API_KEY, DATALAB_API_KEY, …).

Quickstart

import asyncio
from libkit import Library, QueryWeights


async def main():
    # Smart defaults: remote bulk-ingest embeddings, local interactive query
    # embeddings, caching, and coalescing. db_path is the only requirement.
    lib = await Library.open(
        "corpus.duckdb",
        embedding="auto",        # "auto" | "local" | "remote"
        model="qwen3_600m",
    )

    # Ingest. Idempotent on content hash; the loader is chosen by extension.
    # The four top-level fields are auto-filled; override any via metadata=,
    # and add arbitrary keys (stored in the metadata JSON).
    await lib.ingest("paper.pdf", metadata={"doc_type": "paper", "author": "Smith"})
    await lib.ingest("notes.md")

    # Batch ingest yields a result per document as it finishes.
    async for r in lib.ingest_batch(["a.pdf", "b.pdf", "c.md"]):
        if r.error:
            print("failed:", r.path, r.error)

    # Hybrid query with optional recency/attribute weighting and filters.
    results = await lib.query(
        "how does the cache eviction work?",
        limit=8,
        weights=QueryWeights(recency=0.2, attributes={"doc_type": {"paper": 1.5}}),
        filters={"author": "Smith"},
    )
    for r in results:
        print(f"{r.score:.3f}  {r.chunk.source_url}\n      {r.chunk.text[:100]}")

    await lib.close()


asyncio.run(main())
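
The "idempotent on content hash" behaviour noted in the ingest comment can be pictured with a toy store. This sketch is an assumption-laden illustration, not libkit's implementation: `SeenStore` is a hypothetical in-memory stand-in for the document table, and SHA-256 is chosen here only because the README does not say which hash libkit uses.

```python
import hashlib


class SeenStore:
    """Toy in-memory stand-in for a document table keyed by content hash."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def ingest(self, name: str, content: bytes) -> bool:
        """Store content once; return True if newly stored, False if a duplicate."""
        digest = hashlib.sha256(content).hexdigest()  # assumed hash function
        if digest in self._by_hash:
            return False  # same bytes seen before: skip chunking and embedding
        self._by_hash[digest] = name
        return True


store = SeenStore()
first = store.ingest("notes.md", b"# Notes\nhello\n")
again = store.ingest("notes-copy.md", b"# Notes\nhello\n")     # same bytes, new name
changed = store.ingest("notes.md", b"# Notes\nhello, world\n")  # same name, new bytes
print(first, again, changed)  # True False True
```

Keying on content rather than path means re-running an ingest job over an unchanged corpus costs no embedding calls, while an edited file is picked up as new.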

Full control

Library.open() is a convenience over an explicit, frozen LibraryConfig:

from libkit import Library, LibraryConfig
from libkit.embedders import default_embedder
from libkit.loaders import MarkdownLoader

lib = Library(
    LibraryConfig(
        db_path="corpus.duckdb",
        embedder=default_embedder(embedding="remote"),
        loaders={".md": MarkdownLoader()},
        chunk_size_tokens=512,
        chunk_overlap_tokens=64,
    )
)
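
To see how `chunk_size_tokens` and `chunk_overlap_tokens` interact, here is a minimal sliding-window sketch. It is an illustration under stated assumptions, not libkit's chunker: the function `chunk_tokens` is hypothetical, the "tokens" are plain strings rather than real tokenizer output, and a real chunker may also respect headings or sentence boundaries.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With the values from the config above (512 and 64), each window would advance by 448 tokens, so text near a chunk boundary appears in both neighbouring chunks.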

Serve over MCP

import asyncio
from libkit import Library
from libkit.mcp import serve_mcp        # requires the `mcp` extra

async def main():
    lib = await Library.open("corpus.duckdb")
    await serve_mcp(lib)                 # exposes ingest/query/get/list/delete tools

asyncio.run(main())

How it works

ingest → load (PDF/MD/Office → Markdown) → chunk → embed → DuckDB
query  → embed query → [vector top-k ‖ FTS top-k] → RRF fuse
         → optional rerank → attribute weighting → results
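
The attribute-weighting stage at the end of the query pipeline can be pictured as a re-scoring pass over the fused hits. The sketch below assumes multiplicative boosts, which is one plausible reading of a weight such as `{"doc_type": {"paper": 1.5}}`; the README does not specify the formula, and `Hit` and `apply_attribute_weights` are hypothetical names, not libkit types.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    chunk_id: str
    score: float
    attrs: dict[str, str] = field(default_factory=dict)


def apply_attribute_weights(
    hits: list[Hit], weights: dict[str, dict[str, float]]
) -> list[Hit]:
    """Multiply each score by the boost for every matching attribute value (assumed semantics)."""
    reweighted: list[Hit] = []
    for hit in hits:
        factor = 1.0
        for attr, table in weights.items():
            value = hit.attrs.get(attr)
            if value is not None:
                factor *= table.get(value, 1.0)  # unlisted values are left unchanged
        reweighted.append(Hit(hit.chunk_id, hit.score * factor, hit.attrs))
    return sorted(reweighted, key=lambda h: h.score, reverse=True)


hits = [
    Hit("c1", 0.030, {"doc_type": "blog"}),
    Hit("c3", 0.025, {"doc_type": "paper"}),
]
ranked = apply_attribute_weights(hits, {"doc_type": {"paper": 1.5}})
print([(h.chunk_id, round(h.score, 4)) for h in ranked])  # [('c3', 0.0375), ('c1', 0.03)]
```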

See docs/DESIGN.md for the full design — schema, the adaptive-concurrency pipeline, caching, and the correctness invariants.

Status

libkit is pre-1.0: the API is usable and tested, but may still change before 1.0. Issues and PRs are welcome; see CONTRIBUTING.md.

License

MIT © Sam Quigley
