Toolkit for the retrieval half of RAG: ingest, chunk, embed, store, and hybrid-search a document corpus for LLM skills and MCP services

These details have been verified by PyPI

Maintainers

emerose

Project description

libkit

libkit is a toolkit for the retrieval half of RAG. It ingests documents (PDF, Markdown, Office), chunks and embeds them, stores everything in a single DuckDB file, and answers queries with hybrid search (vector + full-text, fused with RRF) plus optional reranking and attribute weighting.

There's no generation here — libkit gives you the building blocks to stand up a knowledge base for an LLM skill or an MCP service, with sensible defaults and an "it just works" entry point.

import asyncio
from libkit import Library

async def main():
    lib = await Library.open("corpus.duckdb")        # smart defaults
    await lib.ingest("paper.pdf")                      # → chunk → embed → store
    hits = await lib.query("how does cache eviction work?", limit=5)
    for h in hits:
        print(h.score, h.chunk.text[:80])
    await lib.close()

asyncio.run(main())                                    # top-level await needs an event loop

Why libkit

- Async-first, batteries-included. Library.open() wires up a recommended embedder, the standard loader map, persistent caching, and adaptive request coalescing; every piece is overridable.
- Hybrid retrieval. Dense vector search and DuckDB full-text BM25 run in parallel and are fused with Reciprocal Rank Fusion; an optional cross-encoder reranker and per-query attribute weighting refine the ranking.
- One file, no services. Documents, chunks, vectors, and the FTS index all live in a single DuckDB database. There is no external vector DB to run.
- Generic metadata. Four auto-filled top-level fields (source_url, content_type, title, date) plus a free-form metadata JSON column; filters and weights work over both.
- Pluggable backends. Loaders, embedders, and rerankers are injected as protocol-conforming instances: bring your own, or use the bundled adapters (OpenAI, DeepInfra, vLLM, local MLX/torch, Cohere, ZeroEntropy, Datalab, pdfmux, LibreOffice).
- Strictly typed. Ships py.typed; pyright-checked.
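
For readers unfamiliar with Reciprocal Rank Fusion, the idea fits in a few lines. The following is a standalone sketch, not libkit's implementation: the function name `rrf_fuse`, the constant `k=60` (a value commonly used for RRF), and the sample chunk IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]   # hypothetical dense-search ranking
fts_hits = ["c1", "c9", "c3"]      # hypothetical BM25 ranking
fused = rrf_fuse([vector_hits, fts_hits])
```

Because RRF uses only ranks, the two retrievers' raw scores (cosine similarity and BM25) never need to be put on a common scale; an item that ranks well in both lists rises above one that ranks first in only one.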

Install

pip install libkit            # or: uv add libkit

libkit's core is pure-Python with a small dependency set. Heavier or service-specific backends are opt-in extras:

Extra	Pulls in	For
`pdf`	`pdfmux`	Local PDF extraction
`cohere`	`cohere`	Cohere reranker
`zeroentropy`	`httpx`	ZeroEntropy hosted reranker
`local-rerank`	`sentence-transformers`, `accelerate`	In-process cross-encoder rerank
`mcp`	`mcp`	Serve a `Library` over MCP
`fancychunk-torch` / `fancychunk-mlx` / `fancychunk-cuda`	`fancychunk`	Local embedding/chunking

pip install "libkit[pdf,cohere,mcp]"

Some embedders/loaders call hosted APIs (OpenAI, DeepInfra, Cohere, ZeroEntropy, Datalab) and read their keys from the environment (OPENAI_API_KEY, DEEPINFRA_API_KEY, DATALAB_API_KEY, …).
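
A quick pre-flight check can catch a missing key before a long ingest run starts. This sketch uses only the standard library; the helper name `missing_keys` is hypothetical, and the list contains only the variable names spelled out above (which keys you actually need depends on the backends you configure).

```python
import os
from collections.abc import Mapping
from typing import Optional

# Names taken from the text above; extend with whatever your backends require.
KNOWN_KEYS = ("OPENAI_API_KEY", "DEEPINFRA_API_KEY", "DATALAB_API_KEY")


def missing_keys(
    required: tuple[str, ...],
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys(KNOWN_KEYS):
        print(f"warning: {name} is not set")
```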

Quickstart

import asyncio
from libkit import Library, QueryWeights


async def main():
    # Smart defaults: remote bulk-ingest embeddings, local interactive query
    # embeddings, caching, and coalescing. db_path is the only requirement.
    lib = await Library.open(
        "corpus.duckdb",
        embedding="auto",        # "auto" | "local" | "remote"
        model="qwen3_600m",
    )

    # Ingest. Idempotent on content hash; the loader is chosen by extension.
    # The four top-level fields are auto-filled; override any via metadata=,
    # and add arbitrary keys (stored in the metadata JSON).
    await lib.ingest("paper.pdf", metadata={"doc_type": "paper", "author": "Smith"})
    await lib.ingest("notes.md")

    # Batch ingest yields a result per document as it finishes.
    async for r in lib.ingest_batch(["a.pdf", "b.pdf", "c.md"]):
        if r.error:
            print("failed:", r.path, r.error)

    # Hybrid query with optional recency/attribute weighting and filters.
    results = await lib.query(
        "how does the cache eviction work?",
        limit=8,
        weights=QueryWeights(recency=0.2, attributes={"doc_type": {"paper": 1.5}}),
        filters={"author": "Smith"},
    )
    for r in results:
        print(f"{r.score:.3f}  {r.chunk.source_url}\n      {r.chunk.text[:100]}")

    await lib.close()


asyncio.run(main())
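
To make the `weights=` argument above more concrete, here is one plausible way recency and attribute weights could adjust a fused score. This is an illustrative assumption, not libkit's documented formula: the `Hit` class, the `reweight` function, the multiplicative boost, and the half-life decay are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Hit:
    score: float
    doc_date: date
    metadata: dict[str, str] = field(default_factory=dict)


def reweight(
    hits: list[Hit],
    attributes: dict[str, dict[str, float]],
    recency: float,
    today: date,
    half_life_days: float = 365.0,
) -> list[tuple[float, Hit]]:
    """Hypothetical scheme: multiply by attribute boosts, then blend in freshness."""
    ranked = []
    for hit in hits:
        boost = 1.0
        for key, table in attributes.items():
            # Values without an entry in the table keep a neutral boost of 1.0.
            boost *= table.get(hit.metadata.get(key), 1.0)
        age_days = max((today - hit.doc_date).days, 0)
        freshness = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
        ranked.append((hit.score * boost * ((1.0 - recency) + recency * freshness), hit))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


hits = [
    Hit(0.50, date(2020, 1, 1), {"doc_type": "note"}),
    Hit(0.40, date(2025, 1, 1), {"doc_type": "paper"}),
]
ranked = reweight(hits, {"doc_type": {"paper": 1.5}}, recency=0.2, today=date(2025, 1, 1))
```

In this toy run the newer paper overtakes the older note even though its raw score is lower, which is the kind of effect the weighting is meant to produce.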

Full control

Library.open() is a convenience over an explicit, frozen LibraryConfig:

from libkit import Library, LibraryConfig
from libkit.embedders import default_embedder
from libkit.loaders import MarkdownLoader

lib = Library(
    LibraryConfig(
        db_path="corpus.duckdb",
        embedder=default_embedder(embedding="remote"),
        loaders={".md": MarkdownLoader()},
        chunk_size_tokens=512,
        chunk_overlap_tokens=64,
    )
)
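
The two chunking parameters interact as a sliding window: each chunk holds up to `chunk_size_tokens` tokens and shares `chunk_overlap_tokens` with its predecessor. The sketch below illustrates that arithmetic only. It is not libkit's chunker, the function name `chunk_tokens` is hypothetical, and the pre-split token list stands in for a real, model-specific tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(1000)]  # stand-in for real tokenizer output
chunks = chunk_tokens(tokens)
```

With the values from the config above, the window advances 448 tokens at a time, so a 1000-token document yields three chunks and the first 64 tokens of each chunk repeat the last 64 of the previous one.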

Serve over MCP

import asyncio

from libkit import Library
from libkit.mcp import serve_mcp        # requires the `mcp` extra

async def main():
    lib = await Library.open("corpus.duckdb")
    await serve_mcp(lib)                 # exposes ingest/query/get/list/delete tools

asyncio.run(main())

How it works

ingest → load (PDF/MD/Office → Markdown) → chunk → embed → DuckDB
query  → embed query → [vector top-k ‖ FTS top-k] → RRF fuse
         → optional rerank → attribute weighting → results

See docs/DESIGN.md for the full design — schema, the adaptive-concurrency pipeline, caching, and the correctness invariants.
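
The ingest side of the flow above (pick a loader by file extension, skip content that has already been stored) can be sketched with the standard library alone. Everything here is a toy stand-in under stated assumptions: the `Loader` protocol shape, `PlainMarkdownLoader`, and the in-memory `TinyStore` are hypothetical and do not reflect libkit's actual interfaces or schema.

```python
import hashlib
from pathlib import Path
from typing import Protocol


class Loader(Protocol):
    """Assumed loader shape: raw bytes in, Markdown text out."""

    def load(self, data: bytes) -> str: ...


class PlainMarkdownLoader:
    """Toy loader that treats the bytes as UTF-8 Markdown."""

    def load(self, data: bytes) -> str:
        return data.decode("utf-8")


class TinyStore:
    """In-memory stand-in for the database, keyed by content hash."""

    def __init__(self, loaders: dict[str, Loader]) -> None:
        self.loaders = loaders
        self.docs: dict[str, str] = {}

    def ingest(self, name: str, data: bytes) -> bool:
        """Return True if the document was stored, False if already present."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.docs:
            return False  # same bytes seen before, so re-ingesting is a no-op
        loader = self.loaders[Path(name).suffix.lower()]  # KeyError if unsupported
        self.docs[digest] = loader.load(data)
        return True


store = TinyStore({".md": PlainMarkdownLoader()})
first = store.ingest("notes.md", b"# Notes\nhello")
second = store.ingest("copy-of-notes.md", b"# Notes\nhello")
```

Keying on a hash of the content rather than the file name is what makes repeated ingests safe: a renamed copy is recognised as the same document.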

Status

libkit is pre-1.0 (0.x): the API is usable and tested, but may still shift before 1.0. Issues and PRs are welcome; see CONTRIBUTING.md.

License

MIT © Sam Quigley

Release history

- 0.2.3: Jun 3, 2026
- 0.2.2: Jun 3, 2026
- 0.2.1: Jun 2, 2026
- 0.2.0: Jun 2, 2026 (this version)
- 0.1.0: Jun 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

libkit-0.2.0.tar.gz (529.7 kB)

Uploaded Jun 2, 2026 Source

Built Distribution


libkit-0.2.0-py3-none-any.whl (98.7 kB)

Uploaded Jun 2, 2026 Python 3

File details

Details for the file libkit-0.2.0.tar.gz.

File metadata

Download URL: libkit-0.2.0.tar.gz
Upload date: Jun 2, 2026
Size: 529.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for libkit-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ad3eeb454a41c2cd438e01a3444781df3505e2c40b090d84031f720c52548709`
MD5	`f206c1634fd2142a0359a2fab639c7fd`
BLAKE2b-256	`1453bed1a07dbd372afe6225877d0d38ad433246d1320e4361ac611f93a8e095`
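
A downloaded archive can be checked against the SHA256 digest in the table above with Python's standard `hashlib`. This is a minimal sketch: the helper names `sha256_of` and `verify` are hypothetical, and the file path you pass in is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path

# SHA256 digest for libkit-0.2.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "ad3eeb454a41c2cd438e01a3444781df3505e2c40b090d84031f720c52548709"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in blocks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(Path("libkit-0.2.0.tar.gz"))` should return `True` for an intact download. pip's hash-checking mode (`--require-hashes` with `--hash=sha256:...` entries in a requirements file) performs the same comparison at install time.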


Provenance

The following attestation bundles were made for libkit-0.2.0.tar.gz:

Publisher: release.yml on emerose/libkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: libkit-0.2.0.tar.gz
- Subject digest: ad3eeb454a41c2cd438e01a3444781df3505e2c40b090d84031f720c52548709
- Sigstore transparency entry: 1705467088
- Sigstore integration time: Jun 2, 2026
Source repository:
- Permalink: emerose/libkit@f329f769df59b9edef461902dae7354487f7868a
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/emerose
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f329f769df59b9edef461902dae7354487f7868a
- Trigger Event: release

File details

Details for the file libkit-0.2.0-py3-none-any.whl.

File metadata

Download URL: libkit-0.2.0-py3-none-any.whl
Upload date: Jun 2, 2026
Size: 98.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for libkit-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`edfdaea7f7d4d93282a382ed29fb0d656a7c75b9d9343c651ce8bd89595d4e71`
MD5	`51502ac44e22b59cbd6e58f96c7f301c`
BLAKE2b-256	`60b06cf194caa1e961e3b080209935b2c3b8c88f7c895258d0313dbc0e4f7772`


Provenance

The following attestation bundles were made for libkit-0.2.0-py3-none-any.whl:

Publisher: release.yml on emerose/libkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: libkit-0.2.0-py3-none-any.whl
- Subject digest: edfdaea7f7d4d93282a382ed29fb0d656a7c75b9d9343c651ce8bd89595d4e71
- Sigstore transparency entry: 1705467149
- Sigstore integration time: Jun 2, 2026
Source repository:
- Permalink: emerose/libkit@f329f769df59b9edef461902dae7354487f7868a
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/emerose
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f329f769df59b9edef461902dae7354487f7868a
- Trigger Event: release
