PyPI-friendly local knowledge base toolkit: parse PDFs, build vector stores, and query with a simple API.

kbase

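kbase is a PyPI-distributed toolkit for local knowledge bases. Its goal is to cover three tasks with minimal code:

Parse PDFs
Build a vector store
Run retrieval queries

Requires Python >= 3.11.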

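Installation

The package is published on PyPI as klynxbase and imported as kbase.

Minimal install (default parsing pipeline):

pip install klynxbase

Enable enhanced PDF parsing with Docling:

pip install "klynxbase[pdf]"

Enable sentence-transformers embeddings:

pip install "klynxbase[embed]"

All features:

pip install "klynxbase[full]"

Minimal runnable example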

from kbase import build_kb_from_pdfs, load_kb

pipeline = build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
)

mgr = load_kb(kb_path=pipeline.kb_path, kb_name="demo")
print(mgr.query("What are the key conclusions of these documents?", top_k=3))

Step-by-step API (compatible with legacy calls)

from kbase import parse_pdfs, parse_single_pdf, create_vector_kb, KBManager

# Parse every PDF in a directory
parse_pdfs("papers_to_parser", "kb_workspace/parsed/demo")

# Parse a single PDF (output_dir is optional; if omitted, the result is returned in memory and nothing is written to disk)
parse_single_pdf(
    file_path="papers_to_parser/paper_a.pdf",
    output_dir="kb_workspace/parsed/single",  # optional
)

db_path = create_vector_kb(
    name="demo_kb",
    input_dir="kb_workspace/parsed/demo",
    output_dir="kb_workspace/vector/demo",
    overwrite=True,
)

mgr = KBManager()
mgr.add_kb("demo", db_path)
print(mgr.query_json("What is the core method?", top_k=3))

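Models and parsing strategy

1) Embedding model (vectorization)

You can pass embedding_model explicitly at both build time and query time; we recommend using the same model on both sides: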

from kbase import build_kb_from_pdfs, load_kb

pipeline = build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
    embedding_model="all-MiniLM-L6-v2",  # a local model directory also works
)

mgr = load_kb(
    kb_path=pipeline.kb_path,
    kb_name="demo",
    embedding_model="all-MiniLM-L6-v2",
)

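Notes:

If sentence-transformers is not installed or the model is unavailable, kbase automatically falls back to a hash embedding (it still runs, but semantic quality is weaker).
In offline environments, use a local model path.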
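
The hash embedding fallback is not described in detail here. As a rough illustration of the general idea (feature hashing), and not kbase's actual implementation, the sketch below maps tokens to buckets of a fixed-size vector. The names hash_embed and cosine and the dimension of 64 are assumptions made for this example.

```python
import hashlib
import math


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Illustrative feature-hashing embedding (hypothetical, not kbase's code)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first four bytes pick the bucket, the fifth byte picks the sign.
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    # L2-normalize so a dot product behaves like cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Vectors built this way only reflect token overlap, which is one way to see why a fallback of this kind retrieves less well than a sentence-transformers model.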

2) PDF parsing strategy

Choose a strategy with parser_backend and image_backend:

from kbase import parse_pdfs

stats = parse_pdfs(
    input_dir="papers_to_parser",
    output_dir="kb_workspace/parsed/demo",
    parser_backend="docling",      # auto | docling | pymupdf
    image_backend="pymupdf",       # auto | pymupdf | pypdfium2 | none
    extract_images=True,
)

Recommendations:

Speed first: parser_backend=pymupdf, image_backend=none
General default: parser_backend=auto, image_backend=auto
Complex layouts first: parser_backend=docling, image_backend=auto
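
One way to keep these recommendations in a single place is a small preset table. This is a sketch only: the preset labels ("fast", "default", "complex_layout") and the helper backend_kwargs are hypothetical and not part of kbase; the backend values come from the list above.

```python
# Backend presets taken from the recommendations above.
# The preset names are hypothetical labels chosen for this example.
BACKEND_PRESETS = {
    "fast": {"parser_backend": "pymupdf", "image_backend": "none"},
    "default": {"parser_backend": "auto", "image_backend": "auto"},
    "complex_layout": {"parser_backend": "docling", "image_backend": "auto"},
}


def backend_kwargs(preset: str) -> dict[str, str]:
    """Return a copy of the keyword arguments for the chosen preset."""
    if preset not in BACKEND_PRESETS:
        raise ValueError(
            f"unknown preset {preset!r}; choose from {sorted(BACKEND_PRESETS)}"
        )
    return dict(BACKEND_PRESETS[preset])
```

Assuming parse_pdfs accepts these keywords as in the example above, the result could be passed along as parse_pdfs(input_dir="papers_to_parser", output_dir="kb_workspace/parsed/demo", **backend_kwargs("fast")).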

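Docling model list (images, tables, etc.)

With parser_backend="docling", Docling may load the following model groups depending on the scenario (the names printed by docling-tools models are authoritative):

layout: layout analysis (paragraphs, headings, region detection)
tableformer: table structure recognition (the core model for tables)
code_formula: recognition of code blocks and formulas
picture_classifier: classification of pictures and image regions
rapidocr: OCR model (text recognition in images)
easyocr: alternative OCR model
smolvlm: lightweight vision-language model (some multimodal scenarios)
smoldocling / smoldocling_mlx: lightweight Docling model variants
granitedocling / granitedocling_mlx: Docling Granite-series variants
granite_vision: Granite vision model capabilities

Notes:

You do not need to download every model; pick the ones your task requires.
For text extraction only, start with layout; if you need tables, add at least tableformer; add rapidocr or easyocr only when you need OCR.

Example: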

docling-tools models download --models layout tableformer picture_classifier rapidocr --output-dir ./docling_models

Setting the Docling model directory

You can set Docling's artifacts_path explicitly so that models are loaded from a specified directory.

Python API

from kbase import parse_pdfs, build_kb_from_pdfs

parse_pdfs(
    input_dir="papers_to_parser",
    output_dir="kb_workspace/parsed/demo",
    parser_backend="docling",
    docling_artifacts_path="D:/models/docling",
)

build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
    parser_backend="docling",
    docling_artifacts_path="D:/models/docling",
)

CLI

kbase parse --input papers_to_parser --output kb_workspace/parsed/demo --parser-backend docling --docling-artifacts-path D:/models/docling --json
kbase parse-one --file papers_to_parser/paper_a.pdf --docling-artifacts-path D:/models/docling --json
kbase build --name demo_kb --input papers_to_parser --output kb_workspace --parser-backend docling --docling-artifacts-path D:/models/docling --json

Configuration file (JSON/YAML)

{
  "parse": {
    "input": "papers_to_parser",
    "output": "kb_workspace/parsed/demo",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  },
  "parse_one": {
    "file": "papers_to_parser/paper_a.pdf",
    "output": "kb_workspace/parsed/single",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  },
  "build": {
    "name": "demo_kb",
    "input": "papers_to_parser",
    "output": "kb_workspace",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  }
}
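
How kbase consumes this file is not shown here. The sketch below only illustrates how the sections line up with the CLI commands above, assuming each key maps one-to-one to a flag (underscores become hyphens) and each section name maps to a subcommand (parse_one becomes parse-one). The helper section_to_argv is hypothetical and does not execute anything.

```python
import json


def section_to_argv(config: dict, section: str) -> list[str]:
    """Turn one config section into a kbase argv list (assumed key-to-flag mapping)."""
    argv = ["kbase", section.replace("_", "-")]
    for key, value in config[section].items():
        argv += [f"--{key.replace('_', '-')}", str(value)]
    return argv


config = json.loads("""
{
  "parse_one": {
    "file": "papers_to_parser/paper_a.pdf",
    "output": "kb_workspace/parsed/single",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  }
}
""")

print(section_to_argv(config, "parse_one"))
```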

CLI usage

1) Parse a directory in batch

kbase parse --input papers_to_parser --output kb_workspace/parsed/demo --json

2) Parse a single PDF (new)

By default, output is written to the same directory as the PDF:

kbase parse-one --file papers_to_parser/paper_a.pdf --json

Write output to a specified directory:

kbase parse-one --file papers_to_parser/paper_a.pdf --output kb_workspace/parsed/single --json

3) Build a KB in one step

kbase build --name demo_kb --input papers_to_parser --output kb_workspace --json

4) Query

kbase query --db kb_workspace/vector_db/demo_kb/chroma_store --q "What is the key method?" --top-k 5 --json

5) Offline evaluation (optional)

kbase eval --db kb_workspace/vector_db/demo_kb/chroma_store --kb-name demo --input eval_set.jsonl --top-k 5 --json

Result formats

KBManager.query(): XML string (compatible with existing agent integrations).
KBManager.query_json(): structured JSON (suited to regular Python programs).
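
The exact schema of either format is not documented here, so the following is only a sketch of how a caller might handle both shapes with the standard library. The sample payloads and the helpers as_python and xml_texts are hypothetical; the real field and element names may differ, and xml_texts assumes the XML string has a single root element.

```python
import json
import xml.etree.ElementTree as ET


def as_python(result):
    """Decode a JSON string if needed; pass dicts and lists through unchanged."""
    return json.loads(result) if isinstance(result, str) else result


def xml_texts(xml_string: str) -> list[str]:
    """Collect the non-empty text of every element in an XML string."""
    root = ET.fromstring(xml_string)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


# Hypothetical payloads, for illustration only; the real schema may differ.
sample_json = '{"results": [{"text": "chunk A", "score": 0.9}]}'
sample_xml = "<results><result>chunk A</result><result>chunk B</result></results>"

print(as_python(sample_json)["results"][0]["text"])
print(xml_texts(sample_xml))
```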

Typical directory layout

kb_workspace/
  parsed/
    demo_kb/
      01_paper_20260309_120000/
        paper_text.md
        paper_parsed.json
  vector_db/
    demo_kb/
      chroma_store/
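
Given this layout, the parsed output for each paper can be located with pathlib. This is a minimal sketch that assumes the folder and file names shown above (parsed/<kb name>/<paper folder>/paper_text.md and paper_parsed.json); list_parsed_papers is a hypothetical helper, not a kbase function.

```python
from pathlib import Path


def list_parsed_papers(workspace: str, kb_name: str) -> list[dict[str, Path]]:
    """Find per-paper folders that contain both parsed output files."""
    parsed_root = Path(workspace) / "parsed" / kb_name
    if not parsed_root.is_dir():
        return []
    papers = []
    for folder in sorted(p for p in parsed_root.iterdir() if p.is_dir()):
        text_file = folder / "paper_text.md"
        json_file = folder / "paper_parsed.json"
        # Skip folders where parsing did not produce both files.
        if text_file.is_file() and json_file.is_file():
            papers.append({"folder": folder, "text": text_file, "json": json_file})
    return papers
```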

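Notes

knowledge_db/all-MiniLM-L6-v2 is not bundled in the PyPI artifacts.
The default embedding model is loaded on demand; in offline environments, pass a local model path or rely on the hash embedding fallback.
For reproducible builds, pin chunk_size, chunk_overlap, and embedding_model.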
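
To see why pinning chunk_size and chunk_overlap matters, consider a simple sliding-window chunker. This is an illustrative sketch, not kbase's implementation: it assumes both parameters are measured in characters, which the notes above do not state, and chunk_text is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Changing either value shifts every chunk boundary, so the stored vectors and retrieval results change with it.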

For more tutorials, see docs/kbase_tutorial.md

This version: 0.1.0 (released Mar 9, 2026)

Download files

Source Distribution

klynxbase-0.1.0.tar.gz (26.8 kB view details)

Uploaded Mar 9, 2026 Source

Built Distribution

klynxbase-0.1.0-py3-none-any.whl (26.8 kB view details)

Uploaded Mar 9, 2026 Python 3

File details

Details for the file klynxbase-0.1.0.tar.gz.

File metadata

Download URL: klynxbase-0.1.0.tar.gz
Upload date: Mar 9, 2026
Size: 26.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for klynxbase-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`90f40e47997a246d271a5c9c0dbb9f6991b74761da5a97f176cc06507fe03132`
MD5	`d6a2e556efbf0927f336d0dbc551acb4`
BLAKE2b-256	`611a9471244350e8626bca48221a588085474c27e99884d897ed94895743c7ae`

File details

Details for the file klynxbase-0.1.0-py3-none-any.whl.

File metadata

Download URL: klynxbase-0.1.0-py3-none-any.whl
Upload date: Mar 9, 2026
Size: 26.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for klynxbase-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aee02f860df88a8a6ab491bf6974056222ccdc30c8dce17ab2352fefe6480f3c`
MD5	`b4099b50f1fa4f1aa3fdf932a96f128b`
BLAKE2b-256	`81620ab5ca305abb9a721a4891f48832aef399d55d6e37762deefe28dbe95f98`
