PyPI-friendly local knowledge base toolkit: parse PDFs, build vector stores, and query with a simple API.
kbase
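kbase is a local knowledge base toolkit distributed on PyPI. It aims to cover three tasks with as little code as possible:
- Parse PDFs
- Build a vector store
- Run retrieval queries
Requires Python >= 3.11.
Installation
Minimal install (default parsing pipeline):
pip install klynxbase
Enable enhanced PDF parsing with Docling:
pip install "klynxbase[pdf]"
Enable sentence-transformers embeddings:
pip install "klynxbase[embed]"
All features:
pip install "klynxbase[full]"
Minimal runnable example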
from kbase import build_kb_from_pdfs, load_kb
pipeline = build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
)
mgr = load_kb(kb_path=pipeline.kb_path, kb_name="demo")
print(mgr.query("What are the key conclusions of these documents?", top_k=3))
Step-by-step API (compatible with legacy calls)
from kbase import parse_pdfs, parse_single_pdf, create_vector_kb, KBManager
# Parse every PDF in a directory
parse_pdfs("papers_to_parser", "kb_workspace/parsed/demo")
# Parse a single PDF (output_dir is optional; without it the result is only returned in memory and nothing is written to disk)
parse_single_pdf(
    file_path="papers_to_parser/paper_a.pdf",
    output_dir="kb_workspace/parsed/single",  # optional
)
db_path = create_vector_kb(
    name="demo_kb",
    input_dir="kb_workspace/parsed/demo",
    output_dir="kb_workspace/vector/demo",
    overwrite=True,
)
mgr = KBManager()
mgr.add_kb("demo", db_path)
print(mgr.query_json("What is the core method?", top_k=3))
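Models and parsing strategy
1) Embedding model (vectorization)
You can pass embedding_model explicitly at both build time and query time. It is recommended to use the same model on both sides, since vectors produced by different models are not comparable:
from kbase import build_kb_from_pdfs, load_kb
pipeline = build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
    embedding_model="all-MiniLM-L6-v2",  # a local model directory also works
)
mgr = load_kb(
    kb_path=pipeline.kb_path,
    kb_name="demo",
    embedding_model="all-MiniLM-L6-v2",
)
Notes:
- If sentence-transformers is not installed or the model is unavailable, kbase automatically falls back to a hash embedding (it still runs, but semantic quality is weaker).
- In offline environments, use a local model path.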
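To give a feel for why a hash embedding runs anywhere but carries little semantics, here is a minimal sketch of the general technique (the hashing trick). It is not kbase's implementation, which this page does not show. The function names `hash_embed` and `cosine`, the 256-dimension default, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map text to a fixed-size, L2-normalised bag-of-tokens vector.

    Illustrative only: each token is hashed to one bucket with a sign,
    so two texts are similar only when they share literal tokens.
    """
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty text (or fully cancelled buckets) stays a zero vector.
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Because similarity here depends on shared tokens only, synonyms such as "car" and "automobile" look unrelated, which matches the weaker semantic quality mentioned in the note above.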
2) PDF parsing strategy
Choose a strategy with parser_backend and image_backend:
from kbase import parse_pdfs
stats = parse_pdfs(
    input_dir="papers_to_parser",
    output_dir="kb_workspace/parsed/demo",
    parser_backend="docling",  # auto | docling | pymupdf
    image_backend="pymupdf",  # auto | pymupdf | pypdfium2 | none
    extract_images=True,
)
Recommendations:
- Speed first: parser_backend=pymupdf, image_backend=none
- General default: parser_backend=auto, image_backend=auto
- Complex layouts first: parser_backend=docling, image_backend=auto
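The three recommendations above can be kept in one place as named presets. The sketch below is a hypothetical helper, not part of kbase: the profile names (`speed`, `default`, `complex_layout`) and the function `backend_options` are invented here, and it assumes the backends are passed as plain strings, including the string "none".

```python
# Hypothetical presets mirroring the recommendations listed above.
BACKEND_PROFILES = {
    "speed": {"parser_backend": "pymupdf", "image_backend": "none"},
    "default": {"parser_backend": "auto", "image_backend": "auto"},
    "complex_layout": {"parser_backend": "docling", "image_backend": "auto"},
}


def backend_options(profile: str = "default") -> dict[str, str]:
    """Return a copy of the keyword arguments for the given profile."""
    try:
        return dict(BACKEND_PROFILES[profile])
    except KeyError:
        known = ", ".join(sorted(BACKEND_PROFILES))
        raise ValueError(f"unknown profile {profile!r}; expected one of: {known}") from None
```

With kbase installed, the result could then be unpacked into the call, for example `parse_pdfs("papers_to_parser", "kb_workspace/parsed/demo", **backend_options("speed"))`.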
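Docling model list (images, tables, etc.)
With parser_backend="docling", Docling may load the following model groups depending on the scenario (the names printed by docling-tools models are authoritative):
- layout: layout analysis (paragraphs, headings, region detection)
- tableformer: table structure recognition (the core model for tables)
- code_formula: recognition of code blocks and formulas
- picture_classifier: classification of pictures and image regions
- rapidocr: OCR model (text recognition in images)
- easyocr: alternative OCR model
- smolvlm: lightweight vision-language model (some multimodal scenarios)
- smoldocling / smoldocling_mlx: lightweight Docling model variants
- granitedocling / granitedocling_mlx: Docling Granite-series variants
- granite_vision: Granite vision model capabilities
Notes:
- Not every model has to be downloaded; pick what your task needs.
- For text extraction only, start with layout; if you need tables, add at least tableformer; add rapidocr or easyocr only when OCR is required.
Example:
docling-tools models download --models layout tableformer picture_classifier rapidocr --output-dir ./docling_models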
Setting the Docling model directory
You can set Docling's artifacts_path explicitly so that models are loaded from a directory you specify.
Python API
from kbase import parse_pdfs, build_kb_from_pdfs
parse_pdfs(
    input_dir="papers_to_parser",
    output_dir="kb_workspace/parsed/demo",
    parser_backend="docling",
    docling_artifacts_path="D:/models/docling",
)
build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
    parser_backend="docling",
    docling_artifacts_path="D:/models/docling",
)
CLI
kbase parse --input papers_to_parser --output kb_workspace/parsed/demo --parser-backend docling --docling-artifacts-path D:/models/docling --json
kbase parse-one --file papers_to_parser/paper_a.pdf --docling-artifacts-path D:/models/docling --json
kbase build --name demo_kb --input papers_to_parser --output kb_workspace --parser-backend docling --docling-artifacts-path D:/models/docling --json
Configuration file (json/yaml)
{
  "parse": {
    "input": "papers_to_parser",
    "output": "kb_workspace/parsed/demo",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  },
  "parse_one": {
    "file": "papers_to_parser/paper_a.pdf",
    "output": "kb_workspace/parsed/single",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  },
  "build": {
    "name": "demo_kb",
    "input": "papers_to_parser",
    "output": "kb_workspace",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  }
}
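If you prefer to drive the Python API from the same file, the sections can be turned into keyword arguments. The sketch below is an assumption-heavy illustration: the key-to-keyword mapping (`input` to `input_dir`, `output` to `output_dir`, `file` to `file_path`) is inferred from the Python examples on this page, `section_kwargs` is a hypothetical helper, and how kbase itself reads the file is not documented here.

```python
import json

# Assumed mapping from config keys to Python keyword names (inferred, not documented).
KEY_TO_KWARG = {"input": "input_dir", "output": "output_dir", "file": "file_path"}


def section_kwargs(config: dict, section: str) -> dict:
    """Translate one config section into keyword arguments."""
    if section not in config:
        raise KeyError(f"config has no {section!r} section")
    return {KEY_TO_KWARG.get(key, key): value for key, value in config[section].items()}


config = json.loads("""
{
  "parse": {
    "input": "papers_to_parser",
    "output": "kb_workspace/parsed/demo",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  }
}
""")

parse_kwargs = section_kwargs(config, "parse")
```

With kbase installed, `parse_pdfs(**parse_kwargs)` would then mirror the `parse_pdfs` call in the Python API example above. The `parse_one` section is left out on purpose, because this page does not confirm which of its keys `parse_single_pdf` accepts.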
CLI usage
1) Parse a directory in batch
kbase parse --input papers_to_parser --output kb_workspace/parsed/demo --json
2) Parse a single PDF (new)
By default, output goes to the same directory as the PDF:
kbase parse-one --file papers_to_parser/paper_a.pdf --json
Write output to a specific directory:
kbase parse-one --file papers_to_parser/paper_a.pdf --output kb_workspace/parsed/single --json
3) One-step KB build
kbase build --name demo_kb --input papers_to_parser --output kb_workspace --json
4) Query
kbase query --db kb_workspace/vector_db/demo_kb/chroma_store --q "What is the key method?" --top-k 5 --json
5) Offline evaluation (optional)
kbase eval --db kb_workspace/vector_db/demo_kb/chroma_store --kb-name demo --input eval_set.jsonl --top-k 5 --json
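When calling the CLI from a script, building the argument list explicitly avoids shell-quoting problems with questions that contain spaces or quotes. The sketch below wraps the query command shown above; `query_argv` and `run_query` are hypothetical helpers, and `run_query` assumes that `--json` makes the command print a single JSON document on stdout, which this page implies but does not spell out.

```python
import json
import subprocess


def query_argv(db: str, question: str, top_k: int = 5) -> list[str]:
    """Build the argv for `kbase query`, using only the flags shown above."""
    return ["kbase", "query", "--db", db, "--q", question,
            "--top-k", str(top_k), "--json"]


def run_query(db: str, question: str, top_k: int = 5):
    """Run the CLI and decode stdout (assumed to be one JSON document)."""
    done = subprocess.run(query_argv(db, question, top_k),
                          capture_output=True, text=True, check=True)
    return json.loads(done.stdout)
```

`check=True` turns a non-zero exit code into a `subprocess.CalledProcessError`, so failures surface as exceptions instead of empty results.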
Result formats
- KBManager.query(): XML string (compatible with existing agent integrations).
- KBManager.query_json(): structured JSON (suited to regular Python programs).
Typical directory layout
kb_workspace/
  parsed/
    demo_kb/
      01_paper_20260309_120000/
        paper_text.md
        paper_parsed.json
  vector_db/
    demo_kb/
      chroma_store/
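Given this layout, the path that the query command expects for `--db` and the list of parsed documents can be derived from the workspace root. The helpers below are a sketch that relies only on the directory names shown above; `chroma_store_path` and `parsed_documents` are hypothetical names, not kbase functions.

```python
from pathlib import Path


def chroma_store_path(workspace: str, kb_name: str) -> Path:
    """Location of the vector store for a KB, following the layout above."""
    return Path(workspace) / "vector_db" / kb_name / "chroma_store"


def parsed_documents(workspace: str, kb_name: str) -> list[Path]:
    """Per-document folders under parsed/<kb_name>/ that contain paper_text.md."""
    root = Path(workspace) / "parsed" / kb_name
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if (p / "paper_text.md").is_file())
```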
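Notes
- knowledge_db/all-MiniLM-L6-v2 is not bundled into the PyPI artifacts.
- The default embedding model is loaded on demand; in offline environments, pass a local model path or rely on the hash embedding fallback.
- For reproducible builds, pin chunk_size / chunk_overlap / embedding_model.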
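To see why pinning chunk_size and chunk_overlap matters, consider a simple sliding-window chunker: changing either value moves every chunk boundary, and therefore changes every stored vector. The sketch below counts characters and is only an illustration; kbase's actual chunking (unit, defaults, splitting rules) is not described on this page, and `chunk_text` with its default values is a hypothetical function.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

For instance, splitting the same ten-character string with an overlap of 2 versus 0 yields a different number of chunks with different contents, so two builds with different settings are not comparable.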
For more tutorials, see docs/kbase_tutorial.md