PyPI-friendly local knowledge base toolkit: parse PDFs, build vector stores, and query with a simple API.

kbase

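kbase is a PyPI-friendly local knowledge base library. Its goal is to accomplish three things with minimal code:

  1. Parse PDFs
  2. Build a vector store
  3. Run retrieval queries

Required Python version: >=3.11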

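Installation

Minimal install (default parsing pipeline):

pip install klynxbase

Enable enhanced PDF parsing with Docling:

pip install "klynxbase[pdf]"

Enable sentence-transformers embeddings:

pip install "klynxbase[embed]"

Full feature set:

pip install "klynxbase[full]"

Minimal runnable example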

from kbase import build_kb_from_pdfs, load_kb

pipeline = build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
)

mgr = load_kb(kb_path=pipeline.kb_path, kb_name="demo")
print(mgr.query("What are the core conclusions of these documents?", top_k=3))

Step-by-step API (compatible with legacy calls)

from kbase import parse_pdfs, parse_single_pdf, create_vector_kb, KBManager

# Parse every PDF in a directory in batch
parse_pdfs("papers_to_parser", "kb_workspace/parsed/demo")

# Parse a single PDF (output_dir may be omitted; the result is then returned in memory only and not written to disk)
parse_single_pdf(
    file_path="papers_to_parser/paper_a.pdf",
    output_dir="kb_workspace/parsed/single",  # optional
)

db_path = create_vector_kb(
    name="demo_kb",
    input_dir="kb_workspace/parsed/demo",
    output_dir="kb_workspace/vector/demo",
    overwrite=True,
)

mgr = KBManager()
mgr.add_kb("demo", db_path)
print(mgr.query_json("What is the core method?", top_k=3))

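Models and parsing strategies

1) Embedding model (vectorization)

embedding_model can be passed explicitly at both build time and query time. Keep the two consistent, since query vectors are only comparable with stored vectors produced by the same model: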

from kbase import build_kb_from_pdfs, load_kb

pipeline = build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
    embedding_model="all-MiniLM-L6-v2",  # a local model directory also works
)

mgr = load_kb(
    kb_path=pipeline.kb_path,
    kb_name="demo",
    embedding_model="all-MiniLM-L6-v2",
)

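Notes:

  1. If sentence-transformers is not installed or the model is unavailable, kbase automatically falls back to hash embedding (it still runs, but semantic quality is weaker).
  2. In offline environments, a local model path is recommended.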
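
To make the fallback in note 1 concrete, here is a minimal sketch of what a hash-based embedding can look like. It illustrates the general feature-hashing idea and is not kbase's actual implementation; the names hash_embed and cosine and the 64-dimension size are assumptions chosen for the example. The vector depends only on which tokens occur, so two texts score as similar only when they share words, which is why semantic quality is weaker than with a trained model.

```python
import hashlib
import math


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed-size unit vector via feature hashing (illustrative only)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim   # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0        # signed hashing reduces collision bias
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """For unit vectors, cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(a, b))


query = hash_embed("vector store query")
same_words = hash_embed("query vector store")
print(round(cosine(query, same_words), 6))
```

Word order does not affect the result here, and synonyms that share no tokens get no credit, which are the two main limits of this kind of fallback.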

2) PDF parsing strategy

Choose a strategy via parser_backend and image_backend:

from kbase import parse_pdfs

stats = parse_pdfs(
    input_dir="papers_to_parser",
    output_dir="kb_workspace/parsed/demo",
    parser_backend="docling",      # auto | docling | pymupdf
    image_backend="pymupdf",       # auto | pymupdf | pypdfium2 | none
    extract_images=True,
)

Recommendations:

  1. Speed first: parser_backend=pymupdf, image_backend=none
  2. General-purpose default: parser_backend=auto, image_backend=auto
  3. Complex layouts first: parser_backend=docling, image_backend=auto
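
One way to keep these recommendations in a single place is a small preset table. The sketch below is only an illustration: BACKEND_PRESETS and backend_kwargs are hypothetical names that are not part of kbase, and the values are copied from the three recommendations above.

```python
# Hypothetical helper, not part of kbase: maps a priority to backend settings.
BACKEND_PRESETS = {
    "fast": {"parser_backend": "pymupdf", "image_backend": "none"},
    "default": {"parser_backend": "auto", "image_backend": "auto"},
    "complex_layout": {"parser_backend": "docling", "image_backend": "auto"},
}


def backend_kwargs(priority: str) -> dict[str, str]:
    """Return a copy of the keyword arguments for the given priority."""
    try:
        return dict(BACKEND_PRESETS[priority])
    except KeyError:
        expected = sorted(BACKEND_PRESETS)
        raise ValueError(f"unknown priority {priority!r}; expected one of {expected}") from None


kwargs = backend_kwargs("fast")
print(kwargs)
# With kbase installed, the result could be unpacked into a call such as:
# parse_pdfs(input_dir="papers_to_parser", output_dir="kb_workspace/parsed/demo", **kwargs)
```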

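Docling model list (pictures, tables, etc.)

When parser_backend="docling" is used, Docling may load the following model groups depending on the scenario (the names reported by docling-tools models are authoritative):

  1. layout: layout analysis (paragraph, heading, and region detection)
  2. tableformer: table structure recognition (the core model for tables)
  3. code_formula: recognition of code blocks and formulas
  4. picture_classifier: classification of pictures and image regions
  5. rapidocr: OCR model (text recognition in images)
  6. easyocr: alternative OCR model
  7. smolvlm: lightweight vision-language model (for some multimodal scenarios)
  8. smoldocling / smoldocling_mlx: lightweight Docling model variants
  9. granitedocling / granitedocling_mlx: Docling Granite-series variants
  10. granite_vision: Granite vision model capabilities

Notes:

  1. Not every model has to be downloaded; pick the ones your task needs.
  2. For text extraction only, start with layout; if you need tables, add at least tableformer; add rapidocr or easyocr only when you need OCR.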

Example:

docling-tools models download --models layout tableformer picture_classifier rapidocr --output-dir ./docling_models
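
The selection rule in the notes (layout first, tableformer for tables, an OCR model only when needed) can be sketched as a helper that assembles the download command. The function name docling_download_command and its parameters are hypothetical; the command shape simply mirrors the example above, and the sketch only prints the command rather than running it.

```python
import shlex


def docling_download_command(
    need_tables: bool = False,
    need_ocr: bool = False,
    output_dir: str = "./docling_models",
) -> list[str]:
    """Assemble a docling-tools download command for the models a task needs."""
    models = ["layout"]                 # text extraction baseline
    if need_tables:
        models.append("tableformer")    # table structure recognition
    if need_ocr:
        models.append("rapidocr")       # easyocr is the alternative named above
    return ["docling-tools", "models", "download", "--models", *models, "--output-dir", output_dir]


cmd = docling_download_command(need_tables=True, need_ocr=True)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once docling-tools is installed.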

Setting the Docling model directory

You can set Docling's artifacts_path explicitly so that the program loads models from a specified directory.

Python API

from kbase import parse_pdfs, build_kb_from_pdfs

parse_pdfs(
    input_dir="papers_to_parser",
    output_dir="kb_workspace/parsed/demo",
    parser_backend="docling",
    docling_artifacts_path="D:/models/docling",
)

build_kb_from_pdfs(
    name="demo_kb",
    input_dir="papers_to_parser",
    output_dir="kb_workspace",
    parser_backend="docling",
    docling_artifacts_path="D:/models/docling",
)

CLI

kbase parse --input papers_to_parser --output kb_workspace/parsed/demo --parser-backend docling --docling-artifacts-path D:/models/docling --json
kbase parse-one --file papers_to_parser/paper_a.pdf --docling-artifacts-path D:/models/docling --json
kbase build --name demo_kb --input papers_to_parser --output kb_workspace --parser-backend docling --docling-artifacts-path D:/models/docling --json

Configuration file (json/yaml)

{
  "parse": {
    "input": "papers_to_parser",
    "output": "kb_workspace/parsed/demo",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  },
  "parse_one": {
    "file": "papers_to_parser/paper_a.pdf",
    "output": "kb_workspace/parsed/single",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  },
  "build": {
    "name": "demo_kb",
    "input": "papers_to_parser",
    "output": "kb_workspace",
    "parser_backend": "docling",
    "docling_artifacts_path": "D:/models/docling"
  }
}
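
The configuration above repeats parser_backend and docling_artifacts_path in every section, so one option is to generate the file from shared values. This is a sketch only: the file name kbase_config.json is an assumption, and since this page does not show how kbase consumes the file, the example stops at writing it and reading it back.

```python
import json
import tempfile
from pathlib import Path

# Settings shared by all three sections of the config shown above.
shared = {"parser_backend": "docling", "docling_artifacts_path": "D:/models/docling"}

config = {
    "parse": {"input": "papers_to_parser", "output": "kb_workspace/parsed/demo", **shared},
    "parse_one": {"file": "papers_to_parser/paper_a.pdf", "output": "kb_workspace/parsed/single", **shared},
    "build": {"name": "demo_kb", "input": "papers_to_parser", "output": "kb_workspace", **shared},
}

# Written to a temporary directory here; the file name is a placeholder.
config_path = Path(tempfile.mkdtemp()) / "kbase_config.json"
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Read it back and confirm every section points at the same model directory.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
artifact_paths = {section["docling_artifacts_path"] for section in loaded.values()}
print(artifact_paths)
```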

CLI usage

1) Parse a directory in batch

kbase parse --input papers_to_parser --output kb_workspace/parsed/demo --json

2) Parse a single PDF (new)

By default, output goes to the same directory as the PDF:

kbase parse-one --file papers_to_parser/paper_a.pdf --json

Write output to a specified directory:

kbase parse-one --file papers_to_parser/paper_a.pdf --output kb_workspace/parsed/single --json

3) Build a KB in one step

kbase build --name demo_kb --input papers_to_parser --output kb_workspace --json

4) Query

kbase query --db kb_workspace/vector_db/demo_kb/chroma_store --q "What is the key method?" --top-k 5 --json

5) Offline evaluation (optional)

kbase eval --db kb_workspace/vector_db/demo_kb/chroma_store --kb-name demo --input eval_set.jsonl --top-k 5 --json

Result formats

  1. KBManager.query(): XML string (compatible with existing agent integrations).
  2. KBManager.query_json(): structured JSON (suited to regular Python programs).

Typical directory layout

kb_workspace/
  parsed/
    demo_kb/
      01_paper_20260309_120000/
        paper_text.md
        paper_parsed.json
  vector_db/
    demo_kb/
      chroma_store/
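
Given this layout, the parsed documents of a knowledge base can be enumerated with pathlib alone. The sketch below recreates the tree in a temporary directory so it runs without kbase; list_parsed_documents is a hypothetical helper, and it assumes each per-document folder contains a paper_parsed.json file as shown above.

```python
import tempfile
from pathlib import Path


def list_parsed_documents(workspace: Path, kb_name: str) -> list[Path]:
    """Return the per-document folders that contain a paper_parsed.json file."""
    parsed_root = workspace / "parsed" / kb_name
    return sorted(p.parent for p in parsed_root.glob("*/paper_parsed.json"))


# Recreate the layout shown above with placeholder files.
workspace = Path(tempfile.mkdtemp()) / "kb_workspace"
doc_dir = workspace / "parsed" / "demo_kb" / "01_paper_20260309_120000"
doc_dir.mkdir(parents=True)
(doc_dir / "paper_text.md").write_text("# placeholder\n", encoding="utf-8")
(doc_dir / "paper_parsed.json").write_text("{}", encoding="utf-8")
(workspace / "vector_db" / "demo_kb" / "chroma_store").mkdir(parents=True)

docs = list_parsed_documents(workspace, "demo_kb")
print([d.name for d in docs])
```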

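Caveats

  1. knowledge_db/all-MiniLM-L6-v2 is not bundled in the PyPI distribution.
  2. The default embedding model is loaded on demand; in offline environments, pass a local model path or rely on the hash embedding fallback.
  3. For reproducible builds, pin chunk_size, chunk_overlap, and embedding_model.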
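
To see why chunk_size and chunk_overlap affect reproducibility, consider a plain sliding-window splitter. This is a character-based sketch for illustration and makes no claim about how kbase actually chunks text; chunk_text is a hypothetical function. Different settings produce different chunks, so the stored vectors and retrieval results differ as well.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap   # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "abcdefghij"
print(chunk_text(text, chunk_size=4, chunk_overlap=2))
print(chunk_text(text, chunk_size=5, chunk_overlap=0))
```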

For more tutorials, see docs/kbase_tutorial.md
