Storage for documents
doc_store
DocStore is a client library for managing documents, pages, layouts, and content. This document describes the complete DocClient API.
Installation and configuration
from doc_store import DocClient
Quick start
from doc_store import DocClient
# Create a client instance
client = DocClient(server_url="http://localhost:8000")
# Fetch a document
doc = client.get_doc("doc-xxx")
# Iterate over its pages
for page in doc.pages:
    print(page.id, page.image_path)
# Close the client
client.close()
# Or use a context manager
with DocClient(server_url="http://localhost:8000") as client:
    doc = client.get_doc("doc-xxx")
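Data model
Doc
A Doc is the top-level data structure and represents one PDF file.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Unique document identifier |
| `rid` | `int` | Database row ID |
| `create_time` | `int \| None` | Creation timestamp |
| `update_time` | `int \| None` | Update timestamp |
| `pdf_path` | `str` | S3 path of the PDF file |
| `pdf_filename` | `str \| None` | PDF file name |
| `pdf_filesize` | `int` | PDF file size (bytes) |
| `pdf_hash` | `str` | SHA256 hash of the PDF file |
| `num_pages` | `int` | Number of pages in the PDF |
| `page_width` | `float` | Page width |
| `page_height` | `float` | Page height |
| `metadata` | `dict` | PDF metadata |
| `orig_path` | `str \| None` | Path of the original file (e.g. Word/PPT) |
| `orig_filesize` | `int \| None` | Size of the original file |
| `orig_filename` | `str \| None` | Name of the original file |
| `orig_hash` | `str \| None` | Hash of the original file |
| `tags` | `list[str]` | List of tags |
| `attrs` | `dict[str, AttrValueType]` | Attribute dictionary |
| `metrics` | `dict[str, MetricValueType]` | Metric dictionary |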
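The `pdf_hash` attribute is described as a SHA256 hash of the PDF file. As a minimal sketch, the digest of a local file can be computed with the standard library before looking a document up by hash. The helper name `sha256_of_file` is hypothetical, and it is an assumption that DocStore stores the digest as a lowercase hex string.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large PDFs are never fully loaded in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for a real PDF.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4 demo")
    demo_path = tmp.name

demo_hash = sha256_of_file(demo_path)
os.remove(demo_path)
print(demo_hash)
```

If the server uses the same encoding, the result could be passed to a lookup such as `try_get_doc_by_pdf_hash` to check whether the file has already been stored.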
Dynamic properties
| Property | Type | Description |
|---|---|---|
| `pdf_bytes` | `bytes` | Binary content of the PDF file |
| `pdf` | `PDFDocument` | PDF document object |
| `pages` | `list[Page]` | All pages of the document (sorted by page index) |
Methods
# Find pages of the document
def find_pages(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Page]
# Insert a page
def insert_page(self, page_idx: int, page_input: DocPageInput) -> Page
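Page
A Page represents a single page of a document.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Unique page identifier |
| `rid` | `int` | Database row ID |
| `create_time` | `int \| None` | Creation timestamp |
| `update_time` | `int \| None` | Update timestamp |
| `doc_id` | `str \| None` | ID of the owning document |
| `page_idx` | `int \| None` | Page index |
| `image_path` | `str` | S3 path of the page image |
| `image_filesize` | `int` | Image file size (bytes) |
| `image_hash` | `str` | SHA256 hash of the image |
| `image_width` | `int` | Image width (pixels) |
| `image_height` | `int` | Image height (pixels) |
| `image_dpi` | `int \| None` | Image DPI |
| `providers` | `list[str]` | Providers that have already processed the page |
| `old_image_path` | `str \| None` | Old image path from before migration |
| `tags` | `list[str]` | List of tags |
| `attrs` | `dict[str, AttrValueType]` | Attribute dictionary |
| `metrics` | `dict[str, MetricValueType]` | Metric dictionary |
Dynamic properties
| Property | Type | Description |
|---|---|---|
| `image_bytes` | `bytes` | Binary content of the page image |
| `image` | `PIL.Image.Image` | Page image object |
| `image_presigned_link` | `str` | Presigned link to the page image (valid for 24 hours) |
| `image_pub_link` | `str` | Public link to the page image |
| `super_block` | `Block` | The page's super block |
| `doc` | `Doc \| None` | The owning document |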
Methods
# Get a layout, returning None if it does not exist
def try_get_layout(self, provider: str, expand: bool = False) -> Layout | None
# Get a layout
def get_layout(self, provider: str, expand: bool = False) -> Layout
# Find layouts
def find_layouts(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Layout]
# Find blocks
def find_blocks(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Block]
# Find contents
def find_contents(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Content]
# Insert a layout
def insert_layout(
    self,
    provider: str,
    layout_input: LayoutInput,
    insert_blocks: bool = False,
    upsert: bool = False
) -> Layout
# Update or insert a layout
def upsert_layout(
    self,
    provider: str,
    layout_input: LayoutInput,
    insert_blocks: bool = False
) -> Layout
# Insert a single block
def insert_block(self, block_input: BlockInput) -> Block
# Insert blocks in bulk
def insert_blocks(self, blocks: list[BlockInput]) -> list[Block]
# Insert a layout made of blocks with content
def insert_content_blocks_layout(
    self,
    provider: str,
    content_blocks: list[ContentBlockInput],
    upsert: bool = False,
) -> Layout
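Layout
A Layout represents the result of structural analysis of a page and contains multiple blocks.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Unique layout identifier |
| `rid` | `int` | Database row ID |
| `create_time` | `int \| None` | Creation timestamp |
| `update_time` | `int \| None` | Update timestamp |
| `page_id` | `str` | ID of the owning page |
| `provider` | `str` | Layout provider identifier |
| `masks` | `list[MaskBlock]` | List of mask blocks |
| `blocks` | `list[Block]` | List of blocks |
| `relations` | `list[dict]` | List of block relations |
| `contents` | `list[Content]` | List of contents |
| `is_human_label` | `bool` | Whether the layout was labeled by a human |
| `tags` | `list[str]` | List of tags |
| `attrs` | `dict[str, AttrValueType]` | Attribute dictionary |
| `metrics` | `dict[str, MetricValueType]` | Metric dictionary |
Dynamic properties
| Property | Type | Description |
|---|---|---|
| `page` | `Page` | The owning page |
| `masked_image` | `PIL.Image.Image` | Page image with masks applied |
| `framed_image` | `PIL.Image.Image` | Page image with block borders drawn |
Methods
# List all content versions
def list_versions(self) -> list[str]
# List all blocks
def list_blocks(self) -> list[Block]
# List the contents of a given version
def list_contents(self, version: str | None = None) -> list[Content]
# Expand the layout (load all blocks and contents)
def expand(self) -> Layout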
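Block
A Block represents a region on a page, such as a title, paragraph, table, or image.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Unique block identifier |
| `rid` | `int` | Database row ID |
| `create_time` | `int \| None` | Creation timestamp |
| `update_time` | `int \| None` | Update timestamp |
| `layout_id` | `str \| None` | ID of the owning layout |
| `provider` | `str \| None` | Block provider identifier |
| `page_id` | `str \| None` | ID of the owning page |
| `type` | `str` | Block type (e.g. title, text, table, image) |
| `bbox` | `list[float]` | Normalized bounding box [x1, y1, x2, y2], each value in the range 0-1 |
| `angle` | `Literal[None, 0, 90, 180, 270]` | Rotation angle |
| `score` | `float \| None` | Detection confidence score |
| `image_path` | `str \| None` | Image path of a standalone block |
| `image_filesize` | `int \| None` | Image file size |
| `image_hash` | `str \| None` | Image hash |
| `image_width` | `int \| None` | Image width |
| `image_height` | `int \| None` | Image height |
| `versions` | `list[str]` | Content versions that already exist for the block |
| `tags` | `list[str]` | List of tags |
| `attrs` | `dict[str, AttrValueType]` | Attribute dictionary |
| `metrics` | `dict[str, MetricValueType]` | Metric dictionary |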
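Because `bbox` is normalized to the 0-1 range, it has to be scaled by the page image size before it can be used for cropping or drawing. The following is a minimal sketch of that conversion; the helper name `bbox_to_pixels` is hypothetical, and it assumes the origin is the top-left corner with x1 <= x2 and y1 <= y2.

```python
def bbox_to_pixels(bbox, image_width, image_height):
    """Scale a normalized [x1, y1, x2, y2] box to integer pixel coordinates."""
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"bbox is not normalized: {bbox}")
    return (
        round(x1 * image_width),
        round(y1 * image_height),
        round(x2 * image_width),
        round(y2 * image_height),
    )


# A block covering part of a 1000 x 2000 pixel page image.
pixel_box = bbox_to_pixels([0.1, 0.25, 0.5, 0.75], image_width=1000, image_height=2000)
print(pixel_box)  # (100, 500, 500, 1500)
```

The resulting tuple follows the (left, upper, right, lower) order that `PIL.Image.Image.crop` expects.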
Dynamic properties
| Property | Type | Description |
|---|---|---|
| `page` | `Page \| None` | The owning page |
| `image` | `PIL.Image.Image` | Block image (cropped from the page, or the standalone image) |
| `image_bytes` | `bytes` | Binary content of the block image |
| `image_pub_link` | `str` | Public link to the block image |
Methods
# Get content, returning None if it does not exist
def try_get_content(self, version: str) -> Content | None
# Get content
def get_content(self, version: str) -> Content
# Find contents
def find_contents(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Content]
# Insert content
def insert_content(
    self,
    version: str,
    content_input: ContentInput,
    upsert: bool = False
) -> Content
# Update or insert content
def upsert_content(self, version: str, content_input: ContentInput) -> Content
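Content
A Content represents the recognition result for a block, such as OCR text or a LaTeX formula.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Unique content identifier |
| `rid` | `int` | Database row ID |
| `create_time` | `int \| None` | Creation timestamp |
| `update_time` | `int \| None` | Update timestamp |
| `page_id` | `str \| None` | ID of the owning page |
| `block_id` | `str` | ID of the owning block |
| `version` | `str` | Content version (e.g. ocr-v1, llm-gpt4) |
| `format` | `str` | Content format (e.g. text, html, latex, markdown) |
| `content` | `str` | Content text |
| `is_human_label` | `bool` | Whether the content was labeled by a human |
| `tags` | `list[str]` | List of tags |
| `attrs` | `dict[str, AttrValueType]` | Attribute dictionary |
| `metrics` | `dict[str, MetricValueType]` | Metric dictionary |
Dynamic properties
| Property | Type | Description |
|---|---|---|
| `page` | `Page \| None` | The owning page |
| `block` | `Block` | The owning block |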
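Value
A Value stores extended data for an element, such as an embedding vector.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Unique value identifier |
| `rid` | `int` | Database row ID |
| `elem_id` | `str` | ID of the owning element |
| `key` | `str` | Key name |
| `type` | `str` | Value type (e.g. ndarray, json) |
| `value` | `Any` | Value payload |
Dynamic properties
| Property | Type | Description |
|---|---|---|
| `elem` | `DocElement` | The owning element |
Methods
# Decode the value (e.g. decode a base64-encoded ndarray)
def decode(self) -> Value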
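Task
A Task is used to process operations on documents asynchronously.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | Unique task identifier |
| `rid` | `int` | Database row ID |
| `target` | `str` | ID of the target element |
| `batch_id` | `str` | Batch ID |
| `command` | `str` | Command name |
| `args` | `dict[str, Any]` | Command arguments |
| `priority` | `int` | Priority |
| `status` | `str` | Task status |
| `create_user` | `str` | User who created the task |
| `update_user` | `str \| None` | User who last updated the task |
| `error_message` | `str \| None` | Error message |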
DocClient API
Initialization
client = DocClient(
    server_url: str | None = None,  # Server URL
    prefix: str = "/api/v1",        # API path prefix
    timeout: int = 300,             # Read timeout (seconds)
    connect_timeout: int = 30,      # Connection timeout (seconds)
    decode_value: bool = True,      # Whether to decode Value objects automatically
)
Health check
# Check the health of the server
def health_check(self, show_stats: bool = False) -> dict
Read operations
Getting a single element
# Get a document
def get_doc(self, doc_id: str) -> Doc
def get_doc_by_pdf_path(self, pdf_path: str) -> Doc
def get_doc_by_pdf_hash(self, pdf_hash: str) -> Doc
# Get a page
def get_page(self, page_id: str) -> Page
def get_page_by_image_path(self, image_path: str) -> Page
def get_page_by_image_hash(self, image_hash: str) -> Page
# Get a layout
def get_layout(self, layout_id: str, expand: bool = False) -> Layout
def get_layout_by_page_id_and_provider(
    self, page_id: str, provider: str, expand: bool = False
) -> Layout
# Get a block
def get_block(self, block_id: str) -> Block
def get_block_by_image_path(self, image_path: str) -> Block
def get_super_block(self, page_id: str) -> Block
# Get content
def get_content(self, content_id: str) -> Content
def get_content_by_block_id_and_version(self, block_id: str, version: str) -> Content
# Get a value
def get_value(self, value_id: str) -> Value
def get_value_by_elem_id_and_key(self, elem_id: str, key: str) -> Value
# Get an evaluation layout or evaluation content
def get_eval_layout(self, eval_layout_id: str) -> EvalLayout
def get_eval_content(self, eval_content_id: str) -> EvalContent
Try methods (return None when the element does not exist)
def try_get(self, elem_id: str) -> DocElement | None
def try_get_doc(self, doc_id: str) -> Doc | None
def try_get_doc_by_pdf_path(self, pdf_path: str) -> Doc | None
def try_get_doc_by_pdf_hash(self, pdf_hash: str) -> Doc | None
def try_get_page(self, page_id: str) -> Page | None
def try_get_page_by_image_path(self, image_path: str) -> Page | None
def try_get_page_by_image_hash(self, image_hash: str) -> Page | None
def try_get_layout(self, layout_id: str, expand: bool = False) -> Layout | None
def try_get_layout_by_page_id_and_provider(
    self, page_id: str, provider: str, expand: bool = False
) -> Layout | None
def try_get_block(self, block_id: str) -> Block | None
def try_get_block_by_image_path(self, image_path: str) -> Block | None
def try_get_content(self, content_id: str) -> Content | None
def try_get_content_by_block_id_and_version(
    self, block_id: str, version: str
) -> Content | None
def try_get_value(self, value_id: str) -> Value | None
def try_get_value_by_elem_id_and_key(self, elem_id: str, key: str) -> Value | None
def try_get_user(self, name: str) -> User | None
def try_get_task(self, task_id: str) -> Task | None
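The difference between a plain getter and its `try_` counterpart can be illustrated with a small stand-in. This is only a sketch of the calling convention, not the library's implementation: `FakeStore` is a hypothetical in-memory substitute for a client, and the `ElementNotFoundError` class defined here is a local stand-in for the library's exception of the same name.

```python
class ElementNotFoundError(Exception):
    """Local stand-in for the library's not-found exception."""


class FakeStore:
    """Hypothetical in-memory store used only for this illustration."""

    def __init__(self, docs):
        self._docs = dict(docs)

    def get_doc(self, doc_id):
        # Raising getter: a missing ID is treated as an error.
        try:
            return self._docs[doc_id]
        except KeyError:
            raise ElementNotFoundError(doc_id) from None

    def try_get_doc(self, doc_id):
        # Non-raising getter: a missing ID yields None instead.
        try:
            return self.get_doc(doc_id)
        except ElementNotFoundError:
            return None


store = FakeStore({"doc-1": {"num_pages": 3}})
found = store.try_get_doc("doc-1")
missing = store.try_get_doc("doc-2")
print(found, missing)
```

The `try_` form tends to read better when absence is an expected outcome, while the raising form suits cases where a missing element indicates a bug.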
Querying elements
# Generic find method (returns a streaming iterator)
def find(
    self,
    elem_type: ElemType | type,                # Element type: "doc", "page", "layout", "block", "content", "value"
    query: dict | list[dict] | None = None,    # Query conditions (MongoDB style)
    query_from: ElemType | type | None = None, # Element type to query from
    skip: int | None = None,                   # Number of results to skip
    limit: int | None = None,                  # Maximum number of results
) -> Iterable[Element]
# Count elements
def count(
    self,
    elem_type: ElemType | type,
    query: dict | list[dict] | None = None,
    query_from: ElemType | type | None = None,
    estimated: bool = False,  # Whether to use an estimated count
) -> int
# Get the list of distinct values of a field
def distinct_values(
    self,
    elem_type: ElemType,
    field: Literal["tags", "providers", "provider", "versions", "version"],
    query: dict | None = None,
) -> list[str]
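The `query` parameter is described as MongoDB style. To show how such a filter reads, here is a small local matcher that evaluates a subset of MongoDB filter semantics against plain dicts. The `matches` function is hypothetical and exists only for illustration; `$in`, `$gte`, and `$lt` are standard MongoDB operators, but which operators the DocStore server accepts is not stated here and is an assumption.

```python
def matches(record: dict, query: dict) -> bool:
    """Evaluate a small subset of MongoDB-style filters against a dict."""
    for field, cond in query.items():
        value = record.get(field)
        if isinstance(cond, dict):
            for op, arg in cond.items():
                if op == "$in":
                    # For list fields, any element appearing in arg is a match.
                    values = value if isinstance(value, list) else [value]
                    ok = any(v in arg for v in values)
                elif op == "$gte":
                    ok = value is not None and value >= arg
                elif op == "$lt":
                    ok = value is not None and value < arg
                else:
                    raise ValueError(f"unsupported operator: {op}")
                if not ok:
                    return False
        elif isinstance(value, list):
            # A scalar condition matches a list field if it equals any element.
            if cond not in value:
                return False
        elif value != cond:
            return False
    return True


docs = [
    {"id": "doc-a", "num_pages": 3, "tags": ["invoice"]},
    {"id": "doc-b", "num_pages": 12, "tags": ["paper", "reviewed"]},
    {"id": "doc-c", "num_pages": 40, "tags": ["paper"]},
]

# Documents tagged "paper" with at least 10 and fewer than 20 pages.
query = {"tags": "paper", "num_pages": {"$gte": 10, "$lt": 20}}
selected = [d["id"] for d in docs if matches(d, query)]
print(selected)  # ['doc-b']
```

A dict of the same shape as `query` is what would be passed to `find` or `count`, with the filtering done by the server rather than locally.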
Type-specific find methods
def find_docs(
    self,
    query: dict | list[dict] | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Doc]
def find_pages(
    self,
    query: dict | list[dict] | None = None,
    doc_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Page]
def find_layouts(
    self,
    query: dict | list[dict] | None = None,
    page_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Layout]
def find_blocks(
    self,
    query: dict | list[dict] | None = None,
    page_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Block]
def find_contents(
    self,
    query: dict | list[dict] | None = None,
    page_id: str | None = None,
    block_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Content]
def find_values(
    self,
    query: dict | list[dict] | None = None,
    elem_id: str | None = None,
    key: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Value]
Convenience methods for listing distinct values
def doc_tags(self) -> list[str]
def page_tags(self) -> list[str]
def page_providers(self) -> list[str]
def layout_providers(self) -> list[str]
def layout_tags(self) -> list[str]
def block_tags(self) -> list[str]
def block_versions(self) -> list[str]
def content_versions(self) -> list[str]
def content_tags(self) -> list[str]
Write operations
Inserting elements
# Insert a document
def insert_doc(self, doc_input: DocInput, skip_ext_check: bool = False) -> Doc
# Insert a page
def insert_page(self, page_input: PageInput, skip_hash_check: bool = False) -> Page
# Insert a layout
def insert_layout(
    self,
    page_id: str,
    provider: str,
    layout_input: LayoutInput,
    insert_blocks: bool = False,
    upsert: bool = False
) -> Layout
# Insert blocks
def insert_block(self, page_id: str, block_input: BlockInput) -> Block
def insert_blocks(self, page_id: str, blocks: list[BlockInput]) -> list[Block]
def insert_standalone_block(self, block_input: StandaloneBlockInput) -> Block
# Insert content
def insert_content(
    self,
    block_id: str,
    version: str,
    content_input: ContentInput,
    upsert: bool = False
) -> Content
# Insert a layout made of blocks with content
def insert_content_blocks_layout(
    self,
    page_id: str,
    provider: str,
    content_blocks: list[ContentBlockInput],
    upsert: bool = False,
) -> Layout
# Insert a value
def insert_value(self, elem_id: str, key: str, value_input: ValueInput) -> Value
# Insert an evaluation layout or evaluation content
def insert_eval_layout(
    self,
    layout_id: str,
    provider: str,
    blocks: list[EvalLayoutBlock] | None = None,
    relations: list[dict] | None = None,
) -> EvalLayout
def insert_eval_content(
    self,
    content_id: str,
    version: str,
    format: str,
    content: str,
) -> EvalContent
Convenience insert methods (from local files)
# Upload a local file and insert it as a document
def insert_local_doc(self, local_pdf_path: str) -> Doc
# Upload a local image and insert it as a page
def insert_local_page(self, local_image_path: str) -> Page
# Upload a local image and insert it as a standalone block
def insert_local_block(self, type: str, local_image_path: str) -> Block
Upsert methods (update or insert)
def upsert_layout(
    self, page_id: str, provider: str, layout_input: LayoutInput, insert_blocks: bool = False
) -> Layout
def upsert_content(self, block_id: str, version: str, content_input: ContentInput) -> Content
Tag and attribute operations
Single-element operations
# Tag operations
def add_tag(self, elem_id: str, tag: str) -> None
def del_tag(self, elem_id: str, tag: str) -> None
# Attribute operations
def add_attr(self, elem_id: str, name: str, attr_input: AttrInput) -> None
def add_attrs(self, elem_id: str, attrs: dict[str, AttrValueType]) -> None
def del_attr(self, elem_id: str, name: str) -> None
# Metric operations
def add_metric(self, elem_id: str, name: str, metric_input: MetricInput) -> None
def del_metric(self, elem_id: str, name: str) -> None
# Change tags, attributes, and metrics in one call
def tagging(self, elem_id: str, tagging_input: TaggingInput) -> None
Batch operations
def batch_add_tag(self, elem_type: ElemType, tag: str, elem_ids: list[str]) -> None
def batch_del_tag(self, elem_type: ElemType, tag: str, elem_ids: list[str]) -> None
def batch_tagging(self, elem_type: ElemType, inputs: list[TaggingInput]) -> None
Task operations
# List tasks
def list_tasks(
    self,
    query: dict | None = None,
    target: str | None = None,
    batch_id: str | None = None,
    command: str | None = None,
    status: str | None = None,
    create_user: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> list[Task]
# Get a task
def get_task(self, task_id: str) -> Task
# Insert a task
def insert_task(self, target_id: str, task_input: TaskInput) -> Task
# Grab new tasks (for task handlers)
def grab_new_tasks(self, command: str, num: int = 10, hold_sec: int = 3600) -> list[Task]
def grab_new_task(self, command: str, hold_sec: int = 3600) -> Task | None
# Update the status of a task
def update_task(
    self,
    task_id: str,
    command: str,
    status: Literal["done", "error", "skipped"],
    error_message: str | None = None,
    check_mismatch: bool = False,
    task: Task | None = None,
) -> None
def update_grabbed_task(
    self,
    task: Task,
    status: Literal["done", "error", "skipped"],
    error_message: str | None = None,
) -> None
# Count tasks
def count_tasks(self, command: str | None = None) -> list[TaskCount]
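The pairing of `grab_new_task` and `update_grabbed_task` suggests a simple worker loop: grab a task, run a handler, then report `done` or `error`. The sketch below imitates that flow against an in-memory stand-in. `FakeTask`, `FakeTaskClient`, `run_worker`, the `"ocr"` command name, and the intermediate `"grabbed"` status are all hypothetical; only the two method names and their parameters are taken from the listing above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTask:
    """Hypothetical stand-in carrying a few of the Task fields."""
    id: str
    target: str
    command: str
    status: str = "new"
    error_message: Optional[str] = None


class FakeTaskClient:
    """Hypothetical in-memory client exposing only the two calls used below."""

    def __init__(self, tasks):
        self._queue = list(tasks)

    def grab_new_task(self, command, hold_sec=3600):
        for task in self._queue:
            if task.command == command and task.status == "new":
                task.status = "grabbed"  # assumed intermediate status
                return task
        return None

    def update_grabbed_task(self, task, status, error_message=None):
        task.status = status
        task.error_message = error_message


def run_worker(client, command, handler):
    """Process tasks until none are left; return how many were handled."""
    handled = 0
    while (task := client.grab_new_task(command)) is not None:
        try:
            handler(task)
        except Exception as exc:
            # Report the failure so the task does not stay held.
            client.update_grabbed_task(task, "error", error_message=str(exc))
        else:
            client.update_grabbed_task(task, "done")
        handled += 1
    return handled


def handler(task):
    if task.target == "page-bad":
        raise RuntimeError("cannot process page")


tasks = [FakeTask("t1", "page-ok", "ocr"), FakeTask("t2", "page-bad", "ocr")]
handled = run_worker(FakeTaskClient(tasks), "ocr", handler)
print(handled, [(t.id, t.status, t.error_message) for t in tasks])
```

Reporting a status in both branches matters because a grabbed task is presumably held for `hold_sec` seconds; a worker that crashes without updating it would leave the task unavailable until the hold expires.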
Administrative operations
User management
def list_users(self) -> list[User]
def get_user(self, name: str) -> User
def insert_user(self, user_input: UserInput) -> User
def update_user(self, name: str, user_update: UserUpdate) -> User
Known name management (tag, attribute, and metric definitions)
def list_known_names(self) -> list[KnownName]
def insert_known_name(self, known_name_input: KnownNameInput) -> KnownName
def update_known_name(self, name: str, known_name_update: KnownNameUpdate) -> KnownName
def add_known_option(self, attr_name: str, option_name: str, option_input: KnownOptionInput) -> None
def del_known_option(self, attr_name: str, option_name: str) -> None
S3 bucket management
def list_s3_buckets(self) -> list[S3Bucket]
Embedding operations
# List and manage embedding models
def list_embedding_models(self) -> list[EmbeddingModel]
def get_embedding_model(self, name: str) -> EmbeddingModel
def insert_embedding_model(self, embedding_model: EmbeddingModel) -> EmbeddingModel
def update_embedding_model(self, name: str, update: EmbeddingModelUpdate) -> EmbeddingModel
# Add embeddings
def add_embeddings(
    self,
    elem_type: EmbeddableElemType,  # "page" | "block"
    model: str,
    embeddings: list[EmbeddingInput]
) -> None
# Search embeddings
def search_embeddings(
    self,
    elem_type: EmbeddableElemType,
    model: str,
    query: EmbeddingQuery
) -> list[Embedding]
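`search_embeddings` takes a query vector and a result count `k`. The similarity metric used by the server is not stated here, so the following is only a conceptual sketch of what a k-nearest search computes, using brute-force cosine similarity in plain Python. The function names `cosine` and `top_k` and the sample element IDs are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, embeddings, k):
    """Return the k (elem_id, score) pairs most similar to query_vector."""
    scored = [(elem_id, cosine(query_vector, vec)) for elem_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


embeddings = {
    "page-a": [1.0, 0.0],
    "page-b": [0.0, 1.0],
    "page-c": [1.0, 1.0],
}
results = top_k([1.0, 0.1], embeddings, k=2)
print([elem_id for elem_id, _ in results])  # ['page-a', 'page-c']
```

A real deployment would typically rely on a vector index rather than a linear scan, but the inputs and outputs have the same shape as `EmbeddingQuery.vector`, `EmbeddingQuery.k`, and the returned list.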
Trigger operations
def list_triggers(self) -> list[Trigger]
def get_trigger(self, trigger_id: str) -> Trigger
def insert_trigger(self, trigger_input: TriggerInput) -> Trigger
def update_trigger(self, trigger_id: str, trigger_input: TriggerInput) -> Trigger
def delete_trigger(self, trigger_id: str) -> None
File operations
# Read a file
def read_file(self, file_path: str, allow_local: bool = True) -> bytes
# Read an image
def read_image(self, file_path: str) -> PIL.Image.Image
# Get the size of a file
def stat_file_size(self, file_path: str, allow_local: bool = True) -> int
# Upload a local file to S3
def upload_local_file(
    self,
    file_type: Literal["doc", "page", "block"],
    file_path: str
) -> str
# Get an S3 client
def get_s3_client(self, path: str) -> boto3.client
Convenience methods
Generic get method
# Determine the element type from the ID prefix and fetch the element
def get(self, elem_id: str) -> DocElement
Iterative processing
# Process elements in parallel
def iterate(
    self,
    elem_type: ElemType | type,
    func: Callable[[int, DocElement], None],
    query: dict | list[dict] | None = None,
    query_from: ElemType | type | None = None,
    max_workers: int = 10,
    total: int | None = None,
) -> None
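`iterate` calls `func(index, element)` for each matching element with up to `max_workers` workers, so the callback may run concurrently and any shared state should be guarded. The sketch below imitates that calling convention over a plain list with a thread pool. `iterate_locally` is a hypothetical helper, not the library's implementation, and it is an assumption that the library parallelizes with threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def iterate_locally(elements, func, max_workers=10):
    """Call func(index, element) for every element on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, i, elem) for i, elem in enumerate(elements)]
        for future in futures:
            future.result()  # re-raise any exception from the callback


lock = threading.Lock()
totals = {"pages": 0}


def count_pages(index, doc):
    # The callback runs on worker threads, so protect the shared counter.
    with lock:
        totals["pages"] += doc["num_pages"]


sample_docs = [{"num_pages": 3}, {"num_pages": 12}, {"num_pages": 40}]
iterate_locally(sample_docs, count_pages, max_workers=4)
print(totals["pages"])  # 55
```

Keeping the callback small and pushing results into a lock-protected structure (or a `queue.Queue`) avoids races without serializing the expensive part of the work.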
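Exception handling
DocClient defines the following exception types:
| Exception class | Description |
|---|---|
| `ElementNotFoundError` | The element does not exist |
| `NotFoundError` | The resource does not exist (generic) |
| `DocExistsError` | The document already exists (carries `pdf_path` and `pdf_hash`) |
| `PageExistsError` | The page already exists (carries `image_path` and `image_hash`) |
| `ElementExistsError` | The element already exists |
| `AlreadyExistsError` | The resource already exists (generic) |
| `UnauthorizedError` | Unauthorized |
| `TaskMismatchError` | The task status does not match |
Usage example:
from doc_store.interface import ElementNotFoundError, DocExistsError
# DocInput must also be imported; its module is not stated in this document.
try:
    doc = client.get_doc("doc-xxx")
except ElementNotFoundError:
    print("Document not found")
try:
    doc = client.insert_doc(DocInput(pdf_path="s3://..."))
except DocExistsError as e:
    print(f"Document already exists: {e.pdf_path}, hash: {e.pdf_hash}")
    doc = client.get_doc_by_pdf_hash(e.pdf_hash)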
Input models
DocInput
class DocInput:
    pdf_path: str                      # S3 path of the PDF file (required)
    pdf_filename: str | None = None    # PDF file name
    orig_path: str | None = None       # Path of the original file (Word/PPT)
    orig_filename: str | None = None   # Name of the original file
    tags: list[str] | None = None      # List of tags
PageInput
class PageInput:
    image_path: str                    # S3 path of the image (required)
    image_dpi: int | None = None       # Image DPI
    doc_id: str | None = None          # ID of the owning document
    page_idx: int | None = None        # Page index
    tags: list[str] | None = None      # List of tags
LayoutInput
class LayoutInput:
    blocks: list[ContentBlockInput]         # List of blocks (required)
    masks: list[MaskBlock] = []             # List of masks
    relations: list[dict] | None = None     # List of relations
    is_human_label: bool = False            # Whether labeled by a human
    tags: list[str] | None = None           # List of tags
BlockInput
class BlockInput:
    type: str                                       # Block type (required)
    bbox: list[float]                               # Normalized bounding box [x1, y1, x2, y2] (required)
    angle: Literal[0, 90, 180, 270] | None = None   # Rotation angle
    score: float | None = None                      # Confidence score
    tags: list[str] | None = None                   # List of tags
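Detectors usually report boxes in pixels, while `BlockInput.bbox` expects normalized values. As a minimal sketch, a pixel box can be normalized against the page image size before building the input. The helper `normalize_bbox` is hypothetical, and the plain dict below merely mirrors the `BlockInput` fields rather than constructing the real class.

```python
def normalize_bbox(pixel_box, image_width, image_height):
    """Convert a pixel (x1, y1, x2, y2) box to normalized 0-1 coordinates."""
    x1, y1, x2, y2 = pixel_box
    # Reorder corners so that x1 <= x2 and y1 <= y2.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))

    def clamp(v):
        # Keep values inside [0, 1] even if the box spills over the image edge.
        return min(1.0, max(0.0, v))

    return [
        clamp(x1 / image_width),
        clamp(y1 / image_height),
        clamp(x2 / image_width),
        clamp(y2 / image_height),
    ]


# Field names follow BlockInput; the values are illustrative.
block_fields = {
    "type": "table",
    "bbox": normalize_bbox((100, 500, 500, 1500), image_width=1000, image_height=2000),
    "angle": 0,
}
print(block_fields["bbox"])  # [0.1, 0.25, 0.5, 0.75]
```

Clamping is a design choice for tolerating slightly out-of-range detector output; raising an error instead may be preferable when such boxes indicate a bug upstream.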
ContentBlockInput
class ContentBlockInput(BlockInput):
    format: str | None = None               # Content format
    content: str | None = None              # Content text
    content_tags: list[str] | None = None   # Content tags
StandaloneBlockInput
class StandaloneBlockInput:
    type: str                       # Block type (required)
    image_path: str                 # S3 path of the image (required)
    tags: list[str] | None = None   # List of tags
ContentInput
class ContentInput:
    format: str                     # Content format (required)
    content: str                    # Content text (required)
    is_human_label: bool = False    # Whether labeled by a human
    tags: list[str] | None = None   # List of tags
ValueInput
class ValueInput:
    value: Any                 # Value (required)
    type: str | None = None    # Value type
AttrInput
class AttrInput:
    value: str | list[str] | int | bool   # Attribute value (required)
MetricInput
class MetricInput:
    value: float | int   # Metric value (required)
TaggingInput
class TaggingInput:
    elem_id: str | None = None                          # Element ID (required for batch operations)
    tags: list[str] | None = None                       # Tags to add
    attrs: dict[str, AttrValueType] | None = None       # Attributes to add
    metrics: dict[str, MetricValueType] | None = None   # Metrics to add
    del_tags: list[str] | None = None                   # Tags to delete
    del_attrs: list[str] | None = None                  # Attributes to delete
    del_metrics: list[str] | None = None                # Metrics to delete
TaskInput
class TaskInput:
    command: str                         # Command name (required)
    args: dict[str, Any] | None = None   # Command arguments
    priority: int = 0                    # Priority
    batch_id: str | None = None          # Batch ID
EmbeddingInput
class EmbeddingInput:
    elem_id: str          # Element ID (required)
    vector: list[float]   # Embedding vector (required)
EmbeddingQuery
class EmbeddingQuery:
    vector: list[float]         # Query vector (required)
    k: int                      # Number of results to return (required)
    show_vector: bool = False   # Whether to return the vectors
TriggerInput
class TriggerInput:
    name: str                            # Trigger name (required)
    description: str                     # Description (required)
    condition: TriggerCondition          # Trigger condition (required)
    actions: list[TriggerAction]         # List of trigger actions (required)
    disabled: bool = False               # Whether the trigger is disabled
    display_order: float | None = None   # Display order
Common element methods
All objects that inherit from DocElement (Doc, Page, Layout, Block, Content) share the following common methods:
Tag operations
def add_tag(self, tag: str) -> None
def del_tag(self, tag: str) -> None
Attribute operations
def add_attr(self, name: str, attr_input: AttrInput) -> None
def add_attrs(self, attrs: dict[str, AttrValueType]) -> None
def del_attr(self, name: str) -> None
Metric operations
def add_metric(self, name: str, metric_input: MetricInput) -> None
def del_metric(self, name: str) -> None
Combined tag, attribute, and metric operations
def tagging(self, tagging_input: TaggingInput) -> None
Value operations
def try_get_value(self, key: str) -> Value | None
def get_value(self, key: str) -> Value
def find_values(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Value]
def insert_value(self, key: str, value_input: ValueInput) -> Value
Task operations
def list_tasks(
    self,
    query: dict | None = None,
    batch_id: str | None = None,
    command: str | None = None,
    status: str | None = None,
    create_user: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> list[Task]
def insert_task(self, task_input: TaskInput) -> Task