Storage for documents

doc_store

DocStore is a client library for managing documents, pages, layouts, and content. This document describes the full DocClient API.

Installation and configuration

from doc_store import DocClient

Quick start

from doc_store import DocClient

# Create a client instance
client = DocClient(server_url="http://localhost:8000")

# Fetch a document
doc = client.get_doc("doc-xxx")

# Iterate over its pages
for page in doc.pages:
    print(page.id, page.image_path)

# Close the client
client.close()

# Or use a context manager
with DocClient(server_url="http://localhost:8000") as client:
    doc = client.get_doc("doc-xxx")

Data models

Doc (document)

A Doc is the top-level data structure and represents one PDF file.

Attributes

Attribute	Type	Description
`id`	`str`	Unique document identifier
`rid`	`int`	Database row ID
`create_time`	`int | None`	Creation timestamp
`update_time`	`int | None`	Last-update timestamp
`pdf_path`	`str`	S3 path of the PDF file
`pdf_filename`	`str | None`	PDF file name
`pdf_filesize`	`int`	PDF file size in bytes
`pdf_hash`	`str`	SHA256 hash of the PDF file
`num_pages`	`int`	Number of pages in the PDF
`page_width`	`float`	Page width
`page_height`	`float`	Page height
`metadata`	`dict`	PDF metadata
`orig_path`	`str | None`	Path of the original file (e.g. Word/PPT)
`orig_filesize`	`int | None`	Original file size
`orig_filename`	`str | None`	Original file name
`orig_hash`	`str | None`	Hash of the original file
`tags`	`list[str]`	List of tags
`attrs`	`dict[str, AttrValueType]`	Attribute dictionary
`metrics`	`dict[str, MetricValueType]`	Metric dictionary
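
`pdf_hash` is described as the SHA256 hash of the PDF file. The following is a minimal sketch of computing such a hash locally, for example before looking a document up by hash. It assumes the stored value is the lowercase hex digest of the raw file bytes, which this document does not confirm; `sha256_of_file` is a hypothetical helper and not part of doc_store.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file standing in for a PDF.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_hash = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(demo_hash)
```

Reading in chunks keeps memory use flat for large PDFs.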

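Dynamic properties

Attribute	Type	Description
`pdf_bytes`	`bytes`	Binary content of the PDF file
`pdf`	`PDFDocument`	PDF document object
`pages`	`list[Page]`	All pages of the document, sorted by page index

Methods

# Find pages of this document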
def find_pages(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Page]

# Insert a page
def insert_page(self, page_idx: int, page_input: DocPageInput) -> Page

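Page (page)

A Page represents a single page of a document.

Attributes

Attribute	Type	Description
`id`	`str`	Unique page identifier
`rid`	`int`	Database row ID
`create_time`	`int | None`	Creation timestamp
`update_time`	`int | None`	Last-update timestamp
`doc_id`	`str | None`	ID of the owning document
`page_idx`	`int | None`	Page index
`image_path`	`str`	S3 path of the page image
`image_filesize`	`int`	Image file size in bytes
`image_hash`	`str`	SHA256 hash of the image
`image_width`	`int`	Image width in pixels
`image_height`	`int`	Image height in pixels
`image_dpi`	`int | None`	Image DPI
`providers`	`list[str]`	Providers that have already processed this page
`old_image_path`	`str | None`	Previous image path, from before migration
`tags`	`list[str]`	List of tags
`attrs`	`dict[str, AttrValueType]`	Attribute dictionary
`metrics`	`dict[str, MetricValueType]`	Metric dictionary

Dynamic properties

Attribute	Type	Description
`image_bytes`	`bytes`	Binary content of the page image
`image`	`PIL.Image.Image`	Page image object
`image_presigned_link`	`str`	Presigned link to the page image (valid for 24 hours)
`image_pub_link`	`str`	Public link to the page image
`super_block`	`Block`	The page's super block
`doc`	`Doc | None`	Owning document

Methods

# Try to get a layout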
def try_get_layout(self, provider: str, expand: bool = False) -> Layout | None

# Get a layout
def get_layout(self, provider: str, expand: bool = False) -> Layout

# Find layouts
def find_layouts(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Layout]

# Find blocks
def find_blocks(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Block]

# Find contents
def find_contents(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Content]

# Insert a layout
def insert_layout(
    self, 
    provider: str, 
    layout_input: LayoutInput, 
    insert_blocks: bool = False, 
    upsert: bool = False
) -> Layout

# Update or insert a layout
def upsert_layout(
    self, 
    provider: str, 
    layout_input: LayoutInput, 
    insert_blocks: bool = False
) -> Layout

# Insert a single block
def insert_block(self, block_input: BlockInput) -> Block

# Insert blocks in bulk
def insert_blocks(self, blocks: list[BlockInput]) -> list[Block]

# Insert a layout made of blocks with content
def insert_content_blocks_layout(
    self,
    provider: str,
    content_blocks: list[ContentBlockInput],
    upsert: bool = False,
) -> Layout

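Layout (layout)

A Layout is the result of structural analysis of a page and contains multiple blocks.

Attributes

Attribute	Type	Description
`id`	`str`	Unique layout identifier
`rid`	`int`	Database row ID
`create_time`	`int | None`	Creation timestamp
`update_time`	`int | None`	Last-update timestamp
`page_id`	`str`	ID of the owning page
`provider`	`str`	Identifier of the layout provider
`masks`	`list[MaskBlock]`	List of mask blocks
`blocks`	`list[Block]`	List of blocks
`relations`	`list[dict]`	List of relations between blocks
`contents`	`list[Content]`	List of contents
`is_human_label`	`bool`	Whether the layout was labeled by a human
`tags`	`list[str]`	List of tags
`attrs`	`dict[str, AttrValueType]`	Attribute dictionary
`metrics`	`dict[str, MetricValueType]`	Metric dictionary

Dynamic properties

Attribute	Type	Description
`page`	`Page`	Owning page
`masked_image`	`PIL.Image.Image`	Page image with masks applied
`framed_image`	`PIL.Image.Image`	Page image with block borders drawn

Methods

# List all content versions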
def list_versions(self) -> list[str]

# List all blocks
def list_blocks(self) -> list[Block]

# List the contents of a given version
def list_contents(self, version: str | None = None) -> list[Content]

# Expand the layout (load all blocks and contents)
def expand(self) -> Layout
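
A layout can hold contents in several versions, and `list_contents` takes a version argument. The sketch below shows one way to decide which version to request, given a list such as the one `list_versions()` returns. `pick_version`, the preference order, and the `"human"` version name are hypothetical and only serve the illustration.

```python
from typing import Optional, Sequence


def pick_version(available: Sequence[str], preferred: Sequence[str]) -> Optional[str]:
    """Return the first preferred version that is available, else None."""
    available_set = set(available)
    for version in preferred:
        if version in available_set:
            return version
    return None


# In real use `available` would come from layout.list_versions().
available = ["ocr-v1", "llm-gpt4"]
chosen = pick_version(available, preferred=["human", "llm-gpt4", "ocr-v1"])
print(chosen)
```

The chosen value could then be passed as the `version` argument of `list_contents`.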

Block (block)

A Block is a region of a page, such as a title, paragraph, table, or image.

Attributes

Attribute	Type	Description
`id`	`str`	Unique block identifier
`rid`	`int`	Database row ID
`create_time`	`int | None`	Creation timestamp
`update_time`	`int | None`	Last-update timestamp
`layout_id`	`str | None`	ID of the owning layout
`provider`	`str | None`	Identifier of the block provider
`page_id`	`str | None`	ID of the owning page
`type`	`str`	Block type (e.g. title, text, table, image)
`bbox`	`list[float]`	Normalized bounding box [x1, y1, x2, y2], in the range 0-1
`angle`	`Literal[None, 0, 90, 180, 270]`	Rotation angle
`score`	`float | None`	Detection confidence score
`image_path`	`str | None`	Image path of a standalone block
`image_filesize`	`int | None`	Image file size
`image_hash`	`str | None`	Image hash
`image_width`	`int | None`	Image width
`image_height`	`int | None`	Image height
`versions`	`list[str]`	Content versions that already exist
`tags`	`list[str]`	List of tags
`attrs`	`dict[str, AttrValueType]`	Attribute dictionary
`metrics`	`dict[str, MetricValueType]`	Metric dictionary
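
Because `bbox` is normalized to the range 0-1, it has to be scaled by the page image size before it can be used as pixel coordinates. The following is a minimal sketch of that conversion. It assumes x values scale with the image width, y values with the image height, and that the origin is the top-left corner; none of this is confirmed here, and `bbox_to_pixels` is a hypothetical helper rather than a doc_store function.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized [x1, y1, x2, y2] box to integer pixel coordinates."""
    if len(bbox) != 4:
        raise ValueError("bbox must have four values: x1, y1, x2, y2")
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"bbox is not normalized or is inverted: {list(bbox)}")
    return (
        round(x1 * image_width),
        round(y1 * image_height),
        round(x2 * image_width),
        round(y2 * image_height),
    )


pixels = bbox_to_pixels([0.25, 0.5, 0.75, 1.0], image_width=800, image_height=600)
print(pixels)
```

The resulting tuple has the (left, upper, right, lower) shape that Pillow's `Image.crop` accepts.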

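Dynamic properties

Attribute	Type	Description
`page`	`Page | None`	Owning page
`image`	`PIL.Image.Image`	Block image (cropped from the page, or the standalone image)
`image_bytes`	`bytes`	Binary content of the block image
`image_pub_link`	`str`	Public link to the block image

Methods

# Try to get content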
def try_get_content(self, version: str) -> Content | None

# Get content
def get_content(self, version: str) -> Content

# Find contents
def find_contents(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Content]

# Insert content
def insert_content(
    self, 
    version: str, 
    content_input: ContentInput, 
    upsert: bool = False
) -> Content

# Update or insert content
def upsert_content(self, version: str, content_input: ContentInput) -> Content

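Content (content)

A Content is the recognition result for a block, such as OCR text or a LaTeX formula.

Attributes

Attribute	Type	Description
`id`	`str`	Unique content identifier
`rid`	`int`	Database row ID
`create_time`	`int | None`	Creation timestamp
`update_time`	`int | None`	Last-update timestamp
`page_id`	`str | None`	ID of the owning page
`block_id`	`str`	ID of the owning block
`version`	`str`	Content version (e.g. ocr-v1, llm-gpt4)
`format`	`str`	Content format (e.g. text, html, latex, markdown)
`content`	`str`	Content text
`is_human_label`	`bool`	Whether the content was labeled by a human
`tags`	`list[str]`	List of tags
`attrs`	`dict[str, AttrValueType]`	Attribute dictionary
`metrics`	`dict[str, MetricValueType]`	Metric dictionary

Dynamic properties

Attribute	Type	Description
`page`	`Page | None`	Owning page
`block`	`Block`	Owning block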

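Value (value)

A Value stores extension data for an element, such as an embedding vector.

Attributes

Attribute	Type	Description
`id`	`str`	Unique value identifier
`rid`	`int`	Database row ID
`elem_id`	`str`	ID of the owning element
`key`	`str`	Key name
`type`	`str`	Value type (e.g. ndarray, json)
`value`	`Any`	Value payload

Dynamic properties

Attribute	Type	Description
`elem`	`DocElement`	Owning element

Methods

# Decode the value (e.g. decode a base64-encoded ndarray)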
def decode(self) -> Value
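
The comment on `decode` mentions ndarray values that travel as base64 text. The actual wire format is not documented here, so the following is only a sketch of the general idea using the standard library. The little-endian float32 layout and the helper names `encode_floats` and `decode_floats` are assumptions, not doc_store's format.

```python
import base64
import struct
from typing import List


def encode_floats(values: List[float]) -> str:
    """Pack floats as little-endian float32 and return base64 text."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_floats(text: str) -> List[float]:
    """Inverse of encode_floats: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    count = len(raw) // 4  # four bytes per float32
    return list(struct.unpack(f"<{count}f", raw))


encoded = encode_floats([0.5, -1.25, 3.0])
decoded = decode_floats(encoded)
print(encoded, decoded)
```

The sample values are exactly representable in float32, so the round trip returns them unchanged; arbitrary floats would lose precision.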

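Task (task)

Tasks are used to process operations on documents asynchronously.

Attributes

Attribute	Type	Description
`id`	`str`	Unique task identifier
`rid`	`int`	Database row ID
`target`	`str`	Target element ID
`batch_id`	`str`	Batch ID
`command`	`str`	Command name
`args`	`dict[str, Any]`	Command arguments
`priority`	`int`	Priority
`status`	`str`	Task status
`create_user`	`str`	User who created the task
`update_user`	`str | None`	User who last updated the task
`error_message`	`str | None`	Error message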

DocClient API

Initialization

def __init__(
    self,
    server_url: str | None = None,      # Server URL
    prefix: str = "/api/v1",             # API path prefix
    timeout: int = 300,                   # Read timeout (seconds)
    connect_timeout: int = 30,            # Connect timeout (seconds)
    decode_value: bool = True,            # Whether to decode Values automatically
) -> None

Health check

# Check the health of the server
def health_check(self, show_stats: bool = False) -> dict

Read operations

Getting a single element

# Get a document
def get_doc(self, doc_id: str) -> Doc
def get_doc_by_pdf_path(self, pdf_path: str) -> Doc
def get_doc_by_pdf_hash(self, pdf_hash: str) -> Doc

# Get a page
def get_page(self, page_id: str) -> Page
def get_page_by_image_path(self, image_path: str) -> Page
def get_page_by_image_hash(self, image_hash: str) -> Page

# Get a layout
def get_layout(self, layout_id: str, expand: bool = False) -> Layout
def get_layout_by_page_id_and_provider(
    self, page_id: str, provider: str, expand: bool = False
) -> Layout

# Get a block
def get_block(self, block_id: str) -> Block
def get_block_by_image_path(self, image_path: str) -> Block
def get_super_block(self, page_id: str) -> Block

# Get content
def get_content(self, content_id: str) -> Content
def get_content_by_block_id_and_version(self, block_id: str, version: str) -> Content

# Get a value
def get_value(self, value_id: str) -> Value
def get_value_by_elem_id_and_key(self, elem_id: str, key: str) -> Value

# Get an evaluation layout or content
def get_eval_layout(self, eval_layout_id: str) -> EvalLayout
def get_eval_content(self, eval_content_id: str) -> EvalContent

Try methods (return None when the element does not exist)

def try_get(self, elem_id: str) -> DocElement | None
def try_get_doc(self, doc_id: str) -> Doc | None
def try_get_doc_by_pdf_path(self, pdf_path: str) -> Doc | None
def try_get_doc_by_pdf_hash(self, pdf_hash: str) -> Doc | None
def try_get_page(self, page_id: str) -> Page | None
def try_get_page_by_image_path(self, image_path: str) -> Page | None
def try_get_page_by_image_hash(self, image_hash: str) -> Page | None
def try_get_layout(self, layout_id: str, expand: bool = False) -> Layout | None
def try_get_layout_by_page_id_and_provider(
    self, page_id: str, provider: str, expand: bool = False
) -> Layout | None
def try_get_block(self, block_id: str) -> Block | None
def try_get_block_by_image_path(self, image_path: str) -> Block | None
def try_get_content(self, content_id: str) -> Content | None
def try_get_content_by_block_id_and_version(
    self, block_id: str, version: str
) -> Content | None
def try_get_value(self, value_id: str) -> Value | None
def try_get_value_by_elem_id_and_key(self, elem_id: str, key: str) -> Value | None
def try_get_user(self, name: str) -> User | None
def try_get_task(self, task_id: str) -> Task | None
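
One way to picture how a `try_get_*` method relates to its `get_*` counterpart is as a wrapper that turns a not-found error into `None`. The sketch below illustrates that idea only; it is an assumption about behavior, not the library's implementation. `get_or_none`, `fake_get_doc`, and the local `ElementNotFoundError` class are stand-ins so the example runs without a server.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ElementNotFoundError(Exception):
    """Local stand-in for doc_store's exception, so the sketch runs alone."""


def get_or_none(getter: Callable[[str], T], elem_id: str) -> Optional[T]:
    """Call a raising getter and turn 'not found' into None."""
    try:
        return getter(elem_id)
    except ElementNotFoundError:
        return None


# Stand-in store and getter; a real call would be e.g. client.get_doc.
_fake_docs = {"doc-1": {"id": "doc-1"}}


def fake_get_doc(doc_id: str) -> dict:
    if doc_id not in _fake_docs:
        raise ElementNotFoundError(doc_id)
    return _fake_docs[doc_id]


print(get_or_none(fake_get_doc, "doc-1"))
print(get_or_none(fake_get_doc, "doc-missing"))
```

Only the not-found error is swallowed; any other exception still propagates to the caller.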

Querying elements

# Generic find method (returns a streaming iterator)
def find(
    self,
    elem_type: ElemType | type,           # Element type: "doc", "page", "layout", "block", "content", "value"
    query: dict | list[dict] | None = None,  # Query conditions (MongoDB style)
    query_from: ElemType | type | None = None,  # Element type to query from
    skip: int | None = None,               # Number of results to skip
    limit: int | None = None,              # Maximum number of results
) -> Iterable[Element]

# Count elements
def count(
    self,
    elem_type: ElemType | type,
    query: dict | list[dict] | None = None,
    query_from: ElemType | type | None = None,
    estimated: bool = False,               # Whether to use an estimated count
) -> int

# Get the distinct values of a field
def distinct_values(
    self,
    elem_type: ElemType,
    field: Literal["tags", "providers", "provider", "versions", "version"],
    query: dict | None = None,
) -> list[str]
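
Since `find` returns a streaming iterator and accepts `skip` and `limit`, results can be consumed lazily or in slices. The sketch below illustrates both with a stand-in client. `FakeClient` and `take` are hypothetical, the stand-in understands only a single `tags` condition, and which MongoDB operators the real server supports is not stated in this document.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Optional


class FakeClient:
    """Stand-in with a find() shaped like DocClient.find; filters on one tag."""

    def __init__(self, pages: List[Dict[str, Any]]) -> None:
        self._pages = pages

    def find(
        self,
        elem_type: str,
        query: Optional[dict] = None,
        skip: Optional[int] = None,
        limit: Optional[int] = None,
    ) -> Iterator[Dict[str, Any]]:
        tag = (query or {}).get("tags")
        matched = (p for p in self._pages if tag is None or tag in p["tags"])
        stop = None if limit is None else (skip or 0) + limit
        return islice(matched, skip or 0, stop)


def take(results: Iterable[Any], n: int) -> List[Any]:
    """Consume at most n items from a streaming result."""
    return list(islice(results, n))


client = FakeClient([
    {"id": "page-1", "tags": ["scanned"]},
    {"id": "page-2", "tags": []},
    {"id": "page-3", "tags": ["scanned"]},
])
# In MongoDB, a scalar condition on an array field matches array membership.
first = take(client.find("page", query={"tags": "scanned"}), 1)
second = list(client.find("page", query={"tags": "scanned"}, skip=1, limit=1))
print(first, second)
```

Stopping early with `take` avoids pulling the whole result set when only a few elements are needed.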

Type-specific find methods

def find_docs(
    self,
    query: dict | list[dict] | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Doc]

def find_pages(
    self,
    query: dict | list[dict] | None = None,
    doc_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Page]

def find_layouts(
    self,
    query: dict | list[dict] | None = None,
    page_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Layout]

def find_blocks(
    self,
    query: dict | list[dict] | None = None,
    page_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Block]

def find_contents(
    self,
    query: dict | list[dict] | None = None,
    page_id: str | None = None,
    block_id: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Content]

def find_values(
    self,
    query: dict | list[dict] | None = None,
    elem_id: str | None = None,
    key: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Value]

Convenience methods for listing distinct values

def doc_tags(self) -> list[str]
def page_tags(self) -> list[str]
def page_providers(self) -> list[str]
def layout_providers(self) -> list[str]
def layout_tags(self) -> list[str]
def block_tags(self) -> list[str]
def block_versions(self) -> list[str]
def content_versions(self) -> list[str]
def content_tags(self) -> list[str]

Write operations

Inserting elements

# Insert a document
def insert_doc(self, doc_input: DocInput, skip_ext_check: bool = False) -> Doc

# Insert a page
def insert_page(self, page_input: PageInput, skip_hash_check: bool = False) -> Page

# Insert a layout
def insert_layout(
    self, 
    page_id: str, 
    provider: str, 
    layout_input: LayoutInput, 
    insert_blocks: bool = False, 
    upsert: bool = False
) -> Layout

# Insert blocks
def insert_block(self, page_id: str, block_input: BlockInput) -> Block
def insert_blocks(self, page_id: str, blocks: list[BlockInput]) -> list[Block]
def insert_standalone_block(self, block_input: StandaloneBlockInput) -> Block

# Insert content
def insert_content(
    self, 
    block_id: str, 
    version: str, 
    content_input: ContentInput, 
    upsert: bool = False
) -> Content

# Insert a layout made of blocks with content
def insert_content_blocks_layout(
    self,
    page_id: str,
    provider: str,
    content_blocks: list[ContentBlockInput],
    upsert: bool = False,
) -> Layout

# Insert a value
def insert_value(self, elem_id: str, key: str, value_input: ValueInput) -> Value

# Insert an evaluation layout or content
def insert_eval_layout(
    self,
    layout_id: str,
    provider: str,
    blocks: list[EvalLayoutBlock] | None = None,
    relations: list[dict] | None = None,
) -> EvalLayout

def insert_eval_content(
    self,
    content_id: str,
    version: str,
    format: str,
    content: str,
) -> EvalContent

Convenience insert methods (from local files)

# Upload a local file and insert it as a document
def insert_local_doc(self, local_pdf_path: str) -> Doc

# Upload a local image and insert it as a page
def insert_local_page(self, local_image_path: str) -> Page

# Upload a local image and insert it as a standalone block
def insert_local_block(self, type: str, local_image_path: str) -> Block

Upsert methods (update or insert)

def upsert_layout(
    self, page_id: str, provider: str, layout_input: LayoutInput, insert_blocks: bool = False
) -> Layout

def upsert_content(self, block_id: str, version: str, content_input: ContentInput) -> Content

Tag and attribute operations

Single-element operations

# Tag operations
def add_tag(self, elem_id: str, tag: str) -> None
def del_tag(self, elem_id: str, tag: str) -> None

# Attribute operations
def add_attr(self, elem_id: str, name: str, attr_input: AttrInput) -> None
def add_attrs(self, elem_id: str, attrs: dict[str, AttrValueType]) -> None
def del_attr(self, elem_id: str, name: str) -> None

# Metric operations
def add_metric(self, elem_id: str, name: str, metric_input: MetricInput) -> None
def del_metric(self, elem_id: str, name: str) -> None

# Apply several tag, attribute, and metric changes in one call
def tagging(self, elem_id: str, tagging_input: TaggingInput) -> None

Batch operations

def batch_add_tag(self, elem_type: ElemType, tag: str, elem_ids: list[str]) -> None
def batch_del_tag(self, elem_type: ElemType, tag: str, elem_ids: list[str]) -> None
def batch_tagging(self, elem_type: ElemType, inputs: list[TaggingInput]) -> None
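
When a very large list of element IDs has to be tagged, one option is to send it in fixed-size slices instead of a single call. The sketch below shows that pattern with a stand-in client. This document does not state any limit on `elem_ids`, so the slice size is arbitrary, and `chunked` and `RecordingClient` are hypothetical.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


class RecordingClient:
    """Stand-in that records batch_add_tag calls instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def batch_add_tag(self, elem_type: str, tag: str, elem_ids: List[str]) -> None:
        self.calls.append((elem_type, tag, list(elem_ids)))


client = RecordingClient()
page_ids = [f"page-{i}" for i in range(5)]
for batch in chunked(page_ids, size=2):
    client.batch_add_tag("page", "reviewed", batch)
print(client.calls)
```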

Task operations

# List tasks
def list_tasks(
    self,
    query: dict | None = None,
    target: str | None = None,
    batch_id: str | None = None,
    command: str | None = None,
    status: str | None = None,
    create_user: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> list[Task]

# Get a task
def get_task(self, task_id: str) -> Task

# Insert a task
def insert_task(self, target_id: str, task_input: TaskInput) -> Task

# Grab new tasks (for task workers)
def grab_new_tasks(self, command: str, num: int = 10, hold_sec: int = 3600) -> list[Task]
def grab_new_task(self, command: str, hold_sec: int = 3600) -> Task | None

# Update the status of a task
def update_task(
    self,
    task_id: str,
    command: str,
    status: Literal["done", "error", "skipped"],
    error_message: str | None = None,
    check_mismatch: bool = False,
    task: Task | None = None,
) -> None

def update_grabbed_task(
    self,
    task: Task,
    status: Literal["done", "error", "skipped"],
    error_message: str | None = None,
) -> None

# Count tasks
def count_tasks(self, command: str | None = None) -> list[TaskCount]
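
The grab and update methods above suggest a worker loop: grab a task, process it, then report `done` or `error`. The following is a minimal sketch of such a loop against a stand-in client. `FakeClient`, `FakeTask`, `drain_tasks`, and the `"ocr"` command name are hypothetical, and the exact meaning of `hold_sec` is not documented here.

```python
from typing import Callable, List, Optional


class FakeTask:
    def __init__(self, task_id: str, target: str) -> None:
        self.id = task_id
        self.target = target


class FakeClient:
    """Stand-in exposing the two task methods used below."""

    def __init__(self, tasks: List[FakeTask]) -> None:
        self._queue = list(tasks)
        self.updates: List[tuple] = []

    def grab_new_task(self, command: str, hold_sec: int = 3600) -> Optional[FakeTask]:
        return self._queue.pop(0) if self._queue else None

    def update_grabbed_task(
        self, task: FakeTask, status: str, error_message: Optional[str] = None
    ) -> None:
        self.updates.append((task.id, status, error_message))


def drain_tasks(client, command: str, handler: Callable[[FakeTask], None]) -> int:
    """Grab tasks one at a time until none are left; report each outcome."""
    processed = 0
    while True:
        task = client.grab_new_task(command)
        if task is None:
            return processed
        try:
            handler(task)
        except Exception as exc:  # report the failure instead of stopping the worker
            client.update_grabbed_task(task, "error", error_message=str(exc))
        else:
            client.update_grabbed_task(task, "done")
        processed += 1


def handler(task: FakeTask) -> None:
    if task.target == "page-bad":
        raise ValueError("cannot process page-bad")


client = FakeClient([FakeTask("task-1", "page-ok"), FakeTask("task-2", "page-bad")])
count = drain_tasks(client, "ocr", handler)
print(count, client.updates)
```

Reporting failures through `error_message` keeps one bad task from stopping the rest of the queue.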

Administration

User management

def list_users(self) -> list[User]
def get_user(self, name: str) -> User
def insert_user(self, user_input: UserInput) -> User
def update_user(self, name: str, user_update: UserUpdate) -> User

Known-name management (tag/attribute/metric definitions)

def list_known_names(self) -> list[KnownName]
def insert_known_name(self, known_name_input: KnownNameInput) -> KnownName
def update_known_name(self, name: str, known_name_update: KnownNameUpdate) -> KnownName
def add_known_option(self, attr_name: str, option_name: str, option_input: KnownOptionInput) -> None
def del_known_option(self, attr_name: str, option_name: str) -> None

S3 bucket management

def list_s3_buckets(self) -> list[S3Bucket]

Embedding operations

# List, get, insert, and update embedding models
def list_embedding_models(self) -> list[EmbeddingModel]
def get_embedding_model(self, name: str) -> EmbeddingModel
def insert_embedding_model(self, embedding_model: EmbeddingModel) -> EmbeddingModel
def update_embedding_model(self, name: str, update: EmbeddingModelUpdate) -> EmbeddingModel

# Add embeddings
def add_embeddings(
    self, 
    elem_type: EmbeddableElemType,  # "page" | "block"
    model: str, 
    embeddings: list[EmbeddingInput]
) -> None

# Search embeddings
def search_embeddings(
    self, 
    elem_type: EmbeddableElemType, 
    model: str, 
    query: EmbeddingQuery
) -> list[Embedding]
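
An embedding search takes a query vector and a result count `k` and returns the closest stored vectors. The similarity metric the server uses is not stated in this document; the sketch below uses cosine similarity purely to illustrate what a k-nearest query does. `cosine`, `top_k`, and the sample vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k (elem_id, score) pairs most similar to the query."""
    scored = [(elem_id, cosine(query, vec)) for elem_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = {
    "block-a": [1.0, 0.0],
    "block-b": [0.0, 1.0],
    "block-c": [1.0, 1.0],
}
nearest = top_k([1.0, 0.0], stored, k=2)
print([elem_id for elem_id, _ in nearest])
```

A brute-force scan like this is only practical for small collections; it is shown here for clarity, not as a description of the server.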

Trigger operations

def list_triggers(self) -> list[Trigger]
def get_trigger(self, trigger_id: str) -> Trigger
def insert_trigger(self, trigger_input: TriggerInput) -> Trigger
def update_trigger(self, trigger_id: str, trigger_input: TriggerInput) -> Trigger
def delete_trigger(self, trigger_id: str) -> None

File operations

# Read a file
def read_file(self, file_path: str, allow_local: bool = True) -> bytes

# Read an image
def read_image(self, file_path: str) -> PIL.Image.Image

# Get the size of a file
def stat_file_size(self, file_path: str, allow_local: bool = True) -> int

# Upload a local file to S3
def upload_local_file(
    self, 
    file_type: Literal["doc", "page", "block"], 
    file_path: str
) -> str

# Get an S3 client
def get_s3_client(self, path: str) -> boto3.client

Convenience methods

Generic get method

# Infer the element type from the ID prefix and fetch the element
def get(self, elem_id: str) -> DocElement

Iterating over elements

# Process elements in parallel
def iterate(
    self,
    elem_type: ElemType | type,
    func: Callable[[int, DocElement], None],
    query: dict | list[dict] | None = None,
    query_from: ElemType | type | None = None,
    max_workers: int = 10,
    total: int | None = None,
) -> None
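
Because `iterate` processes elements in parallel, a callback that updates shared state needs to be safe under concurrent calls. The sketch below builds such a callback and exercises it with a thread pool standing in for `iterate`. That `iterate` uses threads, and that the callback's integer argument is the element's position, are both assumptions; `make_type_counter` is hypothetical, and plain dicts stand in for Block objects.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict


def make_type_counter() -> tuple:
    """Return (callback, counts); the callback may be called from many threads."""
    counts: Counter = Counter()
    lock = threading.Lock()

    def callback(index: int, elem: Dict[str, Any]) -> None:
        # `index` is assumed to be the element's position in the iteration.
        with lock:
            counts[elem["type"]] += 1

    return callback, counts


callback, counts = make_type_counter()
blocks = [{"type": "text"}, {"type": "table"}, {"type": "text"}]
# A thread pool stands in for client.iterate("block", callback, max_workers=3).
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(lambda pair: callback(*pair), enumerate(blocks)))
print(dict(counts))
```

Guarding the counter with a lock keeps the tally correct regardless of how the calls interleave.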

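Exception handling

DocClient defines the following exception types:

Exception class	Description
`ElementNotFoundError`	The element does not exist
`NotFoundError`	The resource does not exist (generic)
`DocExistsError`	The document already exists (carries pdf_path and pdf_hash)
`PageExistsError`	The page already exists (carries image_path and image_hash)
`ElementExistsError`	The element already exists
`AlreadyExistsError`	The resource already exists (generic)
`UnauthorizedError`	Not authorized
`TaskMismatchError`	Task state mismatch

Usage example: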

from doc_store.interface import ElementNotFoundError, DocExistsError

try:
    doc = client.get_doc("doc-xxx")
except ElementNotFoundError:
    print("Document not found")

try:
    doc = client.insert_doc(DocInput(pdf_path="s3://..."))
except DocExistsError as e:
    print(f"Document already exists: {e.pdf_path}, hash: {e.pdf_hash}")
    doc = client.get_doc_by_pdf_hash(e.pdf_hash)

Input models

DocInput

class DocInput:
    pdf_path: str                    # S3 path of the PDF file (required)
    pdf_filename: str | None = None  # PDF file name
    orig_path: str | None = None     # Path of the original file (Word/PPT)
    orig_filename: str | None = None # Original file name
    tags: list[str] | None = None    # List of tags

PageInput

class PageInput:
    image_path: str                   # S3 path of the image (required)
    image_dpi: int | None = None      # Image DPI
    doc_id: str | None = None         # ID of the owning document
    page_idx: int | None = None       # Page index
    tags: list[str] | None = None     # List of tags

LayoutInput

class LayoutInput:
    blocks: list[ContentBlockInput]   # List of blocks (required)
    masks: list[MaskBlock] = []       # List of masks
    relations: list[dict] | None = None  # List of relations
    is_human_label: bool = False      # Whether it was labeled by a human
    tags: list[str] | None = None     # List of tags

BlockInput

class BlockInput:
    type: str                         # Block type (required)
    bbox: list[float]                 # Normalized bounding box [x1, y1, x2, y2] (required)
    angle: Literal[0, 90, 180, 270] | None = None  # Rotation angle
    score: float | None = None        # Confidence score
    tags: list[str] | None = None     # List of tags

ContentBlockInput

class ContentBlockInput(BlockInput):
    format: str | None = None         # Content format
    content: str | None = None        # Content text
    content_tags: list[str] | None = None  # Tags for the content

StandaloneBlockInput

class StandaloneBlockInput:
    type: str                         # Block type (required)
    image_path: str                   # S3 path of the image (required)
    tags: list[str] | None = None     # List of tags

ContentInput

class ContentInput:
    format: str                       # Content format (required)
    content: str                      # Content text (required)
    is_human_label: bool = False      # Whether it was labeled by a human
    tags: list[str] | None = None     # List of tags

ValueInput

class ValueInput:
    value: Any                        # Value (required)
    type: str | None = None           # Value type

AttrInput

class AttrInput:
    value: str | list[str] | int | bool  # Attribute value (required)

MetricInput

class MetricInput:
    value: float | int                # Metric value (required)

TaggingInput

class TaggingInput:
    elem_id: str | None = None        # Element ID (required for batch operations)
    tags: list[str] | None = None     # Tags to add
    attrs: dict[str, AttrValueType] | None = None  # Attributes to add
    metrics: dict[str, MetricValueType] | None = None  # Metrics to add
    del_tags: list[str] | None = None  # Tags to delete
    del_attrs: list[str] | None = None  # Attributes to delete
    del_metrics: list[str] | None = None  # Metrics to delete

TaskInput

class TaskInput:
    command: str                      # Command name (required)
    args: dict[str, Any] | None = None  # Command arguments
    priority: int = 0                 # Priority
    batch_id: str | None = None       # Batch ID

EmbeddingInput

class EmbeddingInput:
    elem_id: str                      # Element ID (required)
    vector: list[float]               # Embedding vector (required)

EmbeddingQuery

class EmbeddingQuery:
    vector: list[float]               # Query vector (required)
    k: int                            # Number of results to return (required)
    show_vector: bool = False         # Whether to return the vectors

TriggerInput

class TriggerInput:
    name: str                         # Trigger name (required)
    description: str                  # Description (required)
    condition: TriggerCondition       # Trigger condition (required)
    actions: list[TriggerAction]      # List of trigger actions (required)
    disabled: bool = False            # Whether the trigger is disabled
    display_order: float | None = None  # Display order

Common element methods

All objects that inherit from DocElement (Doc, Page, Layout, Block, Content) share the following methods:

Tag operations

def add_tag(self, tag: str) -> None
def del_tag(self, tag: str) -> None

Attribute operations

def add_attr(self, name: str, attr_input: AttrInput) -> None
def add_attrs(self, attrs: dict[str, AttrValueType]) -> None
def del_attr(self, name: str) -> None

Metric operations

def add_metric(self, name: str, metric_input: MetricInput) -> None
def del_metric(self, name: str) -> None

Combined tag/attribute/metric operation

def tagging(self, tagging_input: TaggingInput) -> None

Value operations

def try_get_value(self, key: str) -> Value | None
def get_value(self, key: str) -> Value
def find_values(
    self,
    query: dict | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> Iterable[Value]
def insert_value(self, key: str, value_input: ValueInput) -> Value

Task operations

def list_tasks(
    self,
    query: dict | None = None,
    batch_id: str | None = None,
    command: str | None = None,
    status: str | None = None,
    create_user: str | None = None,
    skip: int | None = None,
    limit: int | None = None,
) -> list[Task]
def insert_task(self, task_input: TaskInput) -> Task



