Document processing SDK for Agents and RAG

xParse Client

A next-generation Python SDK for document processing, built for Agents and RAG

SDK installation

Note: this SDK requires Python 3.9 or later.

uv (recommended)

uv add xparse-client

pip

pip install xparse-client

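Optional dependencies

Install the extras that match the connectors you use (the quotes keep shells such as zsh from treating the brackets as a glob pattern):

pip install "xparse-client[s3]"       # S3-compatible storage (AWS S3, MinIO, Alibaba Cloud OSS, etc.)
pip install "xparse-client[milvus]"   # Milvus vector database
pip install "xparse-client[qdrant]"   # Qdrant vector database
pip install "xparse-client[smb]"      # SMB file shares
pip install "xparse-client[dotenv]"   # automatic .env file loading
pip install "xparse-client[all]"      # all optional dependencies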

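Quick start

API overview

API	Purpose	Returns
`client.parse.run()`	Parse a single file	`ParseResponse`
`client.parse.create_job()`	Create an asynchronous parse job	`AsyncJobResponse`
`client.parse.get_job()`	Query the status of an asynchronous job	`JobStatusResponse`
`client.parse.wait_job()`	Poll until an asynchronous job reaches a terminal state	`JobStatusResponse`
`client.extract.run()`	Extract structured data	`ParseResponse` (with `.result`)*
`client.pipeline.run()`	Multi-stage pipeline (parse → chunk → embed)	See section 4

* The Extract API reuses the ParseResponse type. The extraction result is available through the .result field (a dict); the parsed elements are still available through .elements.

Which one to use:

Parsing only → parse.run()
Structured extraction → extract.run()
Chunking + embedding, or batch processing of multiple files → pipeline.run()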
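The selection rules above can be condensed into a small decision helper. This is only an illustrative sketch: `choose_api` and its parameter names are hypothetical and not part of the SDK; the function simply returns the name of the method the list recommends.

```python
def choose_api(needs_extraction: bool = False,
               needs_chunk_or_embed: bool = False,
               batch: bool = False) -> str:
    """Return the SDK entry point suggested by the selection rules above."""
    # Chunking, embedding, or multi-file batches all go through the pipeline.
    if needs_chunk_or_embed or batch:
        return "pipeline.run"
    # Structured extraction against a schema uses the extract API.
    if needs_extraction:
        return "extract.run"
    # Plain parsing is the default.
    return "parse.run"


print(choose_api())
print(choose_api(needs_extraction=True))
print(choose_api(batch=True))
```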

1. Environment setup

export TEXTIN_APP_ID="your-app-id"
export TEXTIN_SECRET_CODE="your-secret-code"

You can obtain these credentials from the TextIn developer console.
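Because the client picks these two variables up from the environment, it can help to fail fast with a clear message when one is missing. The following is a minimal sketch using only the standard library; `missing_credentials` is a hypothetical helper, not an SDK function.

```python
import os
from typing import Mapping

# Variable names taken from the export commands above.
REQUIRED_VARS = ("TEXTIN_APP_ID", "TEXTIN_SECRET_CODE")


def missing_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A script could call `missing_credentials()` before constructing the client and exit with the returned names if the list is not empty.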

2. Parse a document

from xparse_client import XParseClient
from xparse_client.models import ParseConfig

client = XParseClient()

with open("document.pdf", "rb") as f:
    result = client.parse.run(
        file=f,
        filename="document.pdf",
        config=ParseConfig(provider="textin")
    )

print(f"Parsed {len(result.elements)} elements")
# Main element fields: .element_id (str), .type (str), .text (str), .metadata (ElementMetadata)
# After a pipeline embed stage, elements also carry: .embeddings (list[float])
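Once elements are available, typical first steps are counting them by type and joining their text. The sketch below runs without the SDK: `FakeElement` is a stand-in that mirrors only the fields listed in the comment above, and the type values in the sample ("Title", "NarrativeText") are placeholders, not a statement of what the service returns.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Stand-in with the documented fields; not the SDK's element class."""
    element_id: str
    type: str
    text: str


def count_by_type(elements) -> Counter:
    """Count how many elements there are of each `.type`."""
    return Counter(el.type for el in elements)


def plain_text(elements, separator: str = "\n\n") -> str:
    """Join the non-empty element texts in document order."""
    return separator.join(el.text for el in elements if el.text)


sample = [
    FakeElement("e1", "Title", "Quarterly Report"),
    FakeElement("e2", "NarrativeText", "Revenue grew."),
    FakeElement("e3", "NarrativeText", ""),
]
print(count_by_type(sample))
print(plain_text(sample))
```

With a real response, the same two functions should work on `result.elements`, assuming the elements expose `.type` and `.text` as described.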

3. Extract structured data

from xparse_client.models import ExtractConfig

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Document title"},
        "author": {"type": "string", "description": "Author"},
        "date": {"type": "string", "description": "Date"}
    },
    "required": ["title", "author", "date"]
}

with open("document.pdf", "rb") as f:
    result = client.extract.run(
        file=f,
        filename="document.pdf",
        extract_config=ExtractConfig(schema=schema)
    )

print(result.result)  # structured data returned by the Extract API
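Since the schema lists required fields, a quick check of the returned dict can catch incomplete extractions early. The sketch below assumes that `result.result` is a plain dict keyed by the schema's property names, as the note in the API overview suggests. `missing_required` is a hypothetical helper, and whether the service itself enforces `required` is not stated here.

```python
def missing_required(schema: dict, result: dict) -> list[str]:
    """Return required schema keys that are absent or None in the result."""
    return [key for key in schema.get("required", [])
            if result.get(key) is None]


example_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "date": {"type": "string"},
    },
    "required": ["title", "author", "date"],
}

# Illustrative data only, not real API output.
example_result = {"title": "Annual Report", "author": None}
print(missing_required(example_schema, example_result))
```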

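4. Pipeline: multi-stage processing

pipeline.run() supports two modes and selects between them based on the arguments it receives.

Stage ordering constraints: ParseStage must come first, and EmbedStage must come after ChunkStage. Stages execute in list order.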
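These ordering rules are easy to check before a request is sent. The sketch below validates stages written in the dictionary format that `stages` also accepts (`{"type": "parse"}` and so on). `validate_stage_order` is a hypothetical helper, not part of the SDK. It additionally assumes that an embed stage requires a chunk stage to be present, which the constraint implies but does not state outright.

```python
def validate_stage_order(stages: list[dict]) -> None:
    """Raise ValueError if the stage list violates the ordering rules."""
    types = [stage["type"] for stage in stages]
    # Rule 1: the parse stage must be first.
    if not types or types[0] != "parse":
        raise ValueError("the parse stage must come first")
    # Rule 2: embed must come after chunk (assumed: chunk must exist).
    if "embed" in types:
        if "chunk" not in types or types.index("chunk") > types.index("embed"):
            raise ValueError("the embed stage must come after a chunk stage")


# A valid list passes silently.
validate_stage_order([{"type": "parse"}, {"type": "chunk"}, {"type": "embed"}])
```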

Embed providers and dimensions:

Provider	Dimension	Notes
`qwen`	1024	Qwen (Tongyi Qianwen) Embedding
`doubao`	2048	Doubao Embedding (can be reduced to 1024)

The destination's dimension parameter must match the dimension of the embedding model.
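A mismatch between the embedding dimension and the destination typically only surfaces when vectors are written, so checking it up front can save a failed batch. The sketch below encodes the table above. The helper names and the `reduce_to_1024` flag are hypothetical; how dimension reduction is actually configured for `doubao` is not covered here.

```python
# Native dimensions from the provider table above.
NATIVE_DIMENSIONS = {"qwen": 1024, "doubao": 2048}


def embedding_dimension(provider: str, reduce_to_1024: bool = False) -> int:
    """Return the vector size expected from the given embed provider."""
    if provider not in NATIVE_DIMENSIONS:
        raise ValueError(f"unknown embed provider: {provider!r}")
    # Only doubao is documented as supporting reduction to 1024.
    if reduce_to_1024 and provider == "doubao":
        return 1024
    return NATIVE_DIMENSIONS[provider]


def check_destination(provider: str, destination_dimension: int,
                      reduce_to_1024: bool = False) -> None:
    """Raise ValueError when the destination dimension does not match."""
    expected = embedding_dimension(provider, reduce_to_1024)
    if destination_dimension != expected:
        raise ValueError(
            f"{provider} produces {expected}-dimensional vectors, "
            f"but the destination expects {destination_dimension}"
        )


check_destination("qwen", 1024)  # matches, so nothing is raised
```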

Single-file mode: pass file + filename; returns a PipelineResponse:

from xparse_client.models import ParseStage, ChunkStage, EmbedStage, EmbedConfig

with open("document.pdf", "rb") as f:
    result = client.pipeline.run(
        file=f, filename="document.pdf",
        stages=[ParseStage(), ChunkStage(), EmbedStage(config=EmbedConfig(provider="qwen"))]
    )
print(f"Elements after chunking: {len(result.elements)}")

Batch mode: pass source + destination; returns a WorkflowResult:

from xparse_client.connectors import LocalSource, MilvusDestination

result = client.pipeline.run(
    source=LocalSource(directory="./docs", pattern=["*.pdf"]),
    destination=MilvusDestination(db_path="./vectors.db", collection_name="documents", dimension=1024),
    stages=[ParseStage(), ChunkStage(), EmbedStage(config=EmbedConfig(provider="qwen"))],
)
print(f"Processed: {result.success}/{result.total}")

Note: the two modes return different types. Single-file mode returns a PipelineResponse (read the output from result.elements). Batch mode returns a WorkflowResult (the output has already been written to the destination; use result.success / result.failed for statistics).

stages also accepts plain dictionaries, so you can skip the model classes:

stages = [
    {"type": "parse", "config": {"provider": "textin"}},
    {"type": "chunk", "config": {"strategy": "by_title"}},
    {"type": "embed", "config": {"provider": "qwen"}},
]
with open("doc.pdf", "rb") as f:
    result = client.pipeline.run(file=f, filename="doc.pdf", stages=stages)

For a complete example (progress callbacks, error handling, stats), see example/3_local_workflow.py.

5. Asynchronous jobs

Asynchronous job mode currently applies only to the single-file parse API. Pipeline batch mode does not support asynchronous jobs.

Use a server-side asynchronous job when processing large files:

import httpx

with open("large_document.pdf", "rb") as f:
    job = client.parse.create_job(
        file=f, filename="large_document.pdf",
        config=ParseConfig(provider="textin")
    )

print(f"Job created: {job.job_id}")

result = client.parse.wait_job(job_id=job.job_id, timeout=300.0, poll_interval=5.0)

if result.is_completed:
    # An asynchronous job returns a result_url; download it separately to get the parse result
    resp = httpx.get(result.result_url)
    resp.raise_for_status()
    print(resp.json())
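To make the `timeout` and `poll_interval` parameters concrete, here is a generic polling loop of the kind `wait_job()` presumably wraps. This is a sketch under stated assumptions, not the SDK's implementation: `wait_until_terminal`, the `get_status` callback, and the status strings are all hypothetical.

```python
import time
from typing import Callable


def wait_until_terminal(get_status: Callable[[], str],
                        timeout: float = 300.0,
                        poll_interval: float = 5.0,
                        terminal=("completed", "failed"),
                        sleep=time.sleep,
                        clock=time.monotonic) -> str:
    """Poll get_status() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        # Stop if the next poll would land past the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} s")
        sleep(poll_interval)


# Simulated job that finishes on the third poll (no real waiting).
states = iter(["pending", "running", "completed"])
print(wait_until_terminal(lambda: next(states), timeout=1.0,
                          poll_interval=0.0, sleep=lambda _: None))
```

A monotonic clock is used for the deadline so that system clock adjustments do not shorten or extend the wait.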

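Configuration

Authentication

The SDK resolves credentials in the following order of precedence: constructor arguments > environment variables > .env file

Option 1: environment variables + no-argument constructor (recommended)

export TEXTIN_APP_ID="your-app-id"
export TEXTIN_SECRET_CODE="your-secret-code"

client = XParseClient()  # reads credentials from the environment automatically

Option 2: pass credentials directly

client = XParseClient(
    app_id="your-app-id",
    secret_code="your-secret-code"
)

Option 3: .env file (loaded automatically)

After installing dotenv support (pip install "xparse-client[dotenv]"), create a .env file in the project root:

TEXTIN_APP_ID=your-app-id
TEXTIN_SECRET_CODE=your-secret-code

client = XParseClient()  # no manual load_dotenv() call needed; the SDK loads the file itself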
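The precedence described in this section (constructor arguments, then environment variables, then the .env file) can be sketched as a simple fallback chain. This illustrates the documented order only and is not the SDK's actual code; `resolve_credential` and its parameters are hypothetical.

```python
from typing import Mapping, Optional


def resolve_credential(name: str,
                       explicit: Optional[str],
                       env: Mapping[str, str],
                       dotenv: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty value in precedence order, else None."""
    # 1. A value passed to the constructor wins.
    if explicit:
        return explicit
    # 2. Otherwise fall back to the process environment.
    if env.get(name):
        return env[name]
    # 3. Finally, use the value parsed from the .env file, if any.
    return dotenv.get(name) or None


env = {"TEXTIN_APP_ID": "from-env"}
dotenv = {"TEXTIN_APP_ID": "from-dotenv", "TEXTIN_SECRET_CODE": "dotenv-secret"}
print(resolve_credential("TEXTIN_APP_ID", None, env, dotenv))
print(resolve_credential("TEXTIN_SECRET_CODE", None, env, dotenv))
```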

Sources and destinations

Type	Class	Install
Local file system	`LocalSource` / `LocalDestination`	Built in
S3-compatible storage	`S3Source` / `S3Destination`	`xparse-client[s3]`
FTP	`FtpSource`	Built in
SMB	`SmbSource`	`xparse-client[smb]`
Milvus / Zilliz Cloud	`MilvusDestination`	`xparse-client[milvus]`
Qdrant	`QdrantDestination`	`xparse-client[qdrant]`

S3-compatible storage covers AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS, Volcengine TOS, and Huawei Cloud OBS. For detailed settings, see the cloud provider configuration guide.

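Advanced configuration

Timeouts and retries

client = XParseClient(
    timeout=120.0,      # request timeout in seconds (default: 630)
    max_retries=3,      # maximum number of retries (default: 3)
)

Custom API URL

client = XParseClient(
    server_url="https://custom-api.example.com/api/xparse"
)

Custom HTTP client

import httpx

http_client = httpx.Client(
    headers={"x-custom-header": "value"},
    proxy="http://proxy.example.com:8080"
)

client = XParseClient(http_client=http_client)

Error-handling strategy in batch mode

on_error="stop" (default): stop at the first failed file and raise an exception
on_error="continue": record the failure and keep processing the remaining files without interrupting the workflow

max_retries applies only to automatic retries at the HTTP request level; it does not guarantee stage-level idempotency.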
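The difference between the two `on_error` strategies can be sketched with a plain loop. This is not the SDK's batch runner: `run_batch`, the `process` callback, and the shape of the returned statistics are hypothetical, chosen to mirror the `success` / `failed` / `total` counters mentioned for batch results.

```python
from typing import Callable, Iterable


def run_batch(files: Iterable[str],
              process: Callable[[str], None],
              on_error: str = "stop") -> dict:
    """Process files in order, applying the chosen failure strategy."""
    if on_error not in ("stop", "continue"):
        raise ValueError(f"unsupported on_error value: {on_error!r}")
    stats = {"total": 0, "success": 0, "failed": 0, "errors": {}}
    for path in files:
        stats["total"] += 1
        try:
            process(path)
        except Exception as exc:
            if on_error == "stop":
                raise  # abort the whole batch at the first failure
            # "continue": record the failure and move on.
            stats["failed"] += 1
            stats["errors"][path] = str(exc)
        else:
            stats["success"] += 1
    return stats


def fake_process(path: str) -> None:
    """Simulated processor that rejects one file."""
    if path == "b.pdf":
        raise RuntimeError("corrupted file")


print(run_batch(["a.pdf", "b.pdf", "c.pdf"], fake_process, on_error="continue"))
```

With `on_error="stop"`, the same call raises the `RuntimeError` and `c.pdf` is never attempted.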

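Resource management

XParseClient supports the context manager protocol and manages the underlying HTTP connections for you:

with XParseClient() as client:
    result = client.parse.run(...)
    # connections are closed automatically on exit

In short scripts, Jupyter notebooks, and similar settings you can omit the with block; Python's garbage collection cleans up automatically.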

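Error handling

Error class hierarchy

HTTP-layer errors (based on the HTTP status code):

Error class	Description
`XParseClientError`	Base error class; catches every SDK error
`ValidationError`	Client-side argument validation failed
`ServerError`	Server error (HTTP 5xx)
`APIError`	API request error (base class)

Business-layer errors (HTTP 200 + business code):

Error class	Business code	Description
`BusinessError`	-	Generic business error (base class)
`AuthenticationError`	40101/40102	Authentication failed
`PermissionDeniedError`	40103	IP address not on the allowlist
`RateLimitError`	40306	Rate limit exceeded
`InsufficientBalanceError`	40003	Insufficient balance
`InvalidParameterError`	40004	Invalid parameter
`UnsupportedFileTypeError`	40301	Unsupported file type
`FileSizeError`	40302	File too large
`CorruptedFileError`	40422	Corrupted file
`PasswordProtectedError`	40423	PDF requires a password
`ServiceUnavailableError`	30203	Service temporarily unavailable

The server returns only HTTP 200 or 5xx status codes. All business errors are reported as HTTP 200 plus a business code.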
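The two-layer scheme above (HTTP status first, then business code) can be expressed as a lookup. The sketch below returns error class names as strings so that it runs without the SDK installed. `classify` is a hypothetical helper, and the value of the success business code (`success_code=200`) is an assumption that this document does not confirm.

```python
from typing import Optional

# Business code -> error class name, copied from the table above.
BUSINESS_ERRORS = {
    40101: "AuthenticationError",
    40102: "AuthenticationError",
    40103: "PermissionDeniedError",
    40306: "RateLimitError",
    40003: "InsufficientBalanceError",
    40004: "InvalidParameterError",
    40301: "UnsupportedFileTypeError",
    40302: "FileSizeError",
    40422: "CorruptedFileError",
    40423: "PasswordProtectedError",
    30203: "ServiceUnavailableError",
}


def classify(http_status: int,
             business_code: Optional[int] = None,
             success_code: int = 200) -> Optional[str]:
    """Return the matching error class name, or None for a success."""
    # HTTP layer: any 5xx status is a server error.
    if 500 <= http_status <= 599:
        return "ServerError"
    # Business layer: HTTP 200 with a non-success business code.
    if business_code is None or business_code == success_code:
        return None
    # Unlisted codes fall back to the generic base class.
    return BUSINESS_ERRORS.get(business_code, "BusinessError")


print(classify(200, 40306))
print(classify(503))
```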

Error handling example

from xparse_client.exceptions import (
    XParseClientError, BusinessError, AuthenticationError, RateLimitError, APIError
)

try:
    with open("document.pdf", "rb") as f:
        result = client.parse.run(file=f, filename="document.pdf")
except AuthenticationError as e:
    print(f"Authentication failed: {e.message}; check your credentials")
except RateLimitError as e:
    print(f"Rate limited: wait {e.retry_after} seconds before retrying")
except BusinessError as e:
    print(f"Business error [{e.business_code}]: {e.message}, x_request_id={e.x_request_id}")
except APIError as e:
    print(f"API error [HTTP {e.status_code}]: {e.message}, x_request_id={e.x_request_id}")
except XParseClientError as e:
    print(f"SDK error: {e.message}")
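Rather than only printing the suggested wait, a caller can act on `retry_after`. The sketch below shows one way to do that with a stand-in exception so it runs without the SDK: `RateLimited` merely imitates the `retry_after` attribute used above, and `call_with_rate_limit_retry` is a hypothetical helper. It assumes `retry_after` is always a number of seconds.

```python
import time


class RateLimited(Exception):
    """Stand-in for the SDK's RateLimitError; carries retry_after seconds."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after} s")
        self.retry_after = retry_after


def call_with_rate_limit_retry(func, max_attempts: int = 3, sleep=time.sleep):
    """Call func(), sleeping for retry_after between rate-limited attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(exc.retry_after)
```

In real code the `except` clause would name the SDK's `RateLimitError`, and `func` would wrap the `client.parse.run(...)` call, reopening the file on each attempt.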

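Getting the request ID (x_request_id)

Every API request returns an x_request_id. Providing this ID when you contact technical support speeds up diagnosis:

import logging

logger = logging.getLogger(__name__)

try:
    with open("document.pdf", "rb") as f:
        result = client.parse.run(file=f, filename="document.pdf")
    # On success
    logger.info(f"Parse finished, x_request_id={result.x_request_id}")
except APIError as e:
    # On failure
    logger.error(f"Request failed, x_request_id={e.x_request_id}")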

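Debugging and logging

import logging

# Enable SDK debug logging
logging.getLogger("xparse_client").setLevel(logging.DEBUG)

client = XParseClient()

Level	Use	Output
`DEBUG`	Development and debugging	Detailed request/response logs
`INFO`	Normal operation	Key operation logs (default)
`WARNING`	Warnings	Hints about potential problems
`ERROR`	Errors	Error details and stack traces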
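Setting a logger's level does not by itself make DEBUG records visible: with no handler configured, Python's last-resort handler only emits WARNING and above. The sketch below attaches a stream handler to the `xparse_client` logger named above. `enable_sdk_debug_logging` is a hypothetical helper built on the standard `logging` module only, and the format string is just one reasonable choice.

```python
import logging


def enable_sdk_debug_logging(logger_name: str = "xparse_client") -> logging.Logger:
    """Set the named logger to DEBUG and make sure it has a handler."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    # Add a handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()  # writes to stderr by default
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


sdk_logger = enable_sdk_debug_logging()
```

If the application already configures the root logger (for example with `logging.basicConfig`), the extra handler is unnecessary and would print each record twice; setting the level alone is enough in that case.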

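Examples

For complete example code, see the example/ directory:

Basic API usage: parse, extract, pipeline
Server-side asynchronous jobs: asynchronous processing of large files
Local batch workflow: batch processing + progress callbacks
Production best practices: error handling, logging, custom Source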

Local development

Setup

git clone https://github.com/intsig-textin/xparse-python-client.git
cd xparse-python-client

uv sync --dev        # install development dependencies
make test            # run the tests
make format          # format the code

Project structure

xparse-client/
├── xparse_client/          # main package
│   ├── _client.py          # XParseClient entry point (lazy loading)
│   ├── _config.py          # SDK configuration class
│   ├── _http.py            # HTTP client (httpx)
│   ├── _base.py            # BaseAPI base class
│   ├── exceptions.py       # exception classes
│   ├── api/                # API modules (parse, extract, pipeline)
│   ├── models/             # data models
│   └── connectors/         # Source/Destination
├── tests/                  # tests
├── example/                # example code
├── docs/                   # documentation
└── Makefile                # development commands

Common commands

make test          # run all tests
make test-unit     # run unit tests
make test-cov      # code coverage
make format        # format the code
make lint          # lint the code
make clean         # clear caches

Contributing

Fork the repository and create a feature branch
Write code and tests; make sure make test passes and coverage is ≥ 80%
Run make format && make lint, then open a Pull Request

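Troubleshooting

Problem	Solution
`AuthenticationError`: authentication failed	Check the `TEXTIN_APP_ID` and `TEXTIN_SECRET_CODE` environment variables
`FileSizeError`: file too large	parse is limited to 500 MB and extract to 50 MB; a pipeline is limited to 50 MB when it includes extract and 500 MB otherwise
`MilvusException`: dimension mismatch	The qwen model produces 1024 dimensions and the doubao model 2048; make sure the `dimension` parameter matches
`TimeoutException`: connection timed out	Increase the timeout: `XParseClient(timeout=300.0)`, or use an asynchronous job
`RateLimitError`: rate limit exceeded	Wait `retry_after` seconds and retry, or contact support to raise your quota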
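The size limits in the table can be checked locally before uploading, which avoids a round trip that ends in `FileSizeError`. The sketch below encodes those limits. The helper names are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption, since the exact unit the service uses is not stated.

```python
MB = 1024 * 1024  # assumption: the limits use binary megabytes


def size_limit_bytes(api: str, pipeline_has_extract: bool = False) -> int:
    """Return the documented upload limit in bytes for the given API."""
    if api == "parse":
        return 500 * MB
    if api == "extract":
        return 50 * MB
    if api == "pipeline":
        # A pipeline that includes extract inherits the stricter limit.
        return (50 if pipeline_has_extract else 500) * MB
    raise ValueError(f"unknown API: {api!r}")


def exceeds_limit(size_bytes: int, api: str,
                  pipeline_has_extract: bool = False) -> bool:
    """Return True when a file of this size is over the documented limit."""
    return size_bytes > size_limit_bytes(api, pipeline_has_extract)


print(exceeds_limit(120 * MB, "extract"))
print(exceeds_limit(120 * MB, "parse"))
```

For a file on disk, `os.path.getsize(path)` gives the size to pass in.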

License

MIT License

Thank you for using xParse Client!



xparse-client 0.3.0b25
