A document processing SDK for Agents and RAG

xParse Client

A Python SDK for document parsing, built for Agent and RAG workflows

SDK Installation

[!NOTE] This SDK requires Python 3.9 or later.

uv (recommended)

uv add xparse-client

pip

pip install xparse-client

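Quick Start

API Overview

| API | Purpose | Return value |
| --- | --- | --- |
| client.parse.run() | Parse a document synchronously | ParseResponse |
| client.parse.create_job() | Create an asynchronous parsing job | AsyncJobResponse |
| client.parse.get_job() | Query the status of an asynchronous job | JobStatusResponse |
| client.parse.wait_job() | Poll until an asynchronous job reaches a terminal state | JobStatusResponse |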

1. Environment setup

export TEXTIN_APP_ID="your-app-id"
export TEXTIN_SECRET_CODE="your-secret-code"

You can obtain these credentials from the TextIn developer console.
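
Before constructing a client, it can help to confirm that both variables are actually visible to the Python process. The following is a minimal sketch using only the standard library; the helper name missing_credentials is hypothetical and is not part of the SDK.

```python
import os

# The two variable names come from the export commands above.
REQUIRED_VARS = ("TEXTIN_APP_ID", "TEXTIN_SECRET_CODE")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of xparse-client.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Credentials found in the environment.")
```

Passing a plain dict instead of os.environ makes the check easy to exercise in tests.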

2. Synchronous parsing

from xparse_client import XParseClient, ParseConfig, Capabilities, Scope

client = XParseClient()

with open("document.pdf", "rb") as f:
    result = client.parse.run(
        file=f,
        filename="document.pdf",
        config=ParseConfig(
            capabilities=Capabilities(
                include_table_structure=True,
                title_tree=True,
            ),
            scope=Scope(page_range="1-10"),
        ),
    )

print(f"Parsed {len(result.elements)} elements")

# Access the markdown output
if result.markdown:
    print(result.markdown)

# Iterate over the elements
for el in result.elements:
    print(f"[{el.type}] {el.text[:80]}")
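
Once the elements are available, a common next step is to summarize them by type. The sketch below assumes only what the example above uses, namely that each element exposes type and text attributes. The stand-in objects and the type names "title" and "paragraph" are hypothetical, and the helper summarize_elements is not part of the SDK.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements, preview_chars=80):
    """Count elements per type and build one preview line per element."""
    counts = Counter(el.type for el in elements)
    # Treat a missing text as empty so slicing never fails.
    previews = [f"[{el.type}] {(el.text or '')[:preview_chars]}" for el in elements]
    return counts, previews


# Stand-in data for illustration; real elements come from result.elements.
sample = [
    SimpleNamespace(type="title", text="Annual Report"),
    SimpleNamespace(type="paragraph", text="Revenue grew in the third quarter."),
    SimpleNamespace(type="paragraph", text=None),
]

counts, previews = summarize_elements(sample)
print(dict(counts))
for line in previews:
    print(line)
```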

3. Asynchronous jobs

Use server-side asynchronous jobs when processing large files:

with open("large_document.pdf", "rb") as f:
    job = client.parse.create_job(
        file=f,
        filename="large_document.pdf",
        webhook="https://example.com/callback",  # optional
    )

print(f"Job created: {job.job_id}")

result = client.parse.wait_job(job_id=job.job_id, timeout=300.0, poll_interval=5.0)

if result.is_completed:
    # An async job returns result_url; download it separately to get the parse result
    import httpx
    resp = httpx.get(result.result_url)
    resp.raise_for_status()  # fail loudly if the download did not succeed
    print(resp.json())
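
To make the timeout and poll_interval parameters concrete, here is a generic polling loop. It is a sketch of the general technique, not the SDK's implementation of wait_job: the function poll_until_done, its parameters, and the status strings are all hypothetical.

```python
import time


def poll_until_done(fetch_status, is_terminal, timeout=300.0, poll_interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until is_terminal(status) is true or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if is_terminal(status):
            return status
        # Stop if waiting one more interval would go past the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"job did not finish within {timeout} seconds")
        sleep(poll_interval)


# Usage with a fake status source that finishes on the third check.
# The status names are made up for this illustration.
states = iter(["pending", "running", "completed"])
final = poll_until_done(
    fetch_status=lambda: next(states),
    is_terminal=lambda s: s in {"completed", "failed"},
    timeout=1.0,
    poll_interval=0.0,
)
print(final)
```

Injecting sleep and clock keeps the loop testable without real waiting.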

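Configuration

Authentication

The SDK resolves credentials automatically in this order of priority: constructor arguments > environment variables > .env file

# Option 1: environment variables + no-argument constructor (recommended)
client = XParseClient()

# Option 2: pass credentials directly
client = XParseClient(
    app_id="your-app-id",
    secret_code="your-secret-code",
)

# Option 3: .env file (requires pip install xparse-client[dotenv])
client = XParseClient()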
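
The priority order described above (constructor arguments, then environment variables, then the .env file) can be pictured as a first-match lookup. This is a minimal sketch of that idea with a hypothetical helper resolve_setting; it makes no claim about how the SDK implements it internally.

```python
def resolve_setting(name, explicit=None, environ=None, dotenv=None):
    """Return the first non-empty value by priority: explicit > environ > dotenv.

    Hypothetical helper for illustration; not part of xparse-client.
    """
    candidates = (
        explicit,
        (environ or {}).get(name),
        (dotenv or {}).get(name),
    )
    for value in candidates:
        if value:
            return value
    return None


environ = {"TEXTIN_APP_ID": "from-env"}
dotenv = {"TEXTIN_APP_ID": "from-dotenv", "TEXTIN_SECRET_CODE": "secret-from-dotenv"}

print(resolve_setting("TEXTIN_APP_ID", explicit="from-arg", environ=environ, dotenv=dotenv))
print(resolve_setting("TEXTIN_APP_ID", environ=environ, dotenv=dotenv))
print(resolve_setting("TEXTIN_SECRET_CODE", environ=environ, dotenv=dotenv))
```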

Timeouts and retries

client = XParseClient(
    timeout=120.0,      # request timeout in seconds, default 630
    max_retries=3,      # maximum number of retries, default 3
)
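
For readers unfamiliar with bounded retries, the sketch below shows what a max_retries limit typically means: one initial attempt plus at most that many further attempts. The source does not state which errors the SDK retries or how long it waits between attempts, so the retryable exception types and the exponential backoff here are assumptions, and call_with_retries is a hypothetical helper.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=0.5, sleep=time.sleep,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call func(); on a retryable error, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * (2 ** attempt))  # assumed exponential backoff


attempts = []


def flaky():
    """Fail twice with a transient error, then succeed."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = call_with_retries(flaky, max_retries=3, sleep=lambda _: None)
print(outcome, len(attempts))
```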

Custom API URL

client = XParseClient(
    server_url="https://custom-api.example.com"
)

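Custom HTTP client

You can pass in an httpx.Client to customize low-level network settings such as proxies and SSL certificates. The SDK still handles authentication, retries, and error mapping automatically: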

import httpx

http_client = httpx.Client(
    proxy="http://proxy.example.com:8080",
    verify="/path/to/custom-ca.pem",
)

client = XParseClient(
    app_id="your-app-id",
    secret_code="your-secret-code",
    http_client=http_client,
)

Resource management

with XParseClient() as client:
    result = client.parse.run(...)
    # The connection is closed automatically on exit

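Error handling

Error class hierarchy

HTTP-layer errors:

| Error class | Description |
| --- | --- |
| XParseClientError | Base error class; catches all SDK errors |
| ValidationError | Client-side parameter validation failed |
| ServerError | Server error (HTTP 5xx) |
| APIError | API request error (base class) |

Business-layer errors (HTTP 200 + business code):

| Error class | Business code | Description |
| --- | --- | --- |
| AuthenticationError | 40101/40102 | Authentication failed |
| PermissionDeniedError | 40103 | IP address is not on the allowlist |
| InsufficientBalanceError | 40003 | Insufficient balance |
| InvalidParameterError | 40004 | Invalid parameter |
| UnsupportedFileTypeError | 40301 | Unsupported file type |
| FileSizeError | 40302 | File too large (limit 500MB) |
| CorruptedFileError | 40422 | Corrupted file |
| PasswordProtectedError | 40423 | PDF requires a password |
| ServiceUnavailableError | 30203 | Service temporarily unavailable |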
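
When reading logs that contain only a numeric business code, a small lookup built from the table above can be handy. The codes and class names below are copied from that table. The helper error_name_for and its fallback to "BusinessError" for unlisted codes are assumptions for illustration, not SDK behavior.

```python
# Business code -> error class name, transcribed from the table above.
BUSINESS_ERRORS = {
    40101: "AuthenticationError",
    40102: "AuthenticationError",
    40103: "PermissionDeniedError",
    40003: "InsufficientBalanceError",
    40004: "InvalidParameterError",
    40301: "UnsupportedFileTypeError",
    40302: "FileSizeError",
    40422: "CorruptedFileError",
    40423: "PasswordProtectedError",
    30203: "ServiceUnavailableError",
}


def error_name_for(code, default="BusinessError"):
    """Look up the documented error class name for a business code.

    Hypothetical helper; the fallback value is an assumption.
    """
    return BUSINESS_ERRORS.get(code, default)


print(error_name_for(40302))
print(error_name_for(40102))
print(error_name_for(99999))
```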

Error handling example

from xparse_client.exceptions import (
    XParseClientError, BusinessError, AuthenticationError, APIError
)

try:
    with open("document.pdf", "rb") as f:
        result = client.parse.run(file=f, filename="document.pdf")
except AuthenticationError as e:
    print(f"Authentication failed: {e.message}")
except BusinessError as e:
    print(f"Business error [{e.business_code}]: {e.message}, x_request_id={e.x_request_id}")
except APIError as e:
    print(f"API error [HTTP {e.status_code}]: {e.message}, x_request_id={e.x_request_id}")
except XParseClientError as e:
    print(f"SDK error: {e.message}")
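
The order of the except clauses above matters because Python runs the first clause that matches, so more specific classes need to come before their parents. The sketch below demonstrates this with stand-in classes. The parent-child relationships shown are assumptions made for the illustration; the real classes live in xparse_client.exceptions and may be arranged differently.

```python
# Stand-in classes for illustration only; the inheritance shown is assumed.
class XParseClientError(Exception):
    pass


class APIError(XParseClientError):
    pass


class BusinessError(XParseClientError):
    pass


class AuthenticationError(BusinessError):
    pass


def classify(exc):
    """Return the label of the first except clause that matches exc."""
    try:
        raise exc
    except AuthenticationError:
        return "authentication"
    except BusinessError:
        return "business"
    except APIError:
        return "api"
    except XParseClientError:
        return "sdk"


print(classify(AuthenticationError()))
print(classify(BusinessError()))
print(classify(XParseClientError()))
```

If the XParseClientError clause came first, it would match every error in this stand-in hierarchy and the more specific clauses would never run.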

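Getting the request ID

Every API request returns an x_request_id. Providing this ID when you contact technical support helps locate the problem faster:

with open("document.pdf", "rb") as f:
    result = client.parse.run(file=f, filename="document.pdf")
print(f"x_request_id={result.x_request_id}")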

Debugging and logging

import logging
logging.getLogger("xparse_client").setLevel(logging.DEBUG)
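
Setting the level alone produces no output unless a handler is attached somewhere in the logger hierarchy. The sketch below attaches a stderr handler with timestamps using only the standard logging module. The logger name "xparse_client" is taken from the snippet above; the format string and the decision to disable propagation are choices made for this example.

```python
import logging
import sys

# Send SDK log records to stderr with a timestamped format.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)

sdk_logger = logging.getLogger("xparse_client")
sdk_logger.setLevel(logging.DEBUG)
sdk_logger.addHandler(handler)
# Avoid duplicate lines if the root logger also has handlers configured.
sdk_logger.propagate = False

sdk_logger.debug("debug logging enabled")
```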

Local development

git clone https://github.com/intsig-textin/xparse-python-client.git
cd xparse-python-client

uv sync --dev
make test
make format

Common commands

make test          # run all tests
make test-unit     # run unit tests
make test-cov      # measure code coverage
make format        # format the code
make lint          # run lint checks

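Troubleshooting

| Problem | Solution |
| --- | --- |
| AuthenticationError | Check TEXTIN_APP_ID and TEXTIN_SECRET_CODE |
| FileSizeError | Files are limited to 500MB |
| TimeoutException | Increase the timeout: XParseClient(timeout=300.0) |

License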

MIT License
