A Python async SDK that wraps the PaddleOCR AI Studio API into a clean, type-safe interface.

paddleocr-api-python


English

A Python async SDK that wraps the PaddleOCR AI Studio API into a clean, type-safe interface. Upload a document, await the result, and get Markdown back — without touching raw HTTP.

Features

  • Async-first: built on httpx.AsyncClient and asyncio, with native context manager support.
  • Full model coverage: PaddleOCR-VL-1.6 (default), PaddleOCR-VL-1.5, PaddleOCR-VL, PP-OCRv5, and PaddleOCR.
  • Flexible input: submit by local file path, raw bytes, or remote URL.
  • Rich job control: poll real-time state, extracted page count, start/end times, and error messages.
  • Markdown export: get a clean Markdown document plus the URLs of all embedded images.
  • Fine-grained options: toggle layout detection, chart/seal/table recognition, cross-page table merging, title leveling, NMS, image orientation correction, and more.

Installation

pip install paddleocr-api-python

Dependencies: aiofiles, httpx, typing-extensions, python-dotenv.

Authentication

Get an access token from https://aistudio.baidu.com/account/accessToken.

Either pass it explicitly:

client = AistudioClient(api_key="your_token_here")

Or set it via environment variable (a .env file is loaded automatically):

AISTUDIO_ACCESS_TOKEN=your_token_here
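
The two ways of supplying a token can be illustrated without the SDK. The sketch below is only an illustration: resolve_token is a hypothetical helper, not part of paddleocr_api, and the precedence it applies (explicit argument first, then the environment) is an assumption. Only the variable name AISTUDIO_ACCESS_TOKEN comes from this README.

```python
import os
from typing import Optional

ENV_VAR = "AISTUDIO_ACCESS_TOKEN"  # variable name taken from the README


def resolve_token(api_key: Optional[str] = None) -> str:
    """Return the explicit token if given, otherwise the environment value.

    Hypothetical helper: the precedence shown here is an assumption,
    not documented SDK behavior.
    """
    token = api_key or os.environ.get(ENV_VAR)
    if not token:
        raise ValueError(f"No token found: pass api_key or set {ENV_VAR}")
    return token
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first request.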

Quick Start

import asyncio
from paddleocr_api import AistudioClient, State

async def main():
    async with AistudioClient() as client:
        job = await client.create_job(file_path="paper.pdf")

        async with job:
            while True:
                state = await job.state
                if state == State.DONE:
                    break
                if state == State.FAILED:
                    raise RuntimeError(await job.error_message)
                await asyncio.sleep(5)

            markdown = await job.markdown
            with open("output.md", "w", encoding="utf-8") as f:
                f.write(markdown.text)

asyncio.run(main())
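
The loop in the Quick Start polls forever if a job never reaches a terminal state. One possible way to bound it is sketched below. wait_for_terminal_state is a hypothetical helper, not part of the SDK; it takes any zero-argument callable that returns an awaitable state, so it can be tried without network access.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

S = TypeVar("S")


async def wait_for_terminal_state(
    get_state: Callable[[], Awaitable[S]],
    done: S,
    failed: S,
    *,
    interval: float = 5.0,
    timeout: float = 600.0,
) -> S:
    """Poll get_state() until it returns `done` or `failed`.

    Raises asyncio.TimeoutError if neither state is seen within `timeout` seconds.
    """

    async def _poll() -> S:
        while True:
            state = await get_state()
            if state in (done, failed):
                return state
            await asyncio.sleep(interval)

    # wait_for cancels the polling coroutine once the deadline passes.
    return await asyncio.wait_for(_poll(), timeout=timeout)
```

With the SDK this might be called as wait_for_terminal_state(lambda: job.state, State.DONE, State.FAILED), assuming job.state yields a fresh awaitable on each access, as the Quick Start suggests.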

Submitting Jobs

create_job accepts input in three ways:

# From a local path
await client.create_job(file_path="doc.pdf")

# From bytes already in memory
await client.create_job(file_bytes=pdf_bytes)

# From a public URL
await client.create_job(file_url="https://example.com/doc.pdf")
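
When the source type is only known at runtime, a small dispatcher can choose the keyword. This is a sketch under assumptions: job_input_kwargs is a hypothetical helper, and only the keyword names file_path, file_bytes, and file_url come from the examples above. The README does not say what happens if several are passed together, so the helper always returns exactly one.

```python
from pathlib import Path
from typing import Dict, Union

Source = Union[str, bytes, bytearray, Path]


def job_input_kwargs(source: Source) -> Dict[str, Union[str, bytes]]:
    """Map one source value to exactly one create_job keyword (hypothetical helper)."""
    if isinstance(source, (bytes, bytearray)):
        return {"file_bytes": bytes(source)}
    if isinstance(source, Path):
        return {"file_path": str(source)}
    if source.startswith(("http://", "https://")):
        return {"file_url": source}
    # Any other string is treated as a local path.
    return {"file_path": source}
```

A call would then look like await client.create_job(**job_input_kwargs(src)).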

Selecting a Model

from paddleocr_api import Model

await client.create_job(
    file_path="doc.pdf",
    model=Model.PADDLE_OCR_VL_1_6,  # default
)

Model             Notes
PaddleOCR-VL-1.6  Default. Latest vision-language model.
PaddleOCR-VL-1.5  Scheduled for retirement on 2026-06-17.
PaddleOCR-VL      Base VL model.
PP-OCRv5          Classic OCR pipeline.
PaddleOCR         Base OCR.

Optional Payload

Pass an OptionalPayload dict to fine-tune recognition behavior:

from paddleocr_api import LayoutShapeMode, PromptLabel

await client.create_job(
    file_path="doc.pdf",
    optional_payload={
        "useLayoutDetection": True,
        "useChartRecognition": True,
        "useSealRecognition": True,
        "mergeTables": True,
        "relevelTitles": True,
        "layoutShapeMode": LayoutShapeMode.AUTO,
        "repetitionPenalty": 1.0,
        "temperature": 0.0,
        "topP": 1.0,
    },
)

Key options:

Field                      Default  Purpose
useDocOrientationClassify  False    Auto-correct 0/90/180/270° rotation.
useDocUnwarping            False    Flatten warped or wrinkled pages.
useLayoutDetection         True     Region-aware parsing. Disable for single-region docs.
useChartRecognition        False    Convert charts to tables.
useSealRecognition         True     Extract seal text.
useOcrForImageBlock        False    OCR inside image regions.
mergeTables                True     Merge tables that span pages.
relevelTitles              True     Infer heading hierarchy.
repetitionPenalty          1.0      Raise to suppress repeated output.
temperature                0.0      Lower for stability, higher to reduce omissions.
topP                       1.0      Lower for more conservative output.
layoutNms                  True     Drop overlapping detection boxes.
markdownIgnoreLabels       all      Filter headers, footers, page numbers, footnotes, etc.
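
Because the payload is a plain dict, a misspelled key is easy to miss, and this README does not say whether the API rejects or ignores unknown keys. The sketch below checks names locally before submission. build_payload and DOCUMENTED_DEFAULTS are hypothetical names, and the check covers only the fields listed in the table above, so options such as layoutShapeMode would need to be added by hand.

```python
from typing import Any, Dict

# Defaults copied from the table above. markdownIgnoreLabels is left out
# because the README gives its default only as "all".
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "useDocOrientationClassify": False,
    "useDocUnwarping": False,
    "useLayoutDetection": True,
    "useChartRecognition": False,
    "useSealRecognition": True,
    "useOcrForImageBlock": False,
    "mergeTables": True,
    "relevelTitles": True,
    "repetitionPenalty": 1.0,
    "temperature": 0.0,
    "topP": 1.0,
    "layoutNms": True,
}


def build_payload(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides onto the documented defaults, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(DOCUMENTED_DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown option(s): {', '.join(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}
```

For a scanned document, build_payload(useDocOrientationClassify=True, useDocUnwarping=True) would produce a dict that could be passed as optional_payload.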

Tracking a Job

async with job:
    print(await job.state)              # State.PENDING / RUNNING / DONE / FAILED
    print(await job.total_pages)        # e.g. 8
    print(await job.extracted_pages)    # e.g. 3
    print(await job.start_time)         # datetime
    print(await job.end_time)           # datetime
    print(await job.error_message)      # str or None

Status queries are cached for status_update_interval seconds (default 2) to avoid hammering the API.
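
The effect of such a cache can be shown with a small time-based wrapper. This is not the SDK's implementation, which is not shown in this README. TimedCache is a hypothetical, synchronous stand-in that reuses a fetched value until the interval has elapsed.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TimedCache(Generic[T]):
    """Reuse a fetched value until `ttl` seconds have passed (illustrative only)."""

    def __init__(
        self,
        fetch: Callable[[], T],
        ttl: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        # Refetch on first use or once the cached value is older than ttl.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value  # type: ignore[return-value]
```

If the SDK behaves similarly, polling more often than status_update_interval would return the same state repeatedly rather than fresher data.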

Working with Results

result = await job.result          # full Result object
markdown = await job.markdown      # Markdown(text=..., images=...)

# Save Markdown
with open("doc.md", "w", encoding="utf-8") as f:
    f.write(markdown.text)

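# Download embedded images (saved relative to the current directory)
from pathlib import Path

import httpx

async with httpx.AsyncClient() as http:
    for rel_path, url in markdown.images.items():
        response = await http.get(url)
        response.raise_for_status()  # fail loudly on HTTP errors
        target = Path(rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(response.content)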

The Result object also exposes per-page layout details via layout_parsing_results, raw page sizes via data_info, and preprocessed image URLs via preprocessed_images.
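
When saving downloaded images, it may be worth checking that each key in markdown.images stays inside the chosen output directory. The sketch below assumes the keys are relative paths, which is inferred only from the rel_path variable name used above. image_target is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path


def image_target(output_dir: Path, rel_path: str) -> Path:
    """Resolve rel_path under output_dir, rejecting paths that escape it."""
    base = output_dir.resolve()
    target = (base / rel_path).resolve()
    # target must be strictly inside base; "..", absolute paths, and "" fail.
    if base not in target.parents:
        raise ValueError(f"Refusing to write outside {base}: {rel_path!r}")
    return target
```

The returned path can be used in place of a bare Path(rel_path) before creating parent directories and writing the bytes.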

Error Handling

All exceptions inherit from PaddleOCRError:

  • AistudioClientError — client configuration issues (e.g. missing token).
  • JobCreationError — failure when submitting a job.
  • JobStatusQueryError — failure when polling status.

Use job.query_status_safe() instead of query_status() to get the cached state on failure rather than raising.
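
The fallback behavior described for query_status_safe() can be pictured with the pattern below. It is a sketch, not the SDK's code: StatusQueryFailed is a stand-in for JobStatusQueryError, and SafeStatus is a hypothetical wrapper that keeps the last good value when a query raises.

```python
from typing import Awaitable, Callable, Optional


class StatusQueryFailed(Exception):
    """Stand-in for the SDK's JobStatusQueryError (hypothetical class)."""


class SafeStatus:
    """Remember the last good status and return it when a query fails."""

    def __init__(self, query: Callable[[], Awaitable[str]]) -> None:
        self._query = query
        self.last_good: Optional[str] = None

    async def query_safe(self) -> Optional[str]:
        try:
            self.last_good = await self._query()
        except StatusQueryFailed:
            pass  # keep the previous value instead of propagating
        return self.last_good
```

A transient network error in the middle of a polling loop then leaves the loop running on the previous state instead of aborting the job.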

License

Apache-2.0



Download files

Source Distribution

paddleocr_api_python-0.0.2.tar.gz (22.9 kB view details)

Uploaded Source

Built Distribution

paddleocr_api_python-0.0.2-py3-none-any.whl (21.8 kB view details)

Uploaded Python 3

File details

Details for the file paddleocr_api_python-0.0.2.tar.gz.

File metadata

  • Download URL: paddleocr_api_python-0.0.2.tar.gz
  • Upload date:
  • Size: 22.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for paddleocr_api_python-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       00ca1d1f64eb98e4c6d7bfcd0d40d4ddb83b3485a4ca83807098b5b1e89b134b
MD5          ef888800e499d6d49d41a00837cd6494
BLAKE2b-256  99da13df9edf1b9b16c11bc41854dbd899c13cc49f1aaf86b1540bd0eae670f9

File details

Details for the file paddleocr_api_python-0.0.2-py3-none-any.whl.

File hashes

Hashes for paddleocr_api_python-0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       7e6b37b2992440e631d2f3d1072df117fb9aa10f9480b519bbd7d203858bb43a
MD5          8e49f7c1366bf56be98c1acf91c7d67f
BLAKE2b-256  9bc839f238c0fbde39246270db7223af932aca2d5534c4db855a6e998ce4bf55
