
memect-ppx is a PDF / image document parsing tool that converts input files into structured Markdown and JSON. It supports a local model (the default) as well as multiple LLM backends (DeepSeek, PaddleOCR, GLM, etc.), and is built for high-accuracy document-understanding scenarios.


PPX — High-Accuracy PDF & Image Parser




Convert PDF and images to structured Markdown / JSON — locally, accurately, production-ready.

PPX is a source-available document parsing engine built for high-fidelity extraction of text, tables, figures, formulas, and layout from PDFs and images. It ships with a built-in OCR + layout pipeline and optionally offloads recognition to state-of-the-art LLM backends (DeepSeek-OCR, PaddleOCR-VL, GLM-OCR).

  • What output do I get? — Markdown and JSON; every object carries page coordinates.
  • Do I need a GPU? — No. The default backend runs on CPU. GPU (CUDA) is optional for throughput.
  • Does it handle scanned PDFs? — Yes. OCR is applied automatically when native text is absent.
  • Can I use my own LLM? — Yes. Any OpenAI-compatible endpoint is accepted via --backend.
  • Is it embeddable? — Free for personal, research, and noncommercial use. For commercial use, contact contact@memect.co.

Install

# Requires Python >= 3.12
$ uv venv -p 3.12
$ source .venv/bin/activate

# Upgrade an existing install
$ uv pip install --upgrade memect-ppx
# In a pre-existing environment, remove conflicting packages first:
# $ uv pip uninstall opencv-python opencv-contrib-python opencv-python-headless opencv-contrib-python-headless
# $ uv pip uninstall onnxruntime onnxruntime-gpu

# CPU version
$ uv pip install memect-ppx
$ uv pip install onnxruntime --no-config
# or opencv-contrib-python-headless
$ uv pip install opencv-contrib-python --no-config

# GPU version
# Install the CUDA libraries onnxruntime-gpu depends on. If they are already installed
# system-wide, this step can be skipped; the versions must match what onnxruntime-gpu requires.
$ uv pip install 'memect-ppx[cuda]'
$ uv pip install onnxruntime-gpu --no-config
# or opencv-contrib-python-headless
$ uv pip install opencv-contrib-python --no-config
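After installing, a quick standard-library check can confirm that the expected modules resolve (a sketch; the memect module name comes from the Python API section below, while onnxruntime and cv2 are the runtime dependencies installed above):

```python
from importlib.util import find_spec

def check_install(names=("memect", "onnxruntime", "cv2")):
    """Report which of the expected modules can actually be imported."""
    return {name: find_spec(name) is not None for name in names}

for name, ok in check_install().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A MISSING entry usually means the corresponding `uv pip install` step above was skipped or landed in a different environment.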


# Install method 2: from source
$ git clone 
$ cd ppx
$ uv venv -p 3.12
# On older systems (<= Ubuntu 20.04), add:
# --no-install-package pdf-oxide
$ uv sync --no-install-project
# For GPU use; skip if the CUDA libraries are already installed system-wide,
# or install a different version as needed:
$ uv sync --extra cuda --no-install-project

# These two must be installed manually:
# or opencv-contrib-python-headless
$ uv pip install opencv-contrib-python --no-config
# or onnxruntime-gpu
$ uv pip install onnxruntime --no-config


# Command usage:
# If installed as a package, use: ppx
# If running from a cloned repo, use: ./ppx

# Default parsing
$ ppx parse a.pdf

# LLM-based parsing: just point at the service URL. Currently only models such as
# deepseek-ocr, paddleocr-vl, and glm-ocr are supported.
$ ppx parse a.pdf --llm http://127.0.0.1:4000/v1
# If the model name does not contain deepseek, paddle, or glm, specify it explicitly:
$ ppx parse a.pdf --llm '{"name":"deepseek","base_url":"http://127.0.0.1:4000/v1","model":"xxxx","api_key":""}'

# For frequent use, the settings can go in a config file
$ mkdir conf
# Either a JSON file or a Python file defining: settings={}
# See src/memect/conf/settings.custom.py for the syntax
$ vi conf/settings.py
$ vi conf/log.py

# Once the paths and models are set in the config file, they no longer need to be
# specified on the command line
$ ppx parse a.pdf --backend deepseek

PPX uses pipeline mode by default. When -o output/ is provided, the parsed Markdown is typically written to output/doc.md.


What Problems Does This Solve?

| Problem | How PPX Handles It |
| --- | --- |
| Native-text PDF with invisible/garbled characters | Detects encoding anomalies; falls back to OCR per page |
| Scanned document with no embedded text | Full-page OCR or vLLM backend |
| Complex table spanning multiple columns/rows | LLM-based structural parsing, colspan/rowspan preserved |
| Math-heavy academic paper | LaTeX formula extraction |
| Batch processing thousands of files | Directory-level parse dir/ with -o output/ |

Example Outputs

Mixed table content

This example shows a mixed table scenario where the table body contains editable text, while much of the header area is still image-based.

Input snippet:

Input snippet

Markdown output:

Markdown output

JSON output:

JSON output

Scanned English table

This example shows a scanned English table parsing result.

Markdown output:

Scanned table Markdown output

JSON output:

Scanned table JSON output


Benchmarks

See docs/BENCHMARKS.md for benchmark results, citation, attribution, and compliance notes.


Capability Matrix

Capability Default (Local) DeepSeek-OCR PaddleOCR-VL GLM-OCR
Text extraction
Per-character coordinates
Table structure (colspan / rowspan)
Formula → LaTeX
Figure region extraction
CPU-only mode
CUDA acceleration
No external service required

Which Backend Should I Use?

| Scenario | Recommended Backend |
| --- | --- |
| Privacy-sensitive documents, air-gapped environment | default |
| Highest accuracy on complex layouts | deepseek |
| Good accuracy, lighter GPU footprint (~10 GB) | paddle |
| Fast inference with speculative decoding | glm |
| Quick integration test / CI pipeline | default (CPU) |

Quick Start

Default pipeline mode

ppx parse <input_path> -o <output_path>

# Example
ppx parse report.pdf -o output/

Parse a single file

# Auto-detect whether OCR is needed
ppx parse report.pdf

# Force OCR on every page
ppx parse report.pdf --ocr yes

# Skip OCR entirely
ppx parse report.pdf --ocr no

# Parse an image
ppx parse scan.png

Batch processing

# Parse all PDFs and images in a directory
ppx parse docs/

# Write output to a specific directory
ppx parse docs/ -o output/

Use an LLM backend

# DeepSeek-OCR (requires ~20 GB VRAM via vLLM)
ppx parse report.pdf --backend deepseek \
  --deepseek '{"base_url":"http://127.0.0.1:4000/v1","model":"deepseek-ocr-2","api_key":""}'

# PaddleOCR-VL (requires ~10 GB VRAM)
ppx parse report.pdf --backend paddle \
  --paddle '{"base_url":"http://127.0.0.1:4001/v1","model":"paddleocr-vl","api_key":""}'

# GLM-OCR (requires ~10 GB VRAM)
ppx parse report.pdf --backend glm \
  --glm '{"base_url":"http://127.0.0.1:4002/v1","model":"glmocr","api_key":""}'

Persist configuration

Tired of typing the same flags? Drop a config file:

mkdir conf
# conf/settings.py  (Python dict) or conf/settings.json
# Reference: src/memect/conf/settings.custom.py
# conf/settings.py
settings = {
    "pdf_parser.deepseek.model.base_url": "http://127.0.0.1:4000/v1",
    "pdf_parser.paddle.model.base_url": "http://127.0.0.1:4001/v1",
    "pdf_parser.glm.model.base_url": "http://127.0.0.1:4002/v1",
}

Now just run:

ppx parse report.pdf --backend deepseek
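The dotted keys in the settings dict suggest a flat-to-nested mapping. How such keys could be expanded into a nested configuration tree is sketched below (an illustration only; the real loader lives under src/memect/conf and may behave differently):

```python
def nest(flat):
    """Expand flat dotted keys into a nested dict:
    {'a.b.c': 1} -> {'a': {'b': {'c': 1}}}"""
    tree = {}
    for dotted, value in flat.items():
        node = tree
        *parents, leaf = dotted.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree

settings = {
    "pdf_parser.deepseek.model.base_url": "http://127.0.0.1:4000/v1",
    "pdf_parser.paddle.model.base_url": "http://127.0.0.1:4001/v1",
}
tree = nest(settings)
print(tree["pdf_parser"]["deepseek"]["model"]["base_url"])
```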

Use from Python

PPX can be used directly as a library. If you call it repeatedly, a single global Parser instance is usually enough.

from memect.pdf.parser import Parser
from memect.pdf.base import KDocument, KDocumentFactory

# If you call it repeatedly, a single global parser is usually enough.
# If no arguments are passed, the default settings are used.
with Parser() as parser:
    doc = KDocument("/path/your.pdf")
    parser.parse(doc)

# Batch parsing with multiprocessing and default settings.
doc = KDocumentFactory("/path/your.pdf", params=None)
docs = [doc]
Parser.batch(docs, max_workers=1)
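Since a single global Parser instance is usually enough for repeated calls, one common pattern is a lazily created, cached instance. The sketch below uses a stand-in Parser class so it runs standalone; in real use, replace it with memect.pdf.parser.Parser from the snippet above:

```python
from functools import lru_cache

# Stand-in for memect.pdf.parser.Parser, so this pattern runs standalone.
class Parser:
    instances = 0
    def __init__(self):
        Parser.instances += 1
    def parse(self, doc):
        return doc

@lru_cache(maxsize=1)
def get_parser():
    """Create the Parser once and reuse it across all calls."""
    return Parser()

def parse_file(path):
    return get_parser().parse(path)

parse_file("a.pdf")
parse_file("b.pdf")
print(Parser.instances)  # -> 1: both calls share one Parser
```

lru_cache(maxsize=1) on a zero-argument factory is a lightweight alternative to a module-level global and keeps construction lazy.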

CLI Reference

ppx parse <path> [OPTIONS]

Arguments:
  path          PDF file, image file, or directory

Options:
  --backend     default | deepseek | paddle | glm   (default: default)
  --ocr         yes | no | auto                      (default: auto)
  --table       no | ybk | wbk | auto | llm          (default: auto)
  --pages       Page range, e.g. "1-5,10"
  --mode        page | tree                    (default: page)
  -o, --output  Output directory

Other subcommands:

ppx start               Launch HTTP API server

Output Format

Each parsed document is written to <input>.out/:

report.pdf.out/
├── doc.md          # full document in Markdown
├── doc.json        # full structured data with per-object coordinates
├── pages/          # per-page breakdown (one entry per page)
└── images/         # extracted figures/images (present when figures are detected)
| Path | Description |
| --- | --- |
| doc.md | Markdown with figure references |
| doc.json | JSON tree: document → pages → objects, each with bounding-box coordinates |
| pages/ | Per-page Markdown and JSON, useful for page-level processing |
| images/ | Extracted image regions; only present when the document contains figures |
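For post-processing, doc.json can be traversed page by page. A sketch, assuming the document → pages → objects nesting described above (the exact key names "pages", "objects", and "bbox" are assumptions and should be checked against a real output file):

```python
import json
from pathlib import Path

def iter_objects(doc_json_path):
    """Yield (page_number, object) pairs from a PPX doc.json file.
    Key names ('pages', 'objects') are assumed; verify against real output."""
    data = json.loads(Path(doc_json_path).read_text(encoding="utf-8"))
    for page_no, page in enumerate(data.get("pages", []), start=1):
        for obj in page.get("objects", []):
            yield page_no, obj

# Example: list every object's bounding box
# for page_no, obj in iter_objects("report.pdf.out/doc.json"):
#     print(page_no, obj.get("bbox"))
```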

Platform Support

| Platform | Python | CPU | CUDA | Notes |
| --- | --- | --- | --- | --- |
| Linux | >= 3.12 | Yes | Yes | Recommended for production |
| macOS (Apple Silicon) | >= 3.12 | Yes | No | |
| macOS (Intel) | 3.12 – 3.13 | Yes | No | Capped by OpenVINO |
| Windows | >= 3.12 | Yes | Yes | Community-tested |

CUDA requires NVIDIA driver + CUDA 12.x and onnxruntime-gpu built for that CUDA version.


Launching LLM Services

PPX LLM backends are served via vLLM.

# Common environment variables; export these before the commands below
export CUDA_VISIBLE_DEVICES=0
# In mainland China, ModelScope is recommended. The model IDs below are ModelScope IDs;
# the corresponding HuggingFace IDs may differ.
export VLLM_USE_MODELSCOPE=True

DeepSeek-OCR-2 (~20 GB VRAM)

ModelScope — note: vllm==0.19.1 produces garbled output; use a newer version.

vllm serve deepseek-ai/DeepSeek-OCR-2 \
  --served-model-name deepseek-ocr-2 \
  --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8 \
  --port 4000

PaddleOCR-VL / PaddleOCR-VL-1.5 (~10 GB VRAM)

ModelScope PaddleOCR-VL · PaddleOCR-VL-1.5

# PaddleOCR-VL
vllm serve PaddlePaddle/PaddleOCR-VL \
  --served-model-name paddleocr-vl \
  --trust-remote-code \
  --max-num-batched-tokens 16384 \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0 \
  --gpu-memory-utilization 0.5 \
  --port 4001

# PaddleOCR-VL-1.5 (same model name and port — config unchanged)
vllm serve PaddlePaddle/PaddleOCR-VL-1.5 \
  --served-model-name paddleocr-vl \
  --trust-remote-code \
  --max-num-batched-tokens 16384 \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0 \
  --gpu-memory-utilization 0.5 \
  --port 4001

GLM-OCR (~10 GB VRAM)

ModelScope

vllm serve ZhipuAI/GLM-OCR \
  --served-model-name glmocr \
  --max-num-batched-tokens 16384 \
  --max-model-len 16384 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --gpu-memory-utilization 0.5 \
  --port 4002
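All three services expose an OpenAI-compatible endpoint, which is what the base_url values above point at. Building a chat-completions request with an inline image against such an endpoint might look like this (a sketch of the wire format; the prompt and image handling PPX actually sends are internal to the tool):

```python
import base64

def ocr_request_body(model, image_path, prompt="Convert this page to Markdown."):
    """Build an OpenAI-compatible chat-completions payload with an inline image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }

# POST this as JSON to e.g. http://127.0.0.1:4000/v1/chat/completions
```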

FAQ

Does PPX support password-protected PDFs?

Not currently. Strip the password with a tool like qpdf before passing the file to PPX.

How do I resolve OpenCV version conflicts?

Uninstall all existing opencv variants first, then reinstall:

uv pip uninstall opencv-python opencv-contrib-python \
                  opencv-python-headless opencv-contrib-python-headless
uv pip install opencv-contrib-python --no-config
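To see which variants are currently installed before uninstalling, the package metadata can be queried with the standard library (a sketch; finding more than one entry means a conflict):

```python
from importlib import metadata

OPENCV_DISTS = (
    "opencv-python",
    "opencv-contrib-python",
    "opencv-python-headless",
    "opencv-contrib-python-headless",
)

def installed_opencv():
    """Return (name, version) for every installed OpenCV distribution."""
    found = []
    for name in OPENCV_DISTS:
        try:
            found.append((name, metadata.version(name)))
        except metadata.PackageNotFoundError:
            pass
    return found

print(installed_opencv())
```

The same check, with "onnxruntime" and "onnxruntime-gpu" in place of the OpenCV names, answers the coexistence question below.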

ImportError: libGL.so.1 on Linux servers

Install the headless OpenCV variant instead:

uv pip install opencv-python-headless

Or install the system library: sudo apt-get install -y libgl1

Can onnxruntime and onnxruntime-gpu coexist?

No. Install exactly one. The GPU variant must match your system's CUDA version.

Can I use PPX on Mac with GPU acceleration?

No. Neither Apple Silicon nor Intel Macs support CUDA. The CPU backend works on both.

Can I embed PPX in a commercial product?

Not under the default license. PPX is free for personal, research, and noncommercial use. For commercial use, contact contact@memect.co.

How do I parse only specific pages?

ppx parse report.pdf --pages "1-5,10,15-20"
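The --pages syntax combines comma-separated single pages and dash ranges. Expanding such a spec is straightforward (an illustration of the flag's syntax, not PPX's internal code):

```python
def parse_pages(spec):
    """Expand a page spec like '1-5,10,15-20' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

print(parse_pages("1-5,10,15-20"))
# [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```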

Product Experience

Web experience for pdf2x: https://pdf2x.cn/

Apply for a free API key to call the API.

Mini Program experience:

pdf2x Mini Program code


Contributing

We welcome bug reports, feature requests, and pull requests.

  1. Fork the repository and create a feature branch.
  2. Run tests: uv run pytest
  3. Submit a PR — please describe the motivation and include test cases.

See CONTRIBUTING.md for full guidelines.


License

PPX is released under the PolyForm Noncommercial License 1.0.0.

PPX is free for personal, research, and noncommercial use. For commercial use, contact contact@memect.co.

For bundled third-party code and assets, see NOTICE and docs/THIRD_PARTY_LICENSES.md. Those files document attribution and redistribution review items for vendored components and bundled resources shipped with this repository.
