The fastest PDF text and image extraction engine. Rust core + Python bindings.

fastpdf

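The world's fastest PDF text and image extraction engine.

Rust core + Python bindings. Output uses a blocks / images structure compatible with PyMuPDF.

Features

  • Extreme performance: zero-copy throughout (mmap), SIMD byte scanning (memchr), fast float parsing (fast-float)
  • No loss of information: a complete text extraction pipeline, including CMap, Type0 composite fonts, and Form XObject recursion
  • Parallel processing: rayon page-level parallelism + file-level parallelism + asynchronous read-ahead
  • Robust error tolerance: when the xref is corrupted, automatically falls back to a full-file memchr scan to recover
  • PyMuPDF compatible: output structure identical to PyMuPDF's, zero migration cost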
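
The xref recovery mentioned in the feature list can be illustrated with a small pure-Python sketch. It assumes recovery works by scanning the whole file for `N G obj` headers and rebuilding an offset table from them. It uses Python's `re` module rather than memchr, and the name `scan_object_offsets` is hypothetical; this is not fastpdf's actual code.

```python
import re

# Matches an indirect-object header such as b"12 0 obj".
# The lookbehind keeps the match from starting in the middle of a number.
_OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def scan_object_offsets(data: bytes) -> dict:
    """Map (object number, generation) to the byte offset of its header.

    Later definitions overwrite earlier ones, mirroring how incremental
    updates appended to a PDF supersede older objects.
    """
    offsets = {}
    for match in _OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        offsets[key] = match.start()
    return offsets


# Tiny hand-written sample with two objects and no xref table at all.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
recovered = scan_object_offsets(sample)
```

A real recovery pass would also need to locate the trailer or catalog and skip matches that fall inside stream data; the sketch only covers the scanning step.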

Installation

pip install fastpdf

Build from source:

# Requires the Rust toolchain (https://rustup.rs)
git clone https://github.com/yourname/fastpdf.git
cd fastpdf
pip install maturin
maturin develop --release

Quick start

Python

import fastpdf

# Single-document extraction
blocks, images = fastpdf.extract("document.pdf")

for block in blocks:
    for line in block["lines"]:
        for span in line["spans"]:
            print(f"[{span['font']} {span['size']:.0f}] {span['text']}")

for img in images:
    print(f"Image: {img['width']}x{img['height']} {img['ext']}")
    # img['image'] holds the raw bytes (JPEG/PNG)

# Batch extraction (file-level parallelism)
for path, blocks, images in fastpdf.extract_many(
    ["a.pdf", "b.pdf", "c.pdf"],
    file_parallel=True,
    include_images=False
):
    print(f"{path}: {len(blocks)} blocks")

Rust

use fastpdf_core::{extract, ExtractOptions};

let options = ExtractOptions {
    page_parallel: true,
    include_images: true,
    batch_size: 50,
    ..Default::default()
};

let result = extract("document.pdf", &options)?;

for page in &result.pages {
    for block in &page.blocks {
        for line in &block.lines {
            for span in &line.spans {
                println!("[{} {:.0}] {}", span.font, span.size, span.text);
            }
        }
    }
}

API reference

fastpdf.extract(path, **options)

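Extracts text and images from a single PDF file.

Parameters:

Parameter       Type  Default   Description
path            str   required  Path to the PDF file
page_parallel   bool  True      Page-level parallelism (multi-core speedup)
include_images  bool  True      Whether to extract image data
gpu             bool  False     GPU acceleration (requires an NVIDIA GPU)
batch_size      int   50        Batch size for large documents (0 = no batching)

Returns: (blocks, images)

blocks structure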

[
    {
        "type": 0,                    # 0 = text block
        "bbox": (x0, y0, x1, y1),    # block bounding box
        "lines": [
            {
                "bbox": (x0, y0, x1, y1),
                "spans": [
                    {
                        "bbox": (x0, y0, x1, y1),
                        "text": "Hello World",
                        "font": "Helvetica",
                        "size": 12.0,
                        "color": 0,
                    }
                ]
            }
        ]
    }
]
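
As a minimal sketch of how this nested structure can be consumed, the helper below flattens blocks into plain text. The function name `blocks_to_text` and the sample data are illustrative assumptions, not part of the fastpdf API; the code relies only on the keys documented above.

```python
def blocks_to_text(blocks):
    """Join span text within each line, lines within each block,
    and separate blocks with a blank line. Non-text blocks are skipped."""
    paragraphs = []
    for block in blocks:
        if block.get("type") != 0:  # 0 = text block in the structure above
            continue
        lines = [
            "".join(span["text"] for span in line["spans"])
            for line in block["lines"]
        ]
        paragraphs.append("\n".join(lines))
    return "\n\n".join(paragraphs)


# Hand-written sample that follows the documented layout.
sample_blocks = [
    {
        "type": 0,
        "bbox": (72.0, 72.0, 200.0, 100.0),
        "lines": [
            {
                "bbox": (72.0, 72.0, 200.0, 86.0),
                "spans": [
                    {"bbox": (72.0, 72.0, 110.0, 86.0), "text": "Hello ",
                     "font": "Helvetica", "size": 12.0, "color": 0},
                    {"bbox": (110.0, 72.0, 200.0, 86.0), "text": "World",
                     "font": "Helvetica", "size": 12.0, "color": 0},
                ],
            },
            {
                "bbox": (72.0, 86.0, 200.0, 100.0),
                "spans": [
                    {"bbox": (72.0, 86.0, 200.0, 100.0), "text": "Second line",
                     "font": "Helvetica", "size": 12.0, "color": 0},
                ],
            },
        ],
    }
]
text = blocks_to_text(sample_blocks)
```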

images structure

[
    {
        "bbox": (x0, y0, x1, y1),    # position on the page
        "width": 1920,                # width in pixels
        "height": 1080,               # height in pixels
        "bpc": 8,                     # bits per component
        "colorspace": "DeviceRGB",    # color space
        "xref": 42,                   # object number
        "ext": "jpeg",                # format: jpeg/png/jpx
        "image": b"\xff\xd8\xff...",   # raw bytes (None if include_images=False)
    }
]
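
The image entries can be written straight to disk, since `image` holds the encoded bytes and `ext` the format. The sketch below assumes only the keys documented above; `save_images` and its file-naming scheme are hypothetical and not provided by fastpdf.

```python
import tempfile
from pathlib import Path


def save_images(images, out_dir):
    """Write each image's raw bytes to out_dir and return the paths written.

    Entries whose "image" is None (metadata only, as produced with
    include_images=False) are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, img in enumerate(images):
        data = img.get("image")
        if data is None:
            continue
        path = out_dir / f"img{index:04d}_xref{img['xref']}.{img['ext']}"
        path.write_bytes(data)
        written.append(path)
    return written


# Hand-written sample: one entry with bytes, one with metadata only.
sample_images = [
    {"bbox": (0.0, 0.0, 10.0, 10.0), "width": 1, "height": 1, "bpc": 8,
     "colorspace": "DeviceRGB", "xref": 42, "ext": "jpeg",
     "image": b"\xff\xd8\xff\xd9"},
    {"bbox": (0.0, 0.0, 10.0, 10.0), "width": 1, "height": 1, "bpc": 8,
     "colorspace": "DeviceGray", "xref": 43, "ext": "png",
     "image": None},
]
saved = save_images(sample_images, tempfile.mkdtemp())
```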

fastpdf.extract_many(paths, **options)

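Extracts multiple PDF files in a batch.

Parameters:

Parameter       Type       Default   Description
paths           list[str]  required  List of PDF file paths
file_parallel   bool       True      File-level parallelism
page_parallel   bool       False     Page-level parallelism (best left off when file_parallel is enabled)
include_images  bool       False     Whether to extract images
gpu             bool       False     GPU acceleration
batch_size      int        50        Batch size for large documents

Returns: [(path, blocks, images), ...]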
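
Because each result is a `(path, blocks, images)` tuple, a batch can be summarized with ordinary iteration. The helper below is a sketch over hand-written sample data shaped like that return value; `summarize` is a hypothetical name, not a fastpdf function.

```python
def summarize(results):
    """Return {path: (block_count, span_count, image_count)}."""
    summary = {}
    for path, blocks, images in results:
        span_count = sum(
            len(line["spans"])
            for block in blocks
            for line in block.get("lines", [])
        )
        summary[path] = (len(blocks), span_count, len(images))
    return summary


# Sample data in the documented shape (abbreviated to the keys used here).
sample_results = [
    ("a.pdf",
     [{"type": 0, "lines": [{"spans": [{"text": "Hi "}, {"text": "there"}]}]}],
     []),
    ("b.pdf",
     [],
     [{"xref": 7, "ext": "png", "image": None}]),
]
report = summarize(sample_results)
```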

Architecture

See the API documentation for the complete API reference. Design documents: DESIGN_V1, DESIGN_V2.

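PDF file
  │
  ├─ mmap mapping (zero-copy)
  │
  ├─ Custom parser (~800 lines)
  │   ├─ Object parsing (recursive descent)
  │   ├─ xref table/stream/ObjStm
  │   └─ memchr fallback (recovery from a corrupted xref)
  │
  ├─ Content stream state machine
  │   ├─ BT/ET text blocks
  │   ├─ Tj/TJ text operators
  │   ├─ Td/TD/Tm matrix transforms
  │   ├─ Form XObject recursion (depth 3)
  │   └─ Do image capture
  │
  ├─ Font handling
  │   ├─ CMap parsing (bfchar/bfrange)
  │   ├─ Type0 composite fonts (CIDFont)
  │   ├─ Encoding Differences
  │   └─ Adobe Glyph List
  │
  ├─ Layout analysis
  │   └─ chars → spans → lines → blocks
  │
  ├─ Image extraction
  │   ├─ JPEG/JPX zero-copy (mmap slices)
  │   ├─ FlateDecode lazy PNG
  │   └─ Four-corner transform bbox
  │
  └─ Parallel scheduling
      ├─ rayon page-level parallelism
      ├─ File-level parallelism
      ├─ Asynchronous read-ahead
      └─ Automatic batching for large documents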
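
The "four-corner transform bbox" step in the diagram can be sketched as follows. In PDF, an image is painted into the unit square mapped through the current transformation matrix `[a b c d e f]`, so a bounding box follows from transforming the four corners and taking the extremes. The function name `image_bbox` is hypothetical, and the result is in PDF user space (origin at the bottom left); any flip to a top-left origin that fastpdf may apply is not shown.

```python
def image_bbox(ctm):
    """Bounding box of the unit square after applying a PDF matrix.

    ctm is (a, b, c, d, e, f); a point (x, y) maps to
    (a*x + c*y + e, b*x + d*y + f).
    """
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Axis-aligned placement: 200 wide, 100 high, lower-left corner at (50, 60).
upright = image_bbox((200, 0, 0, 100, 50, 60))

# The same kind of image rotated by 90 degrees.
rotated = image_bbox((0, 100, -200, 0, 300, 50))
```

Transforming all four corners, rather than only two opposite ones, is what keeps the box correct for rotated or sheared images.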

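Performance targets

Scenario                                Target                                Actual
Text extraction                         ≥ 2x PyMuPDF                          ~22x
Word overlap rate                       ≥ 95%                                 96.9% (regex-based)
Image metadata (offsets only)           ≥ 50x PyMuPDF                         N/A¹
Image byte extraction (incl. decoding)  ≥ 5x PyMuPDF                          11.2x
Multi-file throughput                   Near-linear scaling with core count   ~3.2x (10 files)

¹ fastpdf's extract() is a single all-in-one call (text + images), so the time spent on image metadata extraction alone cannot be measured separately.

See the performance benchmark report for details.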
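
The table does not define how the regex-based word overlap rate is computed. One plausible definition, shown here purely as an assumption, tokenizes both outputs with `\w+` and reports the share of reference words that also appear in the candidate, counting duplicates. The name `word_overlap` is hypothetical and the benchmark's actual metric may differ.

```python
import re
from collections import Counter


def _tokens(text):
    """Lower-cased multiset of \\w+ tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def word_overlap(candidate, reference):
    """Fraction of reference words (with multiplicity) found in candidate."""
    cand, ref = _tokens(candidate), _tokens(reference)
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to match, treat as full overlap
    shared = sum((cand & ref).values())  # multiset intersection
    return shared / total


score = word_overlap("Hello world foo", "hello world bar baz")
```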

Testing

# Run all tests
cargo test -p fastpdf-core

# Run a specific test
cargo test -p fastpdf-core test_cmap

# Performance benchmarks
cargo bench -p fastpdf-core

Current status: all 85 tests pass

  • Object parser: 45 tests
  • xref + trailer: 11 tests
  • Content stream + layout + fonts + recovery: 26 tests
  • Stream decoders (LZW/ASCII85/RunLength/ASCIIHex): 3 tests
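
For readers unfamiliar with the stream decoders named in the last item, here is a minimal Python sketch of RunLengthDecode as the PDF specification defines it. It is illustrative only and is not taken from the fastpdf test suite or Rust implementation.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode PDF RunLengthDecode data.

    Each run starts with a length byte:
      0..127   copy the next (length + 1) bytes literally
      129..255 repeat the next byte (257 - length) times
      128      end of data
    """
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # EOD marker
            break
        if length < 128:
            out += data[i:i + length + 1]
            i += length + 1
        else:
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Literal run "abc", then "x" repeated 3 times, then the EOD marker.
encoded = bytes([2]) + b"abc" + bytes([254]) + b"x" + bytes([128])
decoded = run_length_decode(encoded)
```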

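Dependencies

Crate        Purpose
memchr       SIMD byte scanning
fast-float2  Fast float parsing
flate2       zlib decompression
memmap2      Zero-copy file mapping
rayon        Parallel iterators
pyo3         Python bindings
crc32fast    PNG CRC checksums
fnv          Fast hashing
smallvec     Small-array optimization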

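Roadmap

  • Phase 1: Custom PDF parser
  • Phase 2: Content stream parsing + font handling
  • Phase 3: Layout analysis
  • Phase 4: Image extraction
  • Phase 5: Parallelization + I/O optimization
  • Phase 6: Performance benchmarks + comparison tests against PyMuPDF
  • Phase 7: PyPI release + CI/CD

See TODO.md for the full list of outstanding items.

License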

MIT
