The fastest PDF text and image extraction engine. Rust core + Python bindings.

Project description

flashpdf

The world's fastest PDF text and image extraction engine.

Rust core + Python bindings. The output uses PyMuPDF-compatible blocks / images structures.

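Features

  • Extreme performance: zero-copy end to end (mmap), SIMD byte scanning (memchr), fast float parsing (fast-float)
  • No loss of information: a complete text extraction pipeline, including CMaps, Type0 composite fonts, and recursive Form XObjects
  • Parallel processing: page-level parallelism via rayon, file-level parallelism, and asynchronous read-ahead
  • Robust error recovery: when the xref is damaged, falls back automatically to a full-file memchr scan
  • PyMuPDF compatible: the output structure matches PyMuPDF's exactly, so migration costs nothing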
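
The xref recovery mentioned in the list above can be illustrated in a few lines of Python. This is only a sketch of the general idea, not flashpdf's Rust implementation: it assumes a plain regex scan for `N G obj` headers instead of memchr, it cannot see objects stored inside compressed object streams, and the helper name `rebuild_xref` is hypothetical.

```python
import re

# Matches an indirect-object header such as b"12 0 obj".
OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def rebuild_xref(data: bytes) -> dict[int, tuple[int, int]]:
    """Map object number -> (generation, byte offset) by scanning the whole file.

    Hypothetical helper for illustration. A later definition of the same
    object number overwrites an earlier one, which mirrors how incremental
    updates override older objects.
    """
    table: dict[int, tuple[int, int]] = {}
    for match in OBJ_HEADER.finditer(data):
        obj_num = int(match.group(1))
        generation = int(match.group(2))
        table[obj_num] = (generation, match.start())
    return table


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
xref = rebuild_xref(sample)
for number, (generation, offset) in sorted(xref.items()):
    print(number, generation, offset)
```

A production parser would still validate each candidate (for example by checking for a matching `endobj`), since the byte pattern can also occur inside stream data.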

Installation

pip install flashpdf

Build from source:

# Requires the Rust toolchain (https://rustup.rs)
git clone https://github.com/yourname/flashpdf.git
cd flashpdf
pip install maturin
maturin develop --release

Quick start

Python

import flashpdf

# Extract a single document
blocks, images = flashpdf.extract("document.pdf")

for block in blocks:
    for line in block["lines"]:
        for span in line["spans"]:
            print(f"[{span['font']} {span['size']:.0f}] {span['text']}")

for img in images:
    print(f"Image: {img['width']}x{img['height']} {img['ext']}")
    # img['image'] holds the raw bytes (JPEG/PNG)

# Batch extraction (file-level parallelism)
for path, blocks, images in flashpdf.extract_many(
    ["a.pdf", "b.pdf", "c.pdf"],
    file_parallel=True,
    include_images=False
):
    print(f"{path}: {len(blocks)} blocks")

Rust

use flashpdf_core::{extract, ExtractOptions};

let options = ExtractOptions {
    page_parallel: true,
    include_images: true,
    batch_size: 50,
    ..Default::default()
};

let result = extract("document.pdf", &options)?;

for page in &result.pages {
    for block in &page.blocks {
        for line in &block.lines {
            for span in &line.spans {
                println!("[{} {:.0}] {}", span.font, span.size, span.text);
            }
        }
    }
}

API reference

flashpdf.extract(path, **options)

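Extracts text and images from a single PDF file.

Parameters:

Parameter       Type  Default   Description
path            str   required  Path to the PDF file
page_parallel   bool  True      Page-level parallelism (multi-core speedup)
include_images  bool  True      Whether to extract image data
gpu             bool  False     GPU acceleration (requires an NVIDIA GPU)
batch_size      int   50        Batch size for large documents (0 = no batching)

Returns: (blocks, images)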

blocks structure

[
    {
        "type": 0,                    # 0 = text block
        "bbox": (x0, y0, x1, y1),    # block bounding box
        "lines": [
            {
                "bbox": (x0, y0, x1, y1),
                "spans": [
                    {
                        "bbox": (x0, y0, x1, y1),
                        "text": "Hello World",
                        "font": "Helvetica",
                        "size": 12.0,
                        "color": 0,
                    }
                ]
            }
        ]
    }
]
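
Walking this structure is enough to recover plain text. The sketch below works on a hand-written sample shaped like the structure above, so it runs without flashpdf installed; `blocks_to_text` is a hypothetical helper name, not part of the package.

```python
def blocks_to_text(blocks: list) -> str:
    """Concatenate span text per line, emitting one output line per layout line."""
    out = []
    for block in blocks:
        if block.get("type") != 0:  # 0 marks a text block in the documented structure
            continue
        for line in block["lines"]:
            out.append("".join(span["text"] for span in line["spans"]))
    return "\n".join(out)


# Hand-written sample data; real input would come from flashpdf.extract(...).
sample_blocks = [
    {
        "type": 0,
        "bbox": (72.0, 72.0, 300.0, 100.0),
        "lines": [
            {
                "bbox": (72.0, 72.0, 300.0, 86.0),
                "spans": [
                    {"bbox": (72.0, 72.0, 110.0, 86.0), "text": "Hello ",
                     "font": "Helvetica", "size": 12.0, "color": 0},
                    {"bbox": (110.0, 72.0, 150.0, 86.0), "text": "World",
                     "font": "Helvetica-Bold", "size": 12.0, "color": 0},
                ],
            },
            {
                "bbox": (72.0, 86.0, 300.0, 100.0),
                "spans": [
                    {"bbox": (72.0, 86.0, 160.0, 100.0), "text": "Second line",
                     "font": "Helvetica", "size": 12.0, "color": 0},
                ],
            },
        ],
    }
]

print(blocks_to_text(sample_blocks))
```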

images structure

[
    {
        "bbox": (x0, y0, x1, y1),    # position on the page
        "width": 1920,                # width in pixels
        "height": 1080,               # height in pixels
        "bpc": 8,                     # bits per component
        "colorspace": "DeviceRGB",    # color space
        "xref": 42,                   # object number
        "ext": "jpeg",                # format: jpeg/png/jpx
        "image": b"\xff\xd8\xff...",   # raw bytes (None if include_images=False)
    }
]
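
Because each entry carries its raw bytes and a file extension, writing the images to disk needs no decoding. The following is a minimal sketch using sample data in the shape shown above; `save_images` and the file naming scheme are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_images(images: list, out_dir: str) -> list:
    """Write each image's raw bytes into out_dir and return the paths written."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, img in enumerate(images):
        data = img.get("image")
        if data is None:  # bytes are absent when include_images=False
            continue
        path = target / f"img_{index:04d}_xref{img['xref']}.{img['ext']}"
        path.write_bytes(data)
        written.append(path)
    return written


# Hand-written sample entries; real input would come from flashpdf.extract(...).
sample_images = [
    {"bbox": (0, 0, 10, 10), "width": 1, "height": 1, "bpc": 8,
     "colorspace": "DeviceRGB", "xref": 42, "ext": "jpeg",
     "image": b"\xff\xd8\xff\xd9"},
    {"bbox": (0, 0, 10, 10), "width": 1, "height": 1, "bpc": 8,
     "colorspace": "DeviceRGB", "xref": 43, "ext": "png", "image": None},
]

with tempfile.TemporaryDirectory() as tmp:
    written = save_images(sample_images, tmp)
    names = [p.name for p in written]
    sizes = [p.stat().st_size for p in written]

print(names)
```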

flashpdf.extract_many(paths, **options)

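Extracts text and images from multiple PDF files in one batch.

Parameters:

Parameter       Type       Default   Description
paths           list[str]  required  List of PDF file paths
file_parallel   bool       True      File-level parallelism
page_parallel   bool       False     Page-level parallelism (best left off when file_parallel is enabled)
include_images  bool       False     Whether to extract images
gpu             bool       False     GPU acceleration
batch_size      int        50        Batch size for large documents

Returns: [(path, blocks, images), ...]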

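Architecture

See the API documentation for the full API reference. Design documents: DESIGN_V1, DESIGN_V2.

PDF file
  │
  ├─ mmap mapping (zero-copy)
  │
  ├─ Custom parser (~800 lines)
  │   ├─ Object parsing (recursive descent)
  │   ├─ xref table / stream / ObjStm
  │   └─ memchr fallback (recovery from a damaged xref)
  │
  ├─ Content stream state machine
  │   ├─ BT/ET text objects
  │   ├─ Tj/TJ text-showing operators
  │   ├─ Td/TD/Tm matrix transforms
  │   ├─ Form XObject recursion (depth 3)
  │   └─ Do image capture
  │
  ├─ Font handling
  │   ├─ CMap parsing (bfchar/bfrange)
  │   ├─ Type0 composite fonts (CIDFont)
  │   ├─ Encoding Differences
  │   └─ Adobe Glyph List
  │
  ├─ Layout analysis
  │   └─ chars → spans → lines → blocks
  │
  ├─ Image extraction
  │   ├─ JPEG/JPX zero-copy (mmap slices)
  │   ├─ FlateDecode lazy PNG
  │   └─ Four-corner transform bbox
  │
  └─ Parallel scheduling
      ├─ rayon page-level parallelism
      ├─ File-level parallelism
      ├─ Asynchronous read-ahead
      └─ Automatic batching for large documents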
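
The four-corner bbox step in the image branch can be sketched as follows. In PDF, an image is painted into the unit square and placed by the current transformation matrix `[a b c d e f]`, so transforming the four corners and taking the extremes gives an axis-aligned box even for rotated images. The function name `image_bbox` is hypothetical, and the sketch stays in PDF user space (origin at the bottom left); any flip to a top-left origin is left out.

```python
def image_bbox(ctm):
    """Return (x0, y0, x1, y1) of the unit square after applying a PDF matrix.

    ctm is the six-number matrix (a, b, c, d, e, f), where a point (x, y)
    maps to (a*x + c*y + e, b*x + d*y + f).
    """
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# A 200x100 image placed upright at (50, 700).
upright = image_bbox((200, 0, 0, 100, 50, 700))

# An image rotated by 90 degrees; a naive (e, f, e + a, f + d) box would be wrong here.
rotated = image_bbox((0, 100, -200, 0, 250, 700))

print(upright, rotated)
```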

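Performance targets

Scenario                               Target                               Actual
Text extraction                        ≥ 2x PyMuPDF                         ~22x
Word overlap rate                      ≥ 95%                                96.9% (regex-based)
Image metadata (offsets only)          ≥ 50x PyMuPDF                        N/A¹
Image byte extraction (with decoding)  ≥ 5x PyMuPDF                         11.2x
Multi-file throughput                  Near-linear scaling with core count  ~3.2x (10 files)

¹ flashpdf's extract() is a single all-in-one call (text + images), so the time spent on image metadata extraction cannot be measured separately.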
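
The table does not define how the regex-based word overlap rate is computed. One plausible reading, shown here purely as an assumption, is the share of reference words (for example from PyMuPDF output) that also appear in the candidate text, counted as a multiset. The tokenizer pattern and the name `word_overlap` are illustrative choices, not the project's benchmark code.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")


def word_overlap(reference: str, candidate: str) -> float:
    """Fraction of reference words that are also present in candidate.

    Words are matched case-insensitively and counted with multiplicity,
    so a word occurring twice in the reference must occur twice in the
    candidate to count fully.
    """
    ref = Counter(WORD.findall(reference.lower()))
    cand = Counter(WORD.findall(candidate.lower()))
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to miss in an empty reference
    shared = sum((ref & cand).values())
    return shared / total


score = word_overlap("The quick brown fox", "the quick fox jumps")
print(f"{score:.1%}")
```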

See the performance benchmark report for details.

Testing

# Run all tests
cargo test -p flashpdf-core

# Run a specific test
cargo test -p flashpdf-core test_cmap

# Performance benchmarks
cargo bench -p flashpdf-core

Current status: all 85 tests pass

  • Object parser: 45 tests
  • xref + trailer: 11 tests
  • Content stream + layout + fonts + recovery: 26 tests
  • Stream decoders (LZW/ASCII85/RunLength/ASCIIHex): 3 tests
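
Of the stream decoders listed above, RunLength is the simplest to show. The sketch below follows the RunLengthDecode rules from the PDF specification (a length byte of 0 to 127 copies the next length + 1 bytes, 129 to 255 repeats the next byte 257 - length times, 128 ends the data). It is an independent Python illustration, not a port of flashpdf's Rust decoder.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode a PDF RunLengthDecode stream."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # literal run: copy the next length + 1 bytes
            out += data[i:i + length + 1]
            i += length + 1
        else:  # repeated run: next byte repeated 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Literal "abc", then "x" repeated three times, then the end-of-data marker.
encoded = bytes([2]) + b"abc" + bytes([254]) + b"x" + bytes([128])
decoded = run_length_decode(encoded)
print(decoded)
```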

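Dependencies

Crate        Purpose
memchr       SIMD byte scanning
fast-float2  Fast float parsing
flate2       zlib decompression
memmap2      Zero-copy file mapping
rayon        Parallel iterators
pyo3         Python bindings
crc32fast    PNG CRC checksums
fnv          Fast hashing
smallvec     Small-array optimization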
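
The PNG CRC entry exists because every PNG chunk ends with a CRC-32 computed over its type and payload. As a rough illustration of wrapping decoded pixel data in a PNG container, the sketch below uses Python's `zlib` in place of crc32fast and flate2. It assumes already-decoded 8-bit RGB samples with no predictor, and the names `png_chunk` and `rgb_to_png` are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one chunk: length, type, payload, CRC-32 over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def rgb_to_png(width: int, height: int, pixels: bytes) -> bytes:
    """Wrap raw 8-bit RGB samples (row-major, no padding) in a PNG file."""
    stride = width * 3
    if len(pixels) != stride * height:
        raise ValueError("pixel buffer does not match width x height")
    # Each PNG scanline starts with a filter-type byte; 0 means "no filter".
    raw = b"".join(
        b"\x00" + pixels[y * stride:(y + 1) * stride] for y in range(height)
    )
    # Bit depth 8, colour type 2 (RGB), default compression/filter, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", zlib.compress(raw))
        + png_chunk(b"IEND", b"")
    )


# A 2x1 image: one red pixel followed by one blue pixel.
png = rgb_to_png(2, 1, bytes([255, 0, 0, 0, 0, 255]))
print(len(png), png[:8])
```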

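Roadmap

  • Phase 1: Custom PDF parser
  • Phase 2: Content stream parsing + font handling
  • Phase 3: Layout analysis
  • Phase 4: Image extraction
  • Phase 5: Parallelization + I/O optimization
  • Phase 6: Performance benchmarks + comparison tests against PyMuPDF
  • Phase 7: PyPI release + CI/CD

See TODO.md for the full list of outstanding items.

License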

MIT

Download files

Source Distribution

flashpdf-0.1.0.tar.gz (59.4 kB), source

Built Distributions

flashpdf-0.1.0-cp39-abi3-win_amd64.whl (351.1 kB), CPython 3.9+, Windows x86-64

flashpdf-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl (517.8 kB), CPython 3.9+, manylinux: glibc 2.34+ x86-64

flashpdf-0.1.0-cp39-abi3-macosx_11_0_arm64.whl (451.2 kB), CPython 3.9+, macOS 11.0+ ARM64

File details

Details for the file flashpdf-0.1.0.tar.gz.

File metadata

  • Download URL: flashpdf-0.1.0.tar.gz
  • Upload date:
  • Size: 59.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flashpdf-0.1.0.tar.gz
Algorithm Hash digest
SHA256 1a0bc601f77e0af8c94eadc73380aef5482c9e08ba0540c1173519c59ec48b51
MD5 0bd1058e6c609bd3f7513c241d549e8c
BLAKE2b-256 c2479663046d2b98edcb795c692bacf3c8758eb0581ca0dbacec49213adafe71

Provenance

The following attestation bundles were made for flashpdf-0.1.0.tar.gz:

Publisher: release.yml on justcodew/flashpdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file flashpdf-0.1.0-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: flashpdf-0.1.0-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 351.1 kB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flashpdf-0.1.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 2052c511dbc4d0abc4ce26da74ead76789410ef4bdb3dd9be32a8514a2818078
MD5 f835b1baa31a70a90640251ef35712de
BLAKE2b-256 37c9f3a9efec39af8598a87ee26f2a835dbe50f194ce5d731963743da055947e

Provenance

The following attestation bundles were made for flashpdf-0.1.0-cp39-abi3-win_amd64.whl:

Publisher: release.yml on justcodew/flashpdf

File details

Details for the file flashpdf-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for flashpdf-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 054ea05bee7a35e46c1fe3f2fc886c3ee2fd826786bae167a9dcdb1013ecc6ec
MD5 7c23e5cc8370b66d7a715b8cddb988fc
BLAKE2b-256 fb4843fa281d142ecbdc227154c94abae6a5dc89b5104806a04608d2bff8f693

Provenance

The following attestation bundles were made for flashpdf-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on justcodew/flashpdf

File details

Details for the file flashpdf-0.1.0-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for flashpdf-0.1.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9ea85beae2d47e21e9a7c40cc9ef4b3464f3a0c9b6ca92e7970017f0d5dd103a
MD5 48fd256f232a299bd0afbd0e36a29746
BLAKE2b-256 148bcd53c0142e0aa811378df41b2833abc3f9343a1fbcaa3f1508a106504b8d

Provenance

The following attestation bundles were made for flashpdf-0.1.0-cp39-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on justcodew/flashpdf
