The fastest PDF text and image extraction engine. Rust core + Python bindings.

Project description

flashpdf

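The world's fastest PDF text and image extraction engine.

Rust core + Python bindings; the output uses blocks / images structures compatible with PyMuPDF.

Features

  • Extreme performance: zero-copy across the whole pipeline (mmap), SIMD byte scanning (memchr), fast float parsing (fast-float)
  • No information sacrificed: a complete text-extraction pipeline, including CMap, Type0 composite fonts, and recursive Form XObjects
  • Parallel processing: rayon page-level parallelism + file-level parallelism + asynchronous read-ahead
  • Robust fault tolerance: when the xref is corrupted, recovery runs automatically through a memchr scan of the whole file
  • PyMuPDF compatible: the output structure matches PyMuPDF's exactly, so migration costs nothing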
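
The xref recovery idea from the feature list can be illustrated in Python. This is a minimal sketch under stated assumptions, not flashpdf's Rust implementation: it assumes a damaged file can be re-indexed by scanning the raw bytes for `N G obj` headers, it uses a byte regex where the Rust core is described as using memchr, and `rebuild_xref` is a hypothetical helper name.

```python
import re

# Matches "<object number> <generation> obj" when it starts at a token boundary.
_OBJ_HEADER = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b")


def rebuild_xref(data: bytes) -> dict[tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of its header.

    When the same object appears more than once, the last occurrence wins,
    since incremental updates append newer versions at the end of the file.
    """
    offsets = {}
    for match in _OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        offsets[key] = match.start()
    return offsets
```

A scan like this can also match byte sequences inside stream data, so a real parser would still need to validate each candidate object before trusting it.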

Installation

pip install flashpdf

Build from source:

# Requires the Rust toolchain (https://rustup.rs)
git clone https://github.com/yourname/flashpdf.git
cd flashpdf
pip install maturin
maturin develop --release

Quick start

Python

import flashpdf

# Extract from a single document
blocks, images = flashpdf.extract("document.pdf")

for block in blocks:
    for line in block["lines"]:
        for span in line["spans"]:
            print(f"[{span['font']} {span['size']:.0f}] {span['text']}")

for img in images:
    print(f"Image: {img['width']}x{img['height']} {img['ext']}")
    # img['image'] holds the raw bytes (JPEG/PNG)

# Batch extraction (file-level parallelism)
for path, blocks, images in flashpdf.extract_many(
    ["a.pdf", "b.pdf", "c.pdf"],
    file_parallel=True,
    include_images=False
):
    print(f"{path}: {len(blocks)} blocks")

Rust

use flashpdf_core::{extract, ExtractOptions};

let options = ExtractOptions {
    page_parallel: true,
    include_images: true,
    batch_size: 50,
    ..Default::default()
};

let result = extract("document.pdf", &options)?;

for page in &result.pages {
    for block in &page.blocks {
        for line in &block.lines {
            for span in &line.spans {
                println!("[{} {:.0}] {}", span.font, span.size, span.text);
            }
        }
    }
}

API reference

flashpdf.extract(path, **options)

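Extracts text and images from a single PDF file.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | required | Path to the PDF file |
| page_parallel | bool | True | Page-level parallelism (multi-core speedup) |
| include_images | bool | True | Whether to extract image data |
| gpu | bool | False | GPU acceleration (requires an NVIDIA GPU) |
| batch_size | int | 50 | Batch size for large documents (0 = no batching) |

Returns: (blocks, images)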

blocks structure

[
    {
        "type": 0,                    # 0 = text block
        "bbox": (x0, y0, x1, y1),    # bounding box of the block
        "lines": [
            {
                "bbox": (x0, y0, x1, y1),
                "spans": [
                    {
                        "bbox": (x0, y0, x1, y1),
                        "text": "Hello World",
                        "font": "Helvetica",
                        "size": 12.0,
                        "color": 0,
                    }
                ]
            }
        ]
    }
]
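
The nesting above (blocks, then lines, then spans) can be walked to recover plain text. The sketch below relies only on the keys shown in the structure; `blocks_to_text` is a hypothetical helper, not part of the flashpdf API, and joining spans without a separator is an assumption about how span text is split.

```python
def blocks_to_text(blocks):
    """Flatten a blocks list into plain text.

    Spans on one line are concatenated, lines are separated by a newline,
    and blocks are separated by a blank line.
    """
    paragraphs = []
    for block in blocks:
        if block.get("type") != 0:  # only text blocks carry "lines"
            continue
        lines = [
            "".join(span["text"] for span in line["spans"])
            for line in block["lines"]
        ]
        paragraphs.append("\n".join(lines))
    return "\n\n".join(paragraphs)
```

With the package installed, the first element of the tuple returned by `flashpdf.extract("document.pdf")` could be passed straight to this helper.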

images structure

[
    {
        "bbox": (x0, y0, x1, y1),    # position on the page
        "width": 1920,                # width in pixels
        "height": 1080,               # height in pixels
        "bpc": 8,                     # bits per component
        "colorspace": "DeviceRGB",    # color space
        "xref": 42,                   # object number
        "ext": "jpeg",                # format: jpeg/png/jpx
        "image": b"\xff\xd8\xff...",   # raw bytes (None if include_images=False)
    }
]
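
Because each entry carries its raw bytes and a file extension, writing the images to disk needs no decoding step. The following is a small sketch that uses only the `image`, `ext`, and `xref` keys shown above; `save_images` and the file-naming scheme are hypothetical choices made for illustration.

```python
from pathlib import Path


def save_images(images, out_dir):
    """Write every image that has bytes to out_dir and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, img in enumerate(images):
        data = img.get("image")
        if data is None:  # bytes are absent when include_images=False
            continue
        path = out / f"img_{index:04d}_xref{img['xref']}.{img['ext']}"
        path.write_bytes(data)
        written.append(path)
    return written
```

Entries without bytes are skipped rather than treated as errors, so the same helper works whether or not image data was requested.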

flashpdf.extract_many(paths, **options)

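Extracts from multiple PDF files in one batch.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | list[str] | required | List of PDF file paths |
| file_parallel | bool | True | File-level parallelism |
| page_parallel | bool | False | Page-level parallelism (best left off when file_parallel is enabled) |
| include_images | bool | False | Whether to extract images |
| gpu | bool | False | GPU acceleration |
| batch_size | int | 50 | Batch size for large documents |

Returns: [(path, blocks, images), ...]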
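
Since the return value is a list of (path, blocks, images) tuples, per-file statistics are a short loop. This sketch assumes only that tuple shape and the blocks structure documented earlier; `summarize` is a hypothetical helper, not part of the package.

```python
def summarize(results):
    """Count blocks, spans, and images per file.

    `results` is any iterable of (path, blocks, images) tuples.
    """
    summary = {}
    for path, blocks, images in results:
        span_count = sum(
            len(line["spans"])
            for block in blocks
            for line in block.get("lines", [])
        )
        summary[path] = {
            "blocks": len(blocks),
            "spans": span_count,
            "images": len(images),
        }
    return summary
```

With the package installed, the list returned by `flashpdf.extract_many([...])` could be passed in directly.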

Architecture

See the API documentation for the complete API reference. For the design documents, see DESIGN_V1 and DESIGN_V2.

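PDF file
  │
  ├─ mmap mapping (zero-copy)
  │
  ├─ Custom parser (~800 lines)
  │   ├─ Object parsing (recursive descent)
  │   ├─ xref table / stream / ObjStm
  │   └─ memchr fallback (recovery from a corrupted xref)
  │
  ├─ Content-stream state machine
  │   ├─ BT/ET text objects
  │   ├─ Tj/TJ text operators
  │   ├─ Td/TD/Tm matrix transforms
  │   ├─ Form XObject recursion (depth 3)
  │   └─ Do image capture
  │
  ├─ Font handling
  │   ├─ CMap parsing (bfchar/bfrange)
  │   ├─ Type0 composite fonts (CIDFont)
  │   ├─ Encoding Differences
  │   └─ Adobe Glyph List
  │
  ├─ Layout analysis
  │   └─ chars → spans → lines → blocks
  │
  ├─ Image extraction
  │   ├─ JPEG/JPX zero-copy (mmap slices)
  │   ├─ FlateDecode lazy PNG
  │   └─ Four-corner transform bbox
  │
  └─ Parallel scheduling
      ├─ rayon page-level parallelism
      ├─ File-level parallelism
      ├─ Asynchronous read-ahead
      └─ Automatic batching of large documents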
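
The "four-corner transform bbox" step in the image branch can be made concrete. In PDF, an image is painted into the unit square and placed by the current transformation matrix [a b c d e f], so its bounding box is the extent of the four transformed corners. The sketch below illustrates that computation in PDF user space; it is not flashpdf's code, `image_bbox` is a hypothetical name, and any flip to a top-left origin (as PyMuPDF reports) is left out.

```python
def image_bbox(ctm):
    """Bounding box of the unit square after applying a PDF matrix.

    ctm is (a, b, c, d, e, f); a point (x, y) maps to
    (a*x + c*y + e, b*x + d*y + f).
    """
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

Transforming all four corners, rather than only two opposite ones, is what keeps the box correct for rotated or skewed images.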

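Performance targets

| Scenario | Target | Actual |
|---|---|---|
| Text extraction | ≥ 2x PyMuPDF | ~22x |
| Word overlap rate | ≥ 95% | 96.9% (regex-based) |
| Image metadata (offsets recorded only) | ≥ 50x PyMuPDF | N/A¹ |
| Image byte extraction (including decoding) | ≥ 5x PyMuPDF | 11.2x |
| Multi-file throughput | Near-linear scaling with core count | ~3.2x (10 files) |

¹ flashpdf's extract() is a single call that does everything (text + images), so the time spent on image-metadata extraction cannot be measured separately.

See the performance benchmark report for details.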
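
The table reports a regex-based word overlap rate but does not define it here. One plausible reading, shown purely as an assumption, is the share of word tokens in a reference extraction that also appear in the candidate extraction, counted with multiplicity. `word_overlap` and the `\w+` tokenizer are illustrative choices, not the benchmark's actual code.

```python
import re
from collections import Counter

_WORD = re.compile(r"\w+")


def word_overlap(reference: str, candidate: str) -> float:
    """Fraction of reference word tokens that also occur in candidate."""
    ref = Counter(_WORD.findall(reference.lower()))
    cand = Counter(_WORD.findall(candidate.lower()))
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to recover, so nothing was missed
    shared = sum((ref & cand).values())  # multiset intersection
    return shared / total
```

A metric of this kind ignores word order and layout, so it measures whether the text was recovered, not whether it was placed correctly.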

Testing

# Run all tests
cargo test -p flashpdf-core

# Run a specific test
cargo test -p flashpdf-core test_cmap

# Performance benchmarks
cargo bench -p flashpdf-core

Current status: all 85 tests pass

  • Object parser: 45 tests
  • xref + trailer: 11 tests
  • Content stream + layout + fonts + recovery: 26 tests
  • Stream decoders (LZW/ASCII85/RunLength/ASCIIHex): 3 tests
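
To show what one of the listed stream decoders has to do, here is a sketch of the PDF RunLengthDecode filter in Python. It follows the rule in the PDF specification (a length byte below 128 copies the next length + 1 bytes, a length byte above 128 repeats the next byte 257 - length times, and 128 marks end of data). It is an independent illustration, not a port of flashpdf's Rust decoder, and `run_length_decode` is a hypothetical name.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode a PDF RunLengthDecode stream."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # literal run: copy the next length + 1 bytes
            out += data[i:i + length + 1]
            i += length + 1
        else:  # repeated run: next byte appears 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)
```

Truncated input is tolerated here by copying whatever bytes remain; a stricter decoder could raise an error instead.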

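Dependencies

| Crate | Purpose |
|---|---|
| memchr | SIMD byte scanning |
| fast-float2 | Fast float parsing |
| flate2 | zlib decompression |
| memmap2 | Zero-copy file mapping |
| rayon | Parallel iterators |
| pyo3 | Python bindings |
| crc32fast | PNG CRC checksums |
| fnv | Fast hashing |
| smallvec | Small-array optimization |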
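
The table pairs a zlib crate with a CRC crate because a PNG file needs both: pixel rows are deflate-compressed, and every chunk ends with a CRC-32. The sketch below builds a minimal PNG from decoded pixels using Python's standard library. It assumes 8-bit RGB input and is not how flashpdf assembles its PNG output; `rgb_to_png` and `_chunk` are hypothetical names.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, payload: bytes) -> bytes:
    """Length, type, payload, then CRC-32 over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def rgb_to_png(width: int, height: int, pixels: bytes) -> bytes:
    """Wrap tightly packed 8-bit RGB rows in a PNG container."""
    stride = width * 3
    if len(pixels) != stride * height:
        raise ValueError("pixel buffer does not match width * height * 3")
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + pixels[y * stride:(y + 1) * stride] for y in range(height)
    )
    # Bit depth 8, color type 2 (RGB), default compression/filter, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )
```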

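Roadmap

  • Phase 1: Custom PDF parser
  • Phase 2: Content-stream parsing + font handling
  • Phase 3: Layout analysis
  • Phase 4: Image extraction
  • Phase 5: Parallelization + I/O optimization
  • Phase 6: Performance benchmarks + PyMuPDF comparison tests
  • Phase 7: PyPI release + CI/CD

See TODO.md for the full list of outstanding items.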

License

MIT

Download files

Source Distribution

  • flashpdf-0.1.1.tar.gz (63.2 kB), Source

Built Distributions

  • flashpdf-0.1.1-cp39-abi3-win_amd64.whl (355.8 kB), CPython 3.9+, Windows x86-64
  • flashpdf-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (520.4 kB), CPython 3.9+, manylinux: glibc 2.17+ x86-64
  • flashpdf-0.1.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (504.2 kB), CPython 3.9+, manylinux: glibc 2.17+ ARM64
  • flashpdf-0.1.1-cp39-abi3-macosx_11_0_arm64.whl (453.5 kB), CPython 3.9+, macOS 11.0+ ARM64
  • flashpdf-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl (466.8 kB), CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file flashpdf-0.1.1.tar.gz.

File metadata

  • Download URL: flashpdf-0.1.1.tar.gz
  • Size: 63.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flashpdf-0.1.1.tar.gz
Algorithm Hash digest
SHA256 126bc3505d6279d56bcfe2a261cc87aded151650e34c6c9207884a17dfd13379
MD5 ff4d4a94f4b62eccbc73cfd82265bf58
BLAKE2b-256 ca008c23f07233e70ccbb03886bb9f143a5b55cc87eb5a8e2dedac3f9df24814

See more details on using hashes here.

Provenance

The following attestation bundles were made for flashpdf-0.1.1.tar.gz:

Publisher: build-wheels.yml on justcodew/flashpdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file flashpdf-0.1.1-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: flashpdf-0.1.1-cp39-abi3-win_amd64.whl
  • Size: 355.8 kB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flashpdf-0.1.1-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 3a08b25eed695f97b332e1e42eb38010638d09a99d8930cf6775495a909930fa
MD5 c5a0d559379348012a58cd0ef583ba1d
BLAKE2b-256 3942d5c86272ea437cec4a7db8192c2645e6338193e08f716d6289ad8f611466

File details

Details for the file flashpdf-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for flashpdf-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 20b9ff22edf152b343517fafb4c0d5bb0f7d7fc61d763e48d24cae6216e1cc73
MD5 3dfb4bbe18e7e35fbb3d2c70bf8a536d
BLAKE2b-256 f36dab66eb28614c419b14dc67700d4401fdd8203349dc199b07893df2d2c8d3

File details

Details for the file flashpdf-0.1.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for flashpdf-0.1.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 fa8920c93ea811ac3a7f3ba08a7171f9632b5bee3a72b8f46466d175842316e8
MD5 ee54981942953031f43271160f8ffdfd
BLAKE2b-256 4ab682a5103f12b19b6d4f6d30b5a64c2bad6d73fcca552340353af81e8d5b46

File details

Details for the file flashpdf-0.1.1-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for flashpdf-0.1.1-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 44cc03f159aa657c19c4562049478d16a804a58a978ff8ff30f5a384615a0284
MD5 6ef5646e0586ebe700bf35672c954090
BLAKE2b-256 4d8cdd856f29ffc9468d5a3d4c94a3a672c55a7ab6f04e3ab63706dc22fca806

File details

Details for the file flashpdf-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for flashpdf-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 1b5a349cd91b835ccad1ba655537076ace04e928cf860754b2d87d72ab85fefb
MD5 729a06f3243bf95af52e2e998c34aae2
BLAKE2b-256 57aa0152d12edbc6b79e5f55dca0496ac919be566e862bb472e427739543c0dc
