The fastest PDF text and image extraction engine. Rust core + Python bindings.
fastpdf
The world's fastest PDF text and image extraction engine.
Rust core + Python bindings; outputs PyMuPDF-compatible blocks and images structures.
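Features
- Extreme performance: zero-copy throughout (mmap), SIMD byte scanning (memchr), fast float parsing (fast-float)
- No information sacrificed: a complete text extraction pipeline, including CMap, Type0 composite fonts, and recursive Form XObjects
- Parallel processing: rayon page-level parallelism + file-level parallelism + asynchronous read-ahead
- Robust error tolerance: when the xref is corrupted, recovers automatically with a memchr scan of the whole file
- PyMuPDF compatible: the output structure is identical to PyMuPDF's, so migration costs nothing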
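The xref recovery bullet above can be illustrated with a short sketch. This is not fastpdf's implementation (which is written in Rust and uses memchr); it is a hypothetical Python analogue, and `scan_objects` is a name invented here. It scans the raw bytes for `N G obj` headers to rebuild an object-offset table when the cross-reference data cannot be trusted.

```python
import re

# Matches an indirect object header such as b"12 0 obj" at a token boundary.
_OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def scan_objects(data):
    """Map object number -> byte offset of its header.

    Later definitions overwrite earlier ones, which mirrors how
    incremental updates append newer versions of an object.
    """
    offsets = {}
    for match in _OBJ_HEADER.finditer(data):
        offsets[int(match.group(1))] = match.start()
    return offsets


sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
offsets = scan_objects(sample)
for number, offset in sorted(offsets.items()):
    print(number, offset)
```

A real recovery pass would also need to locate the trailer dictionary and handle objects stored inside object streams, which this sketch ignores.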
Installation
pip install fastpdf
Build from source:
# Requires the Rust toolchain (https://rustup.rs)
git clone https://github.com/yourname/fastpdf.git
cd fastpdf
pip install maturin
maturin develop --release
Quick start
Python
import fastpdf

# Extract a single document
blocks, images = fastpdf.extract("document.pdf")
for block in blocks:
    for line in block["lines"]:
        for span in line["spans"]:
            print(f"[{span['font']} {span['size']:.0f}] {span['text']}")
for img in images:
    print(f"Image: {img['width']}x{img['height']} {img['ext']}")
    # img['image'] holds the raw bytes (JPEG/PNG)

# Batch extraction (file-level parallelism)
for path, blocks, images in fastpdf.extract_many(
    ["a.pdf", "b.pdf", "c.pdf"],
    file_parallel=True,
    include_images=False,
):
    print(f"{path}: {len(blocks)} blocks")
Rust
use fastpdf_core::{extract, ExtractOptions};

let options = ExtractOptions {
    page_parallel: true,
    include_images: true,
    batch_size: 50,
    ..Default::default()
};
let result = extract("document.pdf", &options)?;
for page in &result.pages {
    for block in &page.blocks {
        for line in &block.lines {
            for span in &line.spans {
                println!("[{} {:.0}] {}", span.font, span.size, span.text);
            }
        }
    }
}
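API reference
fastpdf.extract(path, **options)
Extracts text and images from a single PDF file.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | required | Path to the PDF file |
| page_parallel | bool | True | Page-level parallelism (multi-core speedup) |
| include_images | bool | True | Whether to extract image data |
| gpu | bool | False | GPU acceleration (requires an NVIDIA GPU) |
| batch_size | int | 50 | Batch size for large documents (0 = no batching) |
Returns: (blocks, images)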
blocks structure
[
    {
        "type": 0,                      # 0 = text block
        "bbox": (x0, y0, x1, y1),       # bounding box of the block
        "lines": [
            {
                "bbox": (x0, y0, x1, y1),
                "spans": [
                    {
                        "bbox": (x0, y0, x1, y1),
                        "text": "Hello World",
                        "font": "Helvetica",
                        "size": 12.0,
                        "color": 0,
                    }
                ]
            }
        ]
    }
]
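As an illustration of walking this structure, the sketch below flattens blocks into plain text. It runs on a hand-written sample shaped like the structure above rather than on real fastpdf output, and `blocks_to_text` is a hypothetical helper, not part of the fastpdf API.

```python
def blocks_to_text(blocks):
    """Join span text line by line; text blocks are separated by a blank line."""
    paragraphs = []
    for block in blocks:
        if block.get("type", 0) != 0:  # skip anything that is not a text block
            continue
        lines = [
            "".join(span["text"] for span in line["spans"])
            for line in block["lines"]
        ]
        paragraphs.append("\n".join(lines))
    return "\n\n".join(paragraphs)


# Hand-written sample that follows the documented layout (bbox values are made up).
sample_blocks = [
    {
        "type": 0,
        "bbox": (72.0, 72.0, 300.0, 100.0),
        "lines": [
            {
                "bbox": (72.0, 72.0, 300.0, 86.0),
                "spans": [
                    {"bbox": (72.0, 72.0, 110.0, 86.0), "text": "Hello ",
                     "font": "Helvetica", "size": 12.0, "color": 0},
                    {"bbox": (110.0, 72.0, 150.0, 86.0), "text": "World",
                     "font": "Helvetica-Bold", "size": 12.0, "color": 0},
                ],
            },
            {
                "bbox": (72.0, 86.0, 300.0, 100.0),
                "spans": [
                    {"bbox": (72.0, 86.0, 160.0, 100.0), "text": "second line",
                     "font": "Helvetica", "size": 12.0, "color": 0},
                ],
            },
        ],
    }
]

text = blocks_to_text(sample_blocks)
print(text)
```

Spans are concatenated without a separator because a span boundary usually marks a font or style change inside a line, not a word break; adjust this if your documents behave differently.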
images structure
[
    {
        "bbox": (x0, y0, x1, y1),       # position on the page
        "width": 1920,                  # width in pixels
        "height": 1080,                 # height in pixels
        "bpc": 8,                       # bits per component
        "colorspace": "DeviceRGB",      # color space
        "xref": 42,                     # object number
        "ext": "jpeg",                  # format: jpeg/png/jpx
        "image": b"\xff\xd8\xff...",    # raw bytes (None if include_images=False)
    }
]
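One way to use these records is to write the raw bytes to disk, named by object number and extension. The sketch below assumes only the `xref`, `ext`, and `image` keys shown above; `save_images` and the file naming scheme are hypothetical, and the sample records are hand-written stand-ins for real output.

```python
import tempfile
from pathlib import Path


def save_images(images, out_dir):
    """Write each image's raw bytes into out_dir and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, img in enumerate(images):
        data = img.get("image")
        if data is None:  # bytes are absent when include_images=False
            continue
        target = out / f"image_{index}_xref{img['xref']}.{img['ext']}"
        target.write_bytes(data)
        written.append(target)
    return written


# Stand-in records: one with bytes, one without.
sample_images = [
    {"xref": 42, "ext": "jpeg", "image": b"\xff\xd8\xff\xd9"},
    {"xref": 43, "ext": "png", "image": None},
]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_images(sample_images, tmp)
    saved_names = [path.name for path in saved]
    saved_sizes = [path.stat().st_size for path in saved]

print(saved_names)
```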
fastpdf.extract_many(paths, **options)
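Extracts multiple PDF files in one batch.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | list[str] | required | List of PDF file paths |
| file_parallel | bool | True | File-level parallelism |
| page_parallel | bool | False | Page-level parallelism (recommended off when it would conflict with file_parallel) |
| include_images | bool | False | Whether to extract images |
| gpu | bool | False | GPU acceleration |
| batch_size | int | 50 | Batch size for large documents |
Returns: [(path, blocks, images), ...]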
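Because the return value is a list of (path, blocks, images) tuples, per-file statistics are easy to derive. The sketch below uses a hand-written list in that shape instead of calling fastpdf, and `summarize` is a hypothetical helper introduced only for illustration.

```python
def summarize(results):
    """Return {path: (block_count, span_count, image_count)} for each result tuple."""
    summary = {}
    for path, blocks, images in results:
        span_count = sum(
            len(line["spans"]) for block in blocks for line in block["lines"]
        )
        summary[path] = (len(blocks), span_count, len(images))
    return summary


# Stand-in for the list returned by a batch extraction.
sample_results = [
    ("a.pdf",
     [{"type": 0, "lines": [{"spans": [{"text": "A"}, {"text": "B"}]}]}],
     []),
    ("b.pdf", [], [{"xref": 7, "ext": "png", "image": None}]),
]

summary = summarize(sample_results)
for path, counts in summary.items():
    print(path, counts)
```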
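Architecture
See the API documentation for the complete API reference. The design documents are DESIGN_V1 and DESIGN_V2.
PDF file
│
├─ mmap mapping (zero-copy)
│
├─ Custom parser (~800 lines)
│   ├─ Object parsing (recursive descent)
│   ├─ xref table / stream / ObjStm
│   └─ memchr fallback (recovery from a corrupted xref)
│
├─ Content stream state machine
│   ├─ BT/ET text blocks
│   ├─ Tj/TJ text operators
│   ├─ Td/TD/Tm matrix transforms
│   ├─ Form XObject recursion (depth 3)
│   └─ Do image capture
│
├─ Font handling
│   ├─ CMap parsing (bfchar/bfrange)
│   ├─ Type0 composite fonts (CIDFont)
│   ├─ Encoding Differences
│   └─ Adobe Glyph List
│
├─ Layout analysis
│   └─ chars → spans → lines → blocks
│
├─ Image extraction
│   ├─ JPEG/JPX zero-copy (mmap slices)
│   ├─ FlateDecode lazy PNG
│   └─ Four-corner transform bbox
│
└─ Parallel scheduling
    ├─ rayon page-level parallelism
    ├─ File-level parallelism
    ├─ Asynchronous read-ahead
    └─ Automatic batching of large documents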
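The "four-corner transform bbox" entry presumably refers to the standard PDF rule that an image is painted into the unit square and placed by the current transformation matrix. A minimal sketch of that rule follows; it is an assumption about what fastpdf does, `image_bbox` is a hypothetical name, and the result is in PDF user space (it does not apply the top-left origin flip that PyMuPDF-style coordinates use).

```python
def image_bbox(ctm):
    """Bounding box of the unit square after applying the matrix [a b c d e f].

    A point (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
    """
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# A 200x100 image placed at (50, 60) without rotation.
upright = image_bbox((200, 0, 0, 100, 50, 60))
# A 100x200 image rotated by 90 degrees so that it covers the same area.
rotated = image_bbox((0, 100, -200, 0, 250, 60))
print(upright)
print(rotated)
```

Transforming all four corners, rather than only two opposite ones, is what keeps the box correct for rotated or sheared placements.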
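Performance targets
| Scenario | Target | Actual |
|---|---|---|
| Text extraction | ≥ 2x PyMuPDF | ~22x |
| Word overlap rate | ≥ 95% | 96.9% (regex-based) |
| Image metadata (offsets only) | ≥ 50x PyMuPDF | N/A¹ |
| Image byte extraction (including decoding) | ≥ 5x PyMuPDF | 11.2x ✅ |
| Multi-file throughput | Near-linear scaling with core count | ~3.2x (10 files) |
¹ fastpdf's extract() is a single all-in-one call (text + images), so the time spent on image metadata extraction cannot be measured separately.
See the performance benchmark report for details.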
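The table does not define the word overlap rate beyond "regex-based". One plausible reading, sketched below purely as an assumption, is the share of distinct words in a reference extraction that also appear in the candidate extraction, with words found by a `\w+` regex. The function name `word_overlap` and the case-folding step are choices made for this sketch, not taken from the benchmark code.

```python
import re


def word_overlap(reference, candidate):
    """Fraction of distinct reference words that also occur in the candidate."""
    ref_words = set(re.findall(r"\w+", reference.lower()))
    cand_words = set(re.findall(r"\w+", candidate.lower()))
    if not ref_words:
        return 1.0  # nothing to recover, so nothing was missed
    return len(ref_words & cand_words) / len(ref_words)


score = word_overlap("The quick brown fox", "the quick red fox")
print(f"{score:.2%}")
```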
Testing
# Run all tests
cargo test -p fastpdf-core
# Run a specific test
cargo test -p fastpdf-core test_cmap
# Performance benchmarks
cargo bench -p fastpdf-core
Current status: all 85 tests pass ✅
- Object parser: 45 tests
- xref + trailer: 11 tests
- Content stream + layout + fonts + recovery: 26 tests
- Stream decoders (LZW/ASCII85/RunLength/ASCIIHex): 3 tests
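Dependencies
| Crate | Purpose |
|---|---|
| memchr | SIMD byte scanning |
| fast-float2 | Fast float parsing |
| flate2 | zlib decompression |
| memmap2 | Zero-copy file mapping |
| rayon | Parallel iterators |
| pyo3 | Python bindings |
| crc32fast | PNG CRC checksums |
| fnv | Fast hashing |
| smallvec | Small-array optimization |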
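Roadmap
- Stage 1: Custom PDF parser
- Stage 2: Content stream parsing + font handling
- Stage 3: Layout analysis
- Stage 4: Image extraction
- Stage 5: Parallelization + I/O optimization
- Stage 6: Performance benchmarks + PyMuPDF comparison tests
- Stage 7: PyPI release + CI/CD
See TODO.md for the complete list of outstanding items.
License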
MIT