Crop image/table/code regions from PDF files and export metadata
PDF Image Table Cropper
A pip-installable CLI + SDK tool that detects and crops
image / table / code / algorithm regions from PDFs,
then exports JPG files and metadata.json.
No OCR. No full-text extraction.
1. Install
Install from PyPI (recommended for production):
pip install pdf-image-table-cropper
Recommended for development:
pip install -e .
Or standard local install:
pip install .
Recommended Python version: >=3.10.
2. Quick Start
CLI mode after installation:
pdf-image-table-cropper \
    -i /path/to/paper.pdf \
    -o ./output
Output path: ./output/<pdf_stem>/.
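To locate the results from a script, the per-PDF directory can be derived from the input file name. The sketch below assumes that <pdf_stem> is the input file name without its final extension; the helper name expected_output_dir is hypothetical and not part of the package.

```python
from pathlib import Path


def expected_output_dir(input_pdf: str, output_root: str) -> Path:
    """Return <output_root>/<pdf_stem> for a given PDF (hypothetical helper)."""
    # Path.stem drops the directory part and the final suffix: "paper.pdf" -> "paper".
    return Path(output_root) / Path(input_pdf).stem


print(expected_output_dir("/path/to/paper.pdf", "./output"))
```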
3. Common Examples
Export table regions only, with selected pages:
pdf-image-table-cropper \
    -i paper.pdf \
    -o ./output \
    --type table \
    --pages 1-5,8,10
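A page specification such as 1-5,8,10 reads as a comma-separated list of single pages and inclusive ranges. The following sketch shows one way to expand such a string into page numbers. It is an illustration only, not the tool's own parser: the function name expand_pages is hypothetical, and 1-based numbering and the handling of out-of-range pages are assumptions.

```python
def expand_pages(spec: str, page_count: int) -> list[int]:
    """Expand 'all', '3', '1-3' or '1-3,7,10' into sorted page numbers.

    Assumption: pages are 1-based and ranges are inclusive.
    """
    if spec.strip().lower() == "all":
        return list(range(1, page_count + 1))
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages beyond the end of the document are dropped silently.
        pages.update(p for p in range(start, end + 1) if p <= page_count)
    return sorted(pages)


print(expand_pages("1-5,8,10", page_count=12))
```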
Default model (docling Heron, primary detector):
pdf-image-table-cropper \
    -i paper.pdf \
    -o ./output \
    --type both
Enable OpenDataLab supplementary detection (off by default):
pdf-image-table-cropper \
    -i paper.pdf \
    -o ./output \
    --enable-opendatalab
Enable local daemon mode to reuse loaded models across commands:
pdf-image-table-cropper \
    -i paper.pdf \
    -o ./output \
    --daemon-mode auto \
    --daemon-idle-seconds 300
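Daemon mode matters most when the CLI is invoked many times in a row, for example over a folder of PDFs. A minimal sketch of such a batch driver is shown here. It uses only the flags listed in this README; the helper names build_command and crop_many are hypothetical, and how much time the daemon actually saves is not something this sketch measures.

```python
import subprocess


def build_command(input_pdf, output_dir, daemon_mode="auto", idle_seconds=300):
    """Build the CLI argument list for one PDF (hypothetical helper)."""
    if daemon_mode not in {"off", "auto", "on"}:
        raise ValueError(f"unknown daemon mode: {daemon_mode!r}")
    return [
        "pdf-image-table-cropper",
        "-i", str(input_pdf),
        "-o", str(output_dir),
        "--daemon-mode", daemon_mode,
        "--daemon-idle-seconds", str(idle_seconds),
    ]


def crop_many(pdfs, output_dir):
    """Run the CLI once per PDF; later runs can reuse a still-running daemon."""
    for pdf in pdfs:
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(build_command(pdf, output_dir), check=True)
```

Calling crop_many(["a.pdf", "b.pdf"], "./output") would issue two CLI runs back to back, well within the 300-second idle window.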
Increase rendering quality (slower):
pdf-image-table-cropper \
    -i paper.pdf \
    -o ./output \
    --dpi 300
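The cost of a higher DPI can be estimated with simple arithmetic: PDF page sizes are given in points (1/72 inch by default), so a page rendered at a given DPI is scaled by dpi / 72 in each direction. The sketch below assumes the renderer follows this standard scaling; rendered_size is a hypothetical helper, and the US Letter size (612 x 792 points) is used only as an example.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel size of a page rendered at the given DPI (assumes plain scaling)."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


for dpi in (200, 300):
    w, h = rendered_size(612, 792, dpi)  # US Letter page in points
    print(f"{dpi} dpi: {w} x {h} px")
```

Going from 200 to 300 DPI multiplies the pixel count by (300/200)^2 = 2.25, which is why rendering and detection slow down noticeably.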
4. CLI Reference
Required:
- -i, --input-pdf: input PDF path
- -o, --output-dir, --storage-root: output root directory (aliases)
Core optional arguments:
- --type {both,image,table,code,algorithm} (default: both)
- --pages all|1|1-3|1-3,7,10 (default: all)
- --dpi (default: 200)
- --imgsz: OpenDataLab supplementary input size (default: 1280)
- --conf: OpenDataLab supplementary confidence threshold (default: 0.10)
- --iou: OpenDataLab supplementary IoU threshold (default: 0.45)
- --device cpu|mps|cuda|cuda:0... (default: auto)
- --no-merge: disable connected-components merging (enabled by default)
- --heron-model MODEL_ID: primary docling Heron model (enabled by default)
- --heron-conf: Heron confidence threshold (default: 0.5)
- --enable-opendatalab: enable OpenDataLab YOLO supplement (default: off)
- --metadata-file (default: metadata.json)
- --daemon-mode {off,auto,on}: model lifecycle mode (default: off)
- --daemon-socket: local Unix socket path (default: /tmp/pdf_cropper_modeld_<uid>.sock)
- --daemon-idle-seconds: daemon idle timeout before auto-exit (default: 300)
- --daemon-start-timeout: daemon startup wait timeout in seconds (default: 12)
- --daemon-run-timeout: timeout when waiting for a daemon job response; 0 means no timeout (default: 0)
Model download arguments:
- --model-repo: OpenDataLab supplementary model repo (default: opendatalab/PDF-Extract-Kit-1.0)
- --model-file: OpenDataLab supplementary model file in snapshot (default: models/Layout/YOLO/doclayout_yolo_docstructbench_imgsz1280_2501.pt)
- --hf-cache-dir
- --hf-token
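For readers unfamiliar with the --conf and --iou thresholds, the sketch below shows what such thresholds conventionally mean in YOLO-style post-processing: detections below the confidence threshold are dropped, and among overlapping boxes only the highest-scoring one is kept (greedy non-maximum suppression). This is a generic illustration, not the package's code; the function names and the (x0, y0, x1, y1) box layout are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf=0.10, iou_thr=0.45):
    """Greedy NMS over a list of (box, score) pairs."""
    # Drop low-confidence detections, then visit the rest from best to worst.
    candidates = sorted(
        (d for d in dets if d[1] >= conf), key=lambda d: d[1], reverse=True
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a better box too strongly.
        if all(iou(box, kept_box) <= iou_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

Under this reading, lowering the confidence threshold admits more candidate regions, while the IoU threshold controls how much two boxes may overlap before the weaker one is discarded.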
5. Output Layout
output/
└── <pdf_stem>/
├── metadata.json
├── image/
├── table/
├── code/
└── algorithm/
Crop file pattern:
p{page}_{type}_{idx}_{x0}_{y0}_{x1}_{y1}[_merged].jpg
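Because the file name encodes page, type, index, and box, crops can be indexed without opening metadata.json. The sketch below parses names that follow the pattern above. It assumes every numeric field is a non-negative integer (possibly zero-padded), which the pattern does not state; parse_crop_name and the sample file name are hypothetical.

```python
import re

# Assumption: page, idx and the four coordinates are written as plain integers.
CROP_NAME = re.compile(
    r"^p(?P<page>\d+)_(?P<type>image|table|code|algorithm)"
    r"_(?P<idx>\d+)_(?P<x0>\d+)_(?P<y0>\d+)_(?P<x1>\d+)_(?P<y1>\d+)"
    r"(?P<merged>_merged)?\.jpg$"
)


def parse_crop_name(name: str):
    """Return the fields encoded in a crop file name, or None if it does not match."""
    m = CROP_NAME.match(name)
    if m is None:
        return None
    info = {key: int(m[key]) for key in ("page", "idx", "x0", "y0", "x1", "y1")}
    info["type"] = m["type"]
    info["merged"] = m["merged"] is not None
    return info


print(parse_crop_name("p3_table_0_10_20_300_400_merged.jpg"))
```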
Each crop item in metadata.json includes:
- content_type
- page_index, page_number
- page_size_pdf
- bbox_pdf (PDF coordinates)
- bbox_pixels (pixel coordinates)
- score
- image_path
- merged_from (if merged)
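Once the crop items are loaded, the fields above are enough to summarize or filter a run. The sketch below works on a plain list of item dictionaries; how that list is nested inside metadata.json is not described here, so extracting it from the file is left to the reader. The helper names and the sample values are illustrative only.

```python
from collections import Counter


def count_by_type(items):
    """Count crop items per content_type."""
    return Counter(item["content_type"] for item in items)


def select_paths(items, content_type, min_score=0.0):
    """Return image_path for items of one type at or above a score."""
    return [
        item["image_path"]
        for item in items
        if item["content_type"] == content_type and item["score"] >= min_score
    ]


# Illustrative items using only the documented field names; values are made up.
sample = [
    {"content_type": "table", "page_number": 1, "score": 0.92, "image_path": "table/a.jpg"},
    {"content_type": "image", "page_number": 2, "score": 0.40, "image_path": "image/b.jpg"},
    {"content_type": "table", "page_number": 3, "score": 0.55, "image_path": "table/c.jpg"},
]
print(count_by_type(sample))
print(select_paths(sample, "table", min_score=0.6))
```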
6. Tuning
- Too many misses: lower --conf (e.g. 0.05~0.10)
- Too many false positives: raise --conf (e.g. 0.15~0.30)
- Need finer detail: raise --dpi (commonly 300)
- Too slow or OOM: lower --dpi / --imgsz, or use --device cpu
- Exporting algorithm requires --enable-opendatalab
7. License (Repository Code + Models)
The repository source code is released under the MIT License.
Model licenses:
- docling-project/docling-layout-heron: Apache-2.0
- opendatalab/PDF-Extract-Kit-1.0: AGPL-3.0 (non-commercial license; take note and use as appropriate)
8. Documentation Versions
- Chinese: README.md
- English (current): README.en.md
9. SDK Usage
from pdf_cropper import CropJobConfig, crop_pdf, crop_pdf_simple

# Option 1: pass the common options directly as keyword arguments.
result = crop_pdf_simple(
    input_pdf="paper.pdf",
    output_dir="./output",
    detect_type="table",
)

# Option 2: build a CropJobConfig first, then run the job with it.
config = CropJobConfig(
    input_pdf="paper.pdf",
    output_dir="./output",
    detect_type="both",
    pages="all",
    dpi=200,
    enable_opendatalab=False,
)
result = crop_pdf(config)
print(result["metadata_file"])
File details
Details for the file pdf_image_table_cropper-0.1.2.tar.gz.
File metadata
- Download URL: pdf_image_table_cropper-0.1.2.tar.gz
- Upload date:
- Size: 17.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dcfb14b8adf0ea691d1eb67ec02eab508df554694be0698945773bb3fb4da9be |
| MD5 | f53483696e7ae2e4981cff0689a835e2 |
| BLAKE2b-256 | 1b5637ecb07f5f8df10a4879173ed22545ba57c39930fb0076bcc70aa96be551 |
Provenance
The following attestation bundles were made for pdf_image_table_cropper-0.1.2.tar.gz:
Publisher: release.yml on meomeo-dev/pdf_image_table_cropper
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pdf_image_table_cropper-0.1.2.tar.gz
  - Subject digest: dcfb14b8adf0ea691d1eb67ec02eab508df554694be0698945773bb3fb4da9be
- Sigstore transparency entry: 1101566106
- Sigstore integration time:
- Permalink: meomeo-dev/pdf_image_table_cropper@cb3d6cc79fcb4dd924145f157927fad32426b82d
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/meomeo-dev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cb3d6cc79fcb4dd924145f157927fad32426b82d
- Trigger Event: push
File details
Details for the file pdf_image_table_cropper-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pdf_image_table_cropper-0.1.2-py3-none-any.whl
- Upload date:
- Size: 17.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 41e82784152a0f666249a3ed0a1259185081b1b8d1d656d1232693f4079155ba |
| MD5 | 0ac0f4b9f47994a0702fc19f27bf7a10 |
| BLAKE2b-256 | 677f263ebe8841f35cc8b151fccd4c5e07049c0599c78fc593bfef2cea9a5f02 |
Provenance
The following attestation bundles were made for pdf_image_table_cropper-0.1.2-py3-none-any.whl:
Publisher: release.yml on meomeo-dev/pdf_image_table_cropper
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pdf_image_table_cropper-0.1.2-py3-none-any.whl
  - Subject digest: 41e82784152a0f666249a3ed0a1259185081b1b8d1d656d1232693f4079155ba
- Sigstore transparency entry: 1101566107
- Sigstore integration time:
- Permalink: meomeo-dev/pdf_image_table_cropper@cb3d6cc79fcb4dd924145f157927fad32426b82d
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/meomeo-dev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cb3d6cc79fcb4dd924145f157927fad32426b82d
- Trigger Event: push