Crop image/table/code regions from PDF files and export metadata

PDF Image Table Cropper

A pip-installable CLI + SDK tool that detects and crops image / table / code / algorithm regions from PDFs, then exports JPG files and metadata.json.

No OCR. No full-text extraction.

1. Install

Recommended for development:

pip install -e .

Or standard local install:

pip install .

Recommended Python version: >=3.10.

2. Quick Start

CLI mode after installation:

pdf-image-table-cropper \
  -i /path/to/paper.pdf \
  -o ./output

Output path: ./output/<pdf_stem>/.
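
As a small illustration of the output path rule above, the per-PDF directory can be derived from the input file name. This is a sketch only: `expected_output_dir` is a hypothetical helper, not part of the package, and it assumes `<pdf_stem>` is simply the file name without its extension.

```python
from pathlib import Path


def expected_output_dir(input_pdf: str, output_root: str) -> Path:
    """Return <output_root>/<pdf_stem>, mirroring the documented layout.

    Hypothetical helper: assumes <pdf_stem> is the file name minus ".pdf".
    """
    return Path(output_root) / Path(input_pdf).stem


print(expected_output_dir("/path/to/paper.pdf", "./output"))
```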

3. Common Examples

Export table regions only, with selected pages:

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --type table \
  --pages 1-5,8,10
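
To make the `--pages` syntax concrete, here is a minimal sketch of how a spec such as `1-5,8,10` could be expanded into page numbers. `parse_pages` is a hypothetical function written for illustration; the tool's own parser may differ, in particular in how it treats pages beyond the end of the document (this sketch silently drops them).

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand "all", "3", or "1-5,8,10" into sorted 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, page_count + 1))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages past the end of the document are ignored.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


print(parse_pages("1-5,8,10", page_count=12))
```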

Default model (docling Heron, primary detector):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --type both

Enable OpenDataLab supplementary detection (off by default):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --enable-opendatalab

Enable local daemon mode to reuse loaded models across commands:

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --daemon-mode auto \
  --daemon-idle-seconds 300

Increase rendering quality (slower):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --dpi 300
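
The cost of a higher `--dpi` can be estimated with simple arithmetic. PDF page sizes are measured in points (1/72 inch), so a page rendered at a given DPI has roughly `size_pt * dpi / 72` pixels per side. The sketch below assumes the tool rasterizes pages this way; `rendered_size` is a hypothetical helper.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at `dpi` (1 pt = 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
w200, h200 = rendered_size(612, 792, 200)
w300, h300 = rendered_size(612, 792, 300)
print((w200, h200), (w300, h300))
print("pixel ratio 300 vs 200 dpi:", (w300 * h300) / (w200 * h200))
```

Going from 200 to 300 DPI multiplies the pixel count by (300/200)^2 = 2.25, which is why rendering and detection get noticeably slower.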

4. CLI Reference

Required:

  • -i, --input-pdf: input PDF path
  • -o, --output-dir, --storage-root: output root directory (--storage-root is an alias of --output-dir)

Core optional arguments:

  • --type {both,image,table,code,algorithm} (default: both)
  • --pages all|1|1-3|1-3,7,10 (default: all)
  • --dpi (default: 200)
  • --imgsz: OpenDataLab supplementary input size (default: 1280)
  • --conf: OpenDataLab supplementary confidence threshold (default: 0.10)
  • --iou: OpenDataLab supplementary IoU threshold (default: 0.45)
  • --device cpu|mps|cuda|cuda:0... (default: auto)
  • --no-merge: disable connected-components merging (enabled by default)
  • --heron-model MODEL_ID: primary docling Heron model (enabled by default)
  • --heron-conf: Heron confidence threshold (default: 0.5)
  • --enable-opendatalab: enable OpenDataLab YOLO supplement (default: off)
  • --metadata-file (default: metadata.json)
  • --daemon-mode {off,auto,on}: model lifecycle mode (default: off)
  • --daemon-socket: local Unix socket path (default: /tmp/pdf_cropper_modeld_<uid>.sock)
  • --daemon-idle-seconds: daemon idle timeout in seconds before auto-exit (default: 300)
  • --daemon-start-timeout: daemon startup wait timeout in seconds (default: 12)
  • --daemon-run-timeout: timeout for waiting daemon job response; 0 means no timeout (default: 0)

Model download arguments:

  • --model-repo: OpenDataLab supplementary model repo (default: opendatalab/PDF-Extract-Kit-1.0)
  • --model-file: OpenDataLab supplementary model file in snapshot (default: models/Layout/YOLO/doclayout_yolo_docstructbench_imgsz1280_2501.pt)
  • --hf-cache-dir
  • --hf-token

5. Output Layout

output/
└── <pdf_stem>/
    ├── metadata.json
    ├── image/
    ├── table/
    ├── code/
    └── algorithm/

Crop file pattern:

p{page}_{type}_{idx}_{x0}_{y0}_{x1}_{y1}[_merged].jpg
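
The file pattern above can be parsed back into its fields with a regular expression. This is a sketch under assumptions: it treats the page, index, and coordinates as non-negative integers, and the example file name is made up for illustration. `parse_crop_name` is a hypothetical helper.

```python
import re

# Assumption: page, idx, and coordinates are non-negative integers.
CROP_NAME = re.compile(
    r"^p(?P<page>\d+)_(?P<type>image|table|code|algorithm)_(?P<idx>\d+)"
    r"_(?P<x0>\d+)_(?P<y0>\d+)_(?P<x1>\d+)_(?P<y1>\d+)(?P<merged>_merged)?\.jpg$"
)


def parse_crop_name(name: str) -> dict:
    """Split a crop file name into page, type, index, bbox, and merged flag."""
    match = CROP_NAME.match(name)
    if match is None:
        raise ValueError(f"not a crop file name: {name!r}")
    fields = match.groupdict()
    return {
        "page": int(fields["page"]),
        "type": fields["type"],
        "idx": int(fields["idx"]),
        "bbox": tuple(int(fields[key]) for key in ("x0", "y0", "x1", "y1")),
        "merged": fields["merged"] is not None,
    }


print(parse_crop_name("p3_table_0_120_340_980_760_merged.jpg"))
```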

Each crop item in metadata.json includes:

  • content_type
  • page_index, page_number
  • page_size_pdf
  • bbox_pdf (PDF coordinates)
  • bbox_pixels (pixel coordinates)
  • score
  • image_path
  • merged_from (if merged)
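
The merged_from field and the `_merged` file suffix relate to the connected-components merging step (on by default, disabled with --no-merge). The tool's actual algorithm is not described here, so the following is only a generic sketch of the idea: treat boxes that overlap as connected, then replace each connected group with its bounding union. The overlap rule, the function names, and the output shape are all assumptions.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_connected(boxes: list[Box]) -> list[dict]:
    """Union overlapping boxes per connected component (illustrative only)."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)

    merged = []
    for members in groups.values():
        x0s, y0s, x1s, y1s = zip(*(boxes[i] for i in members))
        merged.append({
            "bbox": (min(x0s), min(y0s), max(x1s), max(y1s)),
            "merged_from": members,  # indices of the source boxes (assumed format)
        })
    return merged


print(merge_connected([(0, 0, 10, 10), (5, 5, 20, 20), (100, 100, 110, 110)]))
```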

6. Tuning

  • Too many misses: lower --conf (e.g. 0.05~0.10)
  • Too many false positives: raise --conf (e.g. 0.15~0.30)
  • Need finer detail: raise --dpi (commonly 300)
  • Too slow or OOM: lower --dpi / --imgsz, or use --device cpu
  • Exporting algorithm requires --enable-opendatalab
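
The effect of a confidence threshold can be previewed on already exported results by filtering on the score field. The sketch below works on an in-memory list of crop items with made-up scores; how the items are nested inside metadata.json is not specified here, so loading them is left out. `filter_by_score` is a hypothetical helper.

```python
def filter_by_score(items: list[dict], min_score: float) -> list[dict]:
    """Keep only the crop items whose score is at least `min_score`."""
    return [item for item in items if item["score"] >= min_score]


# Made-up crop items using the documented field names.
items = [
    {"content_type": "table", "score": 0.92},
    {"content_type": "image", "score": 0.27},
    {"content_type": "table", "score": 0.12},
    {"content_type": "image", "score": 0.07},
]

for threshold in (0.05, 0.10, 0.30):
    kept = filter_by_score(items, threshold)
    print(f"threshold {threshold:.2f}: {len(kept)} of {len(items)} kept")
```

A lower threshold keeps more candidates (fewer misses, more false positives); a higher one does the opposite.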

7. License (Repository Code + Models)

Repository source code uses MIT License.

Model licenses:

  • docling-project/docling-layout-heron: Apache-2.0
  • opendatalab/PDF-Extract-Kit-1.0: AGPL-3.0 (treated as a non-commercial license; take note and use only as needed)

8. Documentation Versions

  • Chinese: README.md
  • English (current): README.en.md

9. SDK Usage

from pdf_cropper import CropJobConfig, crop_pdf, crop_pdf_simple

result = crop_pdf_simple(
  input_pdf="paper.pdf",
  output_dir="./output",
  detect_type="table",
)

config = CropJobConfig(
  input_pdf="paper.pdf",
  output_dir="./output",
  detect_type="both",
  pages="all",
  dpi=200,
  enable_opendatalab=False,
)
result = crop_pdf(config)
print(result["metadata_file"])
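
For several PDFs, the SDK call can be wrapped in a loop. This is a sketch rather than part of the package: `crop_many` is a hypothetical helper, and it assumes that `crop_pdf_simple` returns a mapping with a "metadata_file" key in the same way the `crop_pdf` result above does. The `crop_fn` parameter exists so the loop can be tried with a stand-in function.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def crop_many(
    pdf_paths: Iterable[str],
    output_dir: str,
    detect_type: str = "both",
    crop_fn: Optional[Callable[..., dict]] = None,
) -> dict[str, str]:
    """Crop each PDF and map its path to the reported metadata file."""
    if crop_fn is None:
        # Imported lazily so the helper can be defined without the package.
        from pdf_cropper import crop_pdf_simple
        crop_fn = crop_pdf_simple

    results: dict[str, str] = {}
    for pdf in pdf_paths:
        result = crop_fn(
            input_pdf=str(pdf),
            output_dir=str(output_dir),
            detect_type=detect_type,
        )
        # Assumption: the result exposes "metadata_file", as crop_pdf does.
        results[str(pdf)] = str(result["metadata_file"])
    return results


def fake_crop(input_pdf: str, output_dir: str, detect_type: str) -> dict:
    """Stand-in for crop_pdf_simple so the loop can run without models."""
    return {"metadata_file": str(Path(output_dir) / Path(input_pdf).stem / "metadata.json")}


print(crop_many(["a.pdf", "b.pdf"], "./output", detect_type="table", crop_fn=fake_crop))
```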
