Skip to main content

Crop image/table/code regions from PDF files and export metadata

Project description

PDF Image Table Cropper

A pip-installable CLI + SDK tool that detects and crops image / table / code / algorithm regions from PDFs, then exports JPG files and metadata.json.

No OCR. No full-text extraction.

1. Install

Recommended for development:

pip install -e .

Or standard local install:

pip install .

Recommended Python version: >=3.10.

2. Quick Start

CLI mode after installation:

pdf-image-table-cropper \
  -i /path/to/paper.pdf \
  -o ./output

Output path: ./output/<pdf_stem>/.

3. Common Examples

Export table regions only, with selected pages:

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --type table \
  --pages 1-5,8,10

Default model (docling Heron, primary detector):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --type both

Enable OpenDataLab supplementary detection (off by default):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --enable-opendatalab

Enable local daemon mode to reuse loaded models across commands:

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --daemon-mode auto \
  --daemon-idle-seconds 300

Increase rendering quality (slower):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --dpi 300

4. CLI Reference

Required:

  • -i, --input-pdf: input PDF path
  • -o, --output-dir, --storage-root: output root directory (alias)

Core optional arguments:

  • --type {both,image,table,code,algorithm} (default: both)
  • --pages all|1|1-3|1-3,7,10 (default: all)
  • --dpi (default: 200)
  • --imgsz: OpenDataLab supplementary input size (default: 1280)
  • --conf: OpenDataLab supplementary confidence threshold (default: 0.10)
  • --iou: OpenDataLab supplementary IoU threshold (default: 0.45)
  • --device cpu|mps|cuda|cuda:0... (default: auto)
  • --no-merge: disable connected-components merging (enabled by default)
  • --heron-model MODEL_ID: primary docling Heron model (enabled by default)
  • --heron-conf: Heron confidence threshold (default: 0.5)
  • --enable-opendatalab: enable OpenDataLab YOLO supplement (default: off)
  • --metadata-file (default: metadata.json)
  • --daemon-mode {off,auto,on} model lifecycle mode (default: off)
  • --daemon-socket local Unix socket path (default: /tmp/pdf_cropper_modeld_<uid>.sock)
  • --daemon-idle-seconds daemon idle timeout before auto-exit (default: 300)
  • --daemon-start-timeout daemon startup wait timeout in seconds (default: 12)
  • --daemon-run-timeout: timeout for waiting daemon job response; 0 means no timeout (default: 0)

Model download arguments:

  • --model-repo: OpenDataLab supplementary model repo (default: opendatalab/PDF-Extract-Kit-1.0)
  • --model-file: OpenDataLab supplementary model file in snapshot (default: models/Layout/YOLO/doclayout_yolo_docstructbench_imgsz1280_2501.pt)
  • --hf-cache-dir
  • --hf-token

5. Output Layout

output/
└── <pdf_stem>/
    ├── metadata.json
    ├── image/
    ├── table/
    ├── code/
    └── algorithm/

Crop file pattern:

p{page}_{type}_{idx}_{x0}_{y0}_{x1}_{y1}[_merged].jpg

Each crop item in metadata.json includes:

  • content_type
  • page_index, page_number
  • page_size_pdf
  • bbox_pdf (PDF coordinates)
  • bbox_pixels (pixel coordinates)
  • score
  • image_path
  • merged_from (if merged)

6. Tuning

  • Too many misses: lower --conf (e.g. 0.05~0.10)
  • Too many false positives: raise --conf (e.g. 0.15~0.30)
  • Need finer detail: raise --dpi (commonly 300)
  • Too slow or OOM: lower --dpi / --imgsz, or use --device cpu
  • Exporting algorithm requires --enable-opendatalab

7. License (Repository Code + Models)

Repository source code uses MIT License.

Model licenses:

  • docling-project/docling-layout-heron: Apache-2.0
  • opendatalab/PDF-Extract-Kit-1.0: AGPL-3.0(非商用协议, 注意, 按需使用)

8. Documentation Versions

  • Chinese: README.md
  • English (current): README.en.md

9. SDK Usage

from pdf_cropper import CropJobConfig, crop_pdf, crop_pdf_simple

result = crop_pdf_simple(
  input_pdf="paper.pdf",
  output_dir="./output",
  detect_type="table",
)

config = CropJobConfig(
  input_pdf="paper.pdf",
  output_dir="./output",
  detect_type="both",
  pages="all",
  dpi=200,
  enable_opendatalab=False,
)
result = crop_pdf(config)
print(result["metadata_file"])

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf_image_table_cropper-0.1.1.tar.gz (17.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pdf_image_table_cropper-0.1.1-py3-none-any.whl (17.2 kB view details)

Uploaded Python 3

File details

Details for the file pdf_image_table_cropper-0.1.1.tar.gz.

File metadata

  • Download URL: pdf_image_table_cropper-0.1.1.tar.gz
  • Upload date:
  • Size: 17.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for pdf_image_table_cropper-0.1.1.tar.gz
Algorithm Hash digest
SHA256 351a86f3355f625484ead4faeee95c6c76e4cfb99d4a4afa2c521e92f03c4ec2
MD5 1e5c45a3edfdd3a68509071aac049b28
BLAKE2b-256 4c1abfb0f42b63d2c5e47f07035c1338c404867d7120e1c0a95f56b1354447f2

See more details on using hashes here.

File details

Details for the file pdf_image_table_cropper-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for pdf_image_table_cropper-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2ca1d87d05cdfeed7f5b6dae486f80ca2363f3b56e0a295f6590effd8681239c
MD5 761763d8c9e14e99f983b1f2d29c5467
BLAKE2b-256 5708fb8bae46b22540e00330a4a142983094bd2bde295f7b224d1b3a6be9ebe8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page