Crop image/table/code regions from PDF files and export metadata

PDF Image Table Cropper

A pip-installable CLI + SDK tool that detects and crops image / table / code / algorithm regions from PDFs, then exports JPG files and metadata.json.

No OCR. No full-text extraction.

1. Install

Install from PyPI (recommended for production):

pip install pdf-image-table-cropper

Recommended for development:

pip install -e .

Or standard local install:

pip install .

Recommended Python version: >=3.10.

2. Quick Start

CLI mode after installation:

pdf-image-table-cropper \
  -i /path/to/paper.pdf \
  -o ./output

Output path: ./output/<pdf_stem>/.
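
As a small illustration of the path rule above, the sketch below derives the directory a run is expected to write to. The helper name `expected_output_dir` is hypothetical and not part of the package, and it assumes `<pdf_stem>` means the input file name without its extension.

```python
from pathlib import Path


def expected_output_dir(input_pdf: str, output_root: str) -> Path:
    """Return <output_root>/<pdf_stem>.

    Assumption: the stem is the file name without its final suffix.
    """
    return Path(output_root) / Path(input_pdf).stem


# Print the directory expected for the Quick Start command.
print(expected_output_dir("/path/to/paper.pdf", "./output"))
```

This can help a wrapper script locate metadata.json after the CLI finishes, without parsing the tool's console output.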

3. Common Examples

Export table regions only, with selected pages:

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --type table \
  --pages 1-5,8,10
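
To make the --pages syntax concrete, here is a minimal sketch of how such a specification could be expanded into page numbers. It is not the tool's own parser: the function name `parse_pages`, the 1-based numbering, and the clamping of ranges to the document length are all assumptions made for illustration.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand 'all', '3', '1-3' or '1-3,7,10' into sorted 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, page_count + 1))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages beyond the end of the document are silently dropped.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


print(parse_pages("1-5,8,10", page_count=12))  # [1, 2, 3, 4, 5, 8, 10]
```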

Default model (docling Heron, primary detector):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --type both

Enable OpenDataLab supplementary detection (off by default):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --enable-opendatalab

Enable local daemon mode to reuse loaded models across commands:

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --daemon-mode auto \
  --daemon-idle-seconds 300

Increase rendering quality (slower):

pdf-image-table-cropper \
  -i paper.pdf \
  -o ./output \
  --dpi 300
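
The cost of a higher --dpi can be estimated with simple arithmetic. The sketch below assumes pages are rasterized in the usual way, with PDF units of 1/72 inch, so the pixel size is the size in points times dpi / 72; whether the tool rounds exactly like this is an assumption. `rendered_size` is a hypothetical helper.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel size of a page rendered at `dpi`, with 72 PDF points per inch."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
for dpi in (200, 300):
    width_px, height_px = rendered_size(612, 792, dpi)
    print(dpi, width_px, height_px, width_px * height_px)
```

Under these assumptions, going from 200 to 300 DPI multiplies the pixel count per page by 2.25, which is why the higher setting is slower and uses more memory.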

4. CLI Reference

Required:

  • -i, --input-pdf: input PDF path
  • -o, --output-dir, --storage-root: output root directory (the three spellings are aliases)

Core optional arguments:

  • --type {both,image,table,code,algorithm} (default: both)
  • --pages all|1|1-3|1-3,7,10 (default: all)
  • --dpi (default: 200)
  • --imgsz: OpenDataLab supplementary input size (default: 1280)
  • --conf: OpenDataLab supplementary confidence threshold (default: 0.10)
  • --iou: OpenDataLab supplementary IoU threshold (default: 0.45)
  • --device cpu|mps|cuda|cuda:0... (default: auto)
  • --no-merge: disable connected-components merging (enabled by default)
  • --heron-model MODEL_ID: primary docling Heron model (enabled by default)
  • --heron-conf: Heron confidence threshold (default: 0.5)
  • --enable-opendatalab: enable OpenDataLab YOLO supplement (default: off)
  • --metadata-file (default: metadata.json)
  • --daemon-mode {off,auto,on}: model lifecycle mode (default: off)
  • --daemon-socket: local Unix socket path (default: /tmp/pdf_cropper_modeld_<uid>.sock)
  • --daemon-idle-seconds: daemon idle timeout before auto-exit (default: 300)
  • --daemon-start-timeout: daemon startup wait timeout in seconds (default: 12)
  • --daemon-run-timeout: timeout while waiting for a daemon job response; 0 means no timeout (default: 0)
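
When the CLI is driven from a script, the flags above can be assembled into an argument list for `subprocess`. This is a sketch only: `build_command` is a hypothetical helper, it covers just a few of the documented options, and it uses the documented defaults as its own defaults.

```python
import shlex


def build_command(
    input_pdf: str,
    output_dir: str,
    *,
    detect_type: str = "both",
    pages: str = "all",
    dpi: int = 200,
    enable_opendatalab: bool = False,
) -> list[str]:
    """Build an argv list for pdf-image-table-cropper from documented flags."""
    cmd = [
        "pdf-image-table-cropper",
        "-i", input_pdf,
        "-o", output_dir,
        "--type", detect_type,
        "--pages", pages,
        "--dpi", str(dpi),
    ]
    if enable_opendatalab:
        # Off by default, so the flag is only added on request.
        cmd.append("--enable-opendatalab")
    return cmd


cmd = build_command("paper.pdf", "./output", detect_type="table", pages="1-5,8,10")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.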

Model download arguments:

  • --model-repo: OpenDataLab supplementary model repo (default: opendatalab/PDF-Extract-Kit-1.0)
  • --model-file: OpenDataLab supplementary model file in snapshot (default: models/Layout/YOLO/doclayout_yolo_docstructbench_imgsz1280_2501.pt)
  • --hf-cache-dir
  • --hf-token

5. Output Layout

output/
└── <pdf_stem>/
    ├── metadata.json
    ├── image/
    ├── table/
    ├── code/
    └── algorithm/

Crop file pattern:

p{page}_{type}_{idx}_{x0}_{y0}_{x1}_{y1}[_merged].jpg
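
The file pattern can be parsed back into its fields with a regular expression. The sketch below assumes every numeric field is a non-negative integer and that the type is a lowercase word; the README does not state either, and the sample file name is made up. `parse_crop_name` is a hypothetical helper.

```python
import re
from typing import Optional

# Mirrors p{page}_{type}_{idx}_{x0}_{y0}_{x1}_{y1}[_merged].jpg
CROP_NAME = re.compile(
    r"^p(?P<page>\d+)_(?P<type>[a-z]+)_(?P<idx>\d+)"
    r"_(?P<x0>\d+)_(?P<y0>\d+)_(?P<x1>\d+)_(?P<y1>\d+)"
    r"(?P<merged>_merged)?\.jpg$"
)


def parse_crop_name(name: str) -> Optional[dict]:
    """Split a crop file name into its fields, or return None if it does not match."""
    match = CROP_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    merged = fields.pop("merged") is not None
    content_type = fields.pop("type")
    parsed = {key: int(value) for key, value in fields.items()}
    parsed["type"] = content_type
    parsed["merged"] = merged
    return parsed


# Hypothetical file name, for illustration only.
print(parse_crop_name("p3_table_0_120_340_980_760_merged.jpg"))
```

For anything beyond quick scripting, metadata.json is likely the more reliable source, since it carries the same information without depending on file-name conventions.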

Each crop item in metadata.json includes:

  • content_type
  • page_index, page_number
  • page_size_pdf
  • bbox_pdf (PDF coordinates)
  • bbox_pixels (pixel coordinates)
  • score
  • image_path
  • merged_from (if merged)
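
A typical first step with these fields is to group crops by type. The sketch below works on a plain list of crop-item dictionaries that carry the `content_type` and `image_path` keys listed above. How that list is nested inside metadata.json is not documented here, so locating it after `json.load` is left to the reader; the sample items are invented.

```python
from collections import defaultdict


def group_by_type(items: list[dict]) -> dict[str, list[str]]:
    """Map each content_type to the image paths of its crops, in input order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for item in items:
        grouped[item["content_type"]].append(item["image_path"])
    return dict(grouped)


# Invented sample items using only fields named in the list above.
sample_items = [
    {"content_type": "table", "page_number": 1, "score": 0.91, "image_path": "table/a.jpg"},
    {"content_type": "image", "page_number": 2, "score": 0.77, "image_path": "image/b.jpg"},
    {"content_type": "table", "page_number": 3, "score": 0.64, "image_path": "table/c.jpg"},
]
print(group_by_type(sample_items))
```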

6. Tuning

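  • Too many misses: lower --conf (e.g. 0.05 to 0.10)
  • Too many false positives: raise --conf (e.g. 0.15 to 0.30)
  • Need finer detail: raise --dpi (commonly 300)
  • Too slow or out of memory (OOM): lower --dpi / --imgsz, or use --device cpu
  • Exporting algorithm regions requires --enable-opendatalab
  • --conf, --iou and --imgsz apply only to the OpenDataLab supplement (--enable-opendatalab); the primary Heron detector's threshold is --heron-conf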
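
Before re-running with a higher threshold, the `score` values from an existing run can be used to estimate how many crops would survive. The sketch below assumes `score` is the detector confidence between 0 and 1; `survivors_by_threshold` is a hypothetical helper and the scores are invented. It can only simulate raising a threshold, because detections below the threshold used in the original run were never exported.

```python
def survivors_by_threshold(scores: list[float], thresholds: list[float]) -> dict[float, int]:
    """Count how many scores are at or above each candidate threshold."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Invented scores, for illustration only.
scores = [0.04, 0.08, 0.12, 0.20, 0.35, 0.90]
print(survivors_by_threshold(scores, [0.05, 0.10, 0.15, 0.30]))
# {0.05: 5, 0.1: 4, 0.15: 3, 0.3: 2}
```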

7. License (Repository Code + Models)

The repository source code is released under the MIT License.

Model licenses:

  • docling-project/docling-layout-heron: Apache-2.0
  • opendatalab/PDF-Extract-Kit-1.0: AGPL-3.0 (noted by the authors as a non-commercial license; take note and enable it only as needed)

8. Documentation Versions

  • Chinese: README.md
  • English (current): README.en.md

9. SDK Usage

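from pdf_cropper import CropJobConfig, crop_pdf, crop_pdf_simple

# Simple entry point: crop only table regions, leaving other settings at their defaults.
result = crop_pdf_simple(
    input_pdf="paper.pdf",
    output_dir="./output",
    detect_type="table",
)

# Explicit configuration; the values shown match the CLI defaults listed in the CLI Reference.
config = CropJobConfig(
    input_pdf="paper.pdf",
    output_dir="./output",
    detect_type="both",
    pages="all",
    dpi=200,
    enable_opendatalab=False,
)
result = crop_pdf(config)

# The result is indexed by key; "metadata_file" refers to the exported metadata file.
print(result["metadata_file"])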
