CLI tool to extract embedded images from PDF, DOCX, PPTX and XLSX files.

Document-Image-Extractor

CLI tool to extract embedded images from DOCX, PDF, PPTX and XLSX files, with deduplication, size filtering, and batch export to ZIPs.


Features:

Extract images from:

  • DOCX (Word documents)
  • PDF (documents)
  • PPTX (PowerPoint presentations)
  • XLSX (Excel workbooks)

Outputs:

  • Creates a ZIP per input file with extracted images

Built-in helpers:

  • Deduplication (skips repeated images within the same document)
  • Size filter (min_kb defaults to 5 KB)
  • Handles “no images” and corrupt files gracefully
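
The deduplication and size filter can be pictured with a minimal sketch. It is not the project's actual code: the function name filter_images, the use of SHA-256, the 1 KB = 1024 bytes convention, and the order of the two checks are all assumptions made for illustration.

```python
import hashlib


def filter_images(blobs, min_kb=5, dedup=True):
    """Split raw image bytes into kept images, duplicates, and too-small images.

    Hypothetical helper: the hash algorithm and the check order are assumptions.
    """
    seen = set()
    kept, duplicates, too_small = [], 0, 0
    for data in blobs:
        # Size filter: drop anything below the threshold (1 KB taken as 1024 bytes).
        if len(data) < min_kb * 1024:
            too_small += 1
            continue
        # Deduplication: identical bytes produce the same digest.
        digest = hashlib.sha256(data).hexdigest()
        if dedup and digest in seen:
            duplicates += 1
            continue
        seen.add(digest)
        kept.append(data)
    return kept, duplicates, too_small
```

Hashing the raw bytes only catches exact copies; the same picture re-encoded at a different quality would count as a new image in this sketch.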

Project status

This repository is being improved phase by phase.


Requirements

  • Python 3.12+ (recommended)

Dependencies (install from 'requirements.txt'):

  • python-docx
  • PyMuPDF
  • pillow

Installation

1. Clone the repository

git clone https://github.com/LeoMurilloDev/document-image-extractor.git
cd document-image-extractor 

2. Create and activate a virtual environment

Windows

python -m venv .venv
.\.venv\Scripts\activate

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

Usage

Folder structure expected by the script

The script creates these folders automatically if they don't exist:

  • Entradas_archivos/ -> place your input files here (.docx, .pdf, .pptx, .xlsx)
  • Salidas_archivos/ -> output ZIPs will be generated here
  • temp/ -> temporary extraction folder (auto-cleaned)

Configuration

You can customize filters without editing the code using config.json (repo root). Example:

{
  "filters": {
    "min_kb": 5,
    "min_width": 0,
    "min_height": 0
  }
}
  • min_kb: minimum file size in KB (default: 5)
  • min_width / min_height: optional dimension filters in pixels (0 disables the filter)
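
As a rough illustration of how such a file could be read, the sketch below merges the "filters" section of config.json over a set of defaults. The names load_filters, passes_dimensions and DEFAULT_FILTERS are hypothetical, and treating 0 as the default for both dimension filters is an assumption based on the example above.

```python
import json
from pathlib import Path

# Assumed defaults: min_kb comes from the README, the zeros from the example config.
DEFAULT_FILTERS = {"min_kb": 5, "min_width": 0, "min_height": 0}


def load_filters(config_path="config.json"):
    """Return the filter settings, falling back to defaults for missing keys."""
    filters = dict(DEFAULT_FILTERS)
    path = Path(config_path)
    if path.is_file():
        with path.open(encoding="utf-8") as handle:
            user_filters = json.load(handle).get("filters", {})
        # Ignore keys this sketch does not know about.
        filters.update({k: v for k, v in user_filters.items() if k in DEFAULT_FILTERS})
    return filters


def passes_dimensions(width, height, filters):
    """A threshold of 0 accepts every image, which is how 0 disables the check."""
    return width >= filters["min_width"] and height >= filters["min_height"]
```

With this shape, a config.json that only sets min_kb still yields a complete settings dictionary.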

Run

python main.py

CLI usage

The tool can be run with the default folders and config:

python main.py

Use custom input and output folders:

python main.py --input Entradas_archivos --output Salidas_archivos

Process a single file:

python main.py --input example.pptx --output Salidas_archivos

Search the input folder recursively:

python main.py --input Entradas_archivos --recursive

Override the size and dimension filters:

python main.py --input Entradas_archivos --min-kb 1 --min-width 100 --min-height 100

Disable deduplication:

python main.py --input Entradas_archivos --no-dedup

Export to a folder instead of a ZIP:

python main.py --input Entradas_archivos --format folder

Set the log level and write logs to a file:

python main.py --input Entradas_archivos --log-level DEBUG --log-file logs/debug.log
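
For readers who want to see how flags like these are typically wired up, here is a hedged argparse sketch. It mirrors only the flag names shown in the commands above; the defaults, the argument types, and the "zip" value for --format are assumptions, and the real main.py may be organized differently.

```python
import argparse


def build_parser():
    """Hypothetical parser covering the flags shown in the README commands."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract embedded images from DOCX, PDF, PPTX and XLSX files.",
    )
    parser.add_argument("--input", default="Entradas_archivos",
                        help="input file or folder")
    parser.add_argument("--output", default="Salidas_archivos",
                        help="folder where results are written")
    parser.add_argument("--recursive", action="store_true",
                        help="also scan subfolders of the input folder")
    # None means "not given on the command line, use config.json or defaults".
    parser.add_argument("--min-kb", type=int, default=None)
    parser.add_argument("--min-width", type=int, default=None)
    parser.add_argument("--min-height", type=int, default=None)
    parser.add_argument("--no-dedup", action="store_true",
                        help="keep repeated images")
    # "folder" appears in the README; "zip" as the other choice is assumed.
    parser.add_argument("--format", choices=["zip", "folder"], default="zip")
    parser.add_argument("--log-level", default="INFO")
    parser.add_argument("--log-file", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Leaving the filter flags at None when they are omitted makes it possible to tell "not given" apart from an explicit value, so command-line options can take precedence over config.json.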

Output

  • For each input file, a ZIP is created in Salidas_archivos/
  • Example:
    • Input: Entradas_archivos/report.pdf
    • Output: Salidas_archivos/report.zip
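
The input-to-output naming in this example can be sketched as follows. The helper names output_zip_path and write_zip are hypothetical; only the "same base name, .zip extension" rule comes from the example above.

```python
import zipfile
from pathlib import Path


def output_zip_path(input_file, output_dir="Salidas_archivos"):
    """Map an input document to its ZIP: same base name, .zip extension."""
    return Path(output_dir) / (Path(input_file).stem + ".zip")


def write_zip(zip_path, images):
    """Write a mapping of {archive name: image bytes} into a new ZIP file."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in images.items():
            archive.writestr(name, data)
    return zip_path
```

Under this naming rule, report.pdf and report.docx in the same folder would both map to report.zip. How the tool itself resolves that case is not stated here.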

What to expect

When you run the script, it prints a summary per file:

  • guardadas -> images saved successfully
  • duplicadas -> images skipped due to hash duplication
  • pequeñas -> images filtered out by size
  • encontradas -> images found inside the document

Important notes

  • In DOCX, images are saved using their real extension (.jpg, .png, .gif, etc.)
  • temp/ is cleaned even when a file fails
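
Both notes can be illustrated with one small sketch. A DOCX file is a ZIP archive whose embedded images live under word/media/ with their original file names, so the real extension is available directly. The sketch uses the standard-library zipfile module rather than python-docx, and the function name extract_docx_images is hypothetical; it shows the try/finally cleanup pattern, not the project's actual implementation.

```python
import shutil
import zipfile
from pathlib import Path


def extract_docx_images(docx_path, temp_dir="temp"):
    """Return {file name: bytes} for the images embedded in a DOCX file."""
    temp = Path(temp_dir)
    temp.mkdir(parents=True, exist_ok=True)
    try:
        images = {}
        with zipfile.ZipFile(docx_path) as archive:
            for member in archive.namelist():
                if member.startswith("word/media/") and not member.endswith("/"):
                    # Keep the original file name, and with it the real extension.
                    target = temp / Path(member).name
                    target.write_bytes(archive.read(member))
                    images[target.name] = target.read_bytes()
        return images
    finally:
        # Runs on success and on failure (for example zipfile.BadZipFile).
        # Note: this sketch deletes the whole temp folder, including old contents.
        shutil.rmtree(temp, ignore_errors=True)
```

Because the cleanup sits in a finally block, a corrupt input still raises its error to the caller, but no temporary files are left behind.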

Test suites

We use small test suites to validate the extractor.

Documents to try

Includes:

  • Mixed formats (JPG/PNG/GIF)
  • Duplicates
  • Small icon filtered out by size
  • Corrupt files (error handling)

Manual validation steps:

  1. Copy test files into Entradas_archivos/
  2. Run python main.py
  3. Verify:
    • Output ZIPs exist in Salidas_archivos/
    • Extensions are correct in DOCX results (.jpg, .png, .gif)
    • Duplicates are removed
    • temp/ is empty at the end
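
Part of the verification step could be automated with a short script like the one below. This is a sketch under assumptions: the function name validate_run is hypothetical, duplicates are detected by comparing SHA-256 digests of the ZIP members, and a missing ZIP is reported even though it may be expected for corrupt or image-free inputs.

```python
import hashlib
import zipfile
from pathlib import Path

SUPPORTED = {".docx", ".pdf", ".pptx", ".xlsx"}


def validate_run(input_dir="Entradas_archivos", output_dir="Salidas_archivos",
                 temp_dir="temp"):
    """Return a list of human-readable problems found after a run."""
    problems = []
    for src in sorted(Path(input_dir).iterdir()):
        if src.suffix.lower() not in SUPPORTED:
            continue
        zip_path = Path(output_dir) / f"{src.stem}.zip"
        if not zip_path.is_file():
            # May be legitimate for corrupt files; flagged for manual review.
            problems.append(f"no ZIP for {src.name}")
            continue
        with zipfile.ZipFile(zip_path) as archive:
            digests = [hashlib.sha256(archive.read(name)).hexdigest()
                       for name in archive.namelist()]
        if len(digests) != len(set(digests)):
            problems.append(f"duplicate images in {zip_path.name}")
    temp = Path(temp_dir)
    if temp.is_dir() and any(temp.iterdir()):
        problems.append("temp folder is not empty")
    return problems


if __name__ == "__main__":
    if Path("Entradas_archivos").is_dir():
        for problem in validate_run():
            print(problem)
```

An empty list means every supported input has a ZIP, no ZIP contains byte-identical images, and temp/ is empty or absent. Checking the extensions inside each ZIP is still left to manual inspection.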

Contributing

If you want to propose changes:

  1. Fork the repo
  2. Create a branch
  3. Open a PR with a clear description
