CLI tool to extract embedded images from PDF, DOCX, PPTX and XLSX files.

Project description

Document-Image-Extractor

CLI tool to extract embedded images from DOCX, PDF, PPTX and XLSX files, with deduplication, size filtering, and batch export to ZIPs.

Features:

Extract images from:

DOCX (Word documents)
PDF (documents)
PPTX (PowerPoint presentations)
XLSX (Excel workbooks)

Outputs:

Creates a ZIP per input file with extracted images

Built-in helpers:

Deduplication (skips repeated images within the same document)
Size filter (min_kb, default 5 KB)
Handles “no images” and corrupt files gracefully
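
The two helpers above can be illustrated with a minimal sketch. It is not the project's actual code: the function name `filter_images`, the counter names, the use of SHA-256, the 1 KB = 1024 bytes convention, and the order of the two checks are all assumptions made for the example.

```python
import hashlib


def filter_images(images, min_kb=5, dedup=True):
    """Filter (name, data) pairs by size, then drop repeated content.

    Hypothetical helper: the real tool may hash differently or apply
    the checks in another order.
    """
    seen = set()  # content hashes already kept for this document
    kept = []
    stats = {"found": 0, "saved": 0, "duplicates": 0, "too_small": 0}
    for name, data in images:
        stats["found"] += 1
        # Size filter: assume 1 KB = 1024 bytes.
        if len(data) < min_kb * 1024:
            stats["too_small"] += 1
            continue
        # Deduplication: identical bytes produce an identical digest.
        digest = hashlib.sha256(data).hexdigest()
        if dedup and digest in seen:
            stats["duplicates"] += 1
            continue
        seen.add(digest)
        kept.append((name, data))
        stats["saved"] += 1
    return kept, stats
```

Hashing the raw bytes only catches exact copies; two visually identical images that were re-encoded would both be kept under this approach.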

Project status

This repository is being improved phase by phase.

Requirements

Python 3.12+ (recommended)

Dependencies (install from 'requirements.txt'):

python-docx
PyMuPDF
pillow

Installation

1. Clone the repository

git clone https://github.com/LeoMurilloDev/document-image-extractor.git
cd document-image-extractor

2. Create and activate a virtual environment

Windows

python -m venv .venv
.\.venv\Scripts\activate

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

Usage

Folder structure expected by the script

The script creates these folders automatically if they don't exist:

Entradas_archivos/ -> place your .docx, .pdf, .pptx and .xlsx files here
Salidas_archivos/ -> output ZIPs will be generated here
temp/ -> temporary extraction folder (auto-cleaned)

Configuration

You can customize filters without editing the code using config.json (repo root). Example:

{
  "filters": {
    "min_kb": 5,
    "min_width": 0,
    "min_height": 0
  }
}

min_kb: minimum file size in KB (default: 5)
min_width / min_height: optional dimension filters in pixels (0 disables the filter)
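
As a rough illustration of how such a file could be read, the sketch below merges the "filters" section of config.json over built-in defaults. The names `DEFAULT_FILTERS`, `load_filters` and `passes_dimensions` are hypothetical, and the defaults of 0 for the dimension filters are assumed from the example above.

```python
import json
from pathlib import Path

# Assumed defaults: min_kb comes from the README, the 0 values from the example.
DEFAULT_FILTERS = {"min_kb": 5, "min_width": 0, "min_height": 0}


def load_filters(config_path="config.json"):
    """Return the filter settings, falling back to defaults when absent."""
    filters = dict(DEFAULT_FILTERS)
    path = Path(config_path)
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            user = json.load(fh).get("filters", {})
        # Ignore unknown keys so a typo cannot inject unexpected settings.
        filters.update({k: v for k, v in user.items() if k in DEFAULT_FILTERS})
    return filters


def passes_dimensions(width, height, filters):
    """A threshold of 0 accepts every image, which disables that filter."""
    return width >= filters["min_width"] and height >= filters["min_height"]
```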

Run

python main.py

CLI usage

The tool can be run with the default folders and config, or with flags that override them:

python main.py

python main.py --input Entradas_archivos --output Salidas_archivos

python main.py --input example.pptx --output Salidas_archivos

python main.py --input Entradas_archivos --recursive

python main.py --input Entradas_archivos --min-kb 1 --min-width 100 --min-height 100

python main.py --input Entradas_archivos --no-dedup

python main.py --input Entradas_archivos --format folder

python main.py --input Entradas_archivos --log-level DEBUG --log-file logs/debug.log
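
For readers who want to see how the flags in these commands might fit together, here is a hedged argparse sketch. Only the flag names come from the commands above; the defaults, the value types, the "zip" choice for --format, and the function name `build_parser` are assumptions, not the tool's real parser.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract embedded images from DOCX, PDF, PPTX and XLSX files.",
    )
    parser.add_argument("--input", default="Entradas_archivos",
                        help="input file or folder")
    parser.add_argument("--output", default="Salidas_archivos",
                        help="folder that receives the results")
    parser.add_argument("--recursive", action="store_true",
                        help="also scan subfolders of the input folder")
    parser.add_argument("--min-kb", type=int, default=5)
    parser.add_argument("--min-width", type=int, default=0)
    parser.add_argument("--min-height", type=int, default=0)
    parser.add_argument("--no-dedup", action="store_true",
                        help="keep repeated images")
    # "folder" appears in the README; "zip" as the default is an assumption.
    parser.add_argument("--format", choices=["zip", "folder"], default="zip")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    parser.add_argument("--log-file", default=None)
    return parser
```

Calling `build_parser().parse_args(["--input", "example.pptx", "--no-dedup"])` yields a namespace with `input="example.pptx"` and `no_dedup=True`, while the remaining options keep their defaults.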

Output

For each input file, a ZIP is created in Salidas_archivos/
Example:
- Input: Entradas_archivos/report.pdf
- Output: Salidas_archivos/report.zip
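
The naming rule in this example (input stem plus .zip) can be sketched as follows. The function `zip_images` and its parameters are hypothetical; the sketch assumes the extracted images already sit in a flat folder.

```python
import zipfile
from pathlib import Path


def zip_images(input_file, image_dir, output_dir="Salidas_archivos"):
    """Pack every file in image_dir into <output_dir>/<input stem>.zip."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / (Path(input_file).stem + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for img in sorted(Path(image_dir).iterdir()):
            if img.is_file():
                # arcname keeps only the file name, not the temp folder path.
                zf.write(img, arcname=img.name)
    return zip_path
```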

What to expect

When you run the script, it prints a summary per file:

guardadas -> images saved successfully
duplicadas -> images skipped due to hash duplication
pequeñas -> images filtered out by size
encontradas -> images found inside the document

Important notes

In DOCX, images are saved with their real extension (.jpg, .png, .gif, etc.)
temp/ is cleaned even when a file fails
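
Both notes can be illustrated with a short sketch. It reads the DOCX directly as a ZIP archive (images live under word/media/ with their original extensions) instead of going through python-docx, so it is a stand-in technique and not the project's implementation. The function names are hypothetical, and a throwaway temporary directory replaces the tool's temp/ folder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def extract_docx_images(docx_path, dest_dir):
    """Copy every file under word/media/ to dest_dir, keeping its extension."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as zf:
        for member in zf.namelist():
            if member.startswith("word/media/") and not member.endswith("/"):
                target = dest / PurePosixPath(member).name
                target.write_bytes(zf.read(member))
                saved.append(target)
    return saved


def process_with_temp(docx_path, handler):
    """Extract into a temp folder, run handler, and always clean up."""
    temp = Path(tempfile.mkdtemp(prefix="temp_"))
    try:
        return handler(extract_docx_images(docx_path, temp))
    finally:
        # Runs on success and on failure, e.g. zipfile.BadZipFile for a
        # corrupt document.
        shutil.rmtree(temp, ignore_errors=True)
```

The try/finally block is what guarantees the cleanup: the temporary folder is removed whether the handler returns normally or the document turns out to be unreadable.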

Test suites

We use small test suites to validate the tool.

Documents to try

Includes:

Mixed formats (JPG/PNG/GIF)
Duplicates
Small icon filtered out by size
Corrupt files (error handling)

Manual validation steps:

Copy test files into Entradas_archivos/
Run python main.py
Verify:
- Output ZIPs exist in Salidas_archivos/
- Extensions are correct in DOCX results (.jpg, .png, .gif)
- Duplicates are removed
- temp/ is empty at the end
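
Two of these checks (ZIPs exist, temp/ is empty) are easy to script. The helper below is a hypothetical convenience and not part of the repository. It assumes the default folder names and reports a missing ZIP for every supported input, so an intentionally corrupt test file will show up in the list if the tool produces no ZIP for it.

```python
from pathlib import Path

SUPPORTED = {".docx", ".pdf", ".pptx", ".xlsx"}


def check_run(input_dir="Entradas_archivos",
              output_dir="Salidas_archivos",
              temp_dir="temp"):
    """Return a list of human-readable problems found after a run."""
    problems = []
    out = Path(output_dir)
    for src in sorted(Path(input_dir).iterdir()):
        # Each supported input should have produced <stem>.zip.
        if src.suffix.lower() in SUPPORTED and not (out / f"{src.stem}.zip").is_file():
            problems.append(f"missing ZIP for {src.name}")
    temp = Path(temp_dir)
    if temp.is_dir() and any(temp.iterdir()):
        problems.append("temp/ is not empty")
    return problems
```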

Contributing

If you want to propose changes:

Fork the repo
Create a branch
Open a PR with a clear description

Project details

Maintainers

LeoMurilloDev

Release history

This version

0.1.0

May 30, 2026

Download files

Source Distribution

document_image_extractor-0.1.0.tar.gz (16.3 kB view details)

Uploaded May 30, 2026 Source

Built Distribution

document_image_extractor-0.1.0-py3-none-any.whl (17.6 kB view details)

Uploaded May 30, 2026 Python 3

File details

Details for the file document_image_extractor-0.1.0.tar.gz.

File metadata

Download URL: document_image_extractor-0.1.0.tar.gz
Upload date: May 30, 2026
Size: 16.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for document_image_extractor-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d2848d874e86b3b029224eabb440827945cf491efd2570cf74634aab57c53662`
MD5	`1bb6aba9fe0a7c1c88c5f08f1ca94d0f`
BLAKE2b-256	`9b90c21b022131f86bd95d9896944342712f70d9e12c3b1f6cbd0d9703219ea9`

Provenance

The following attestation bundles were made for document_image_extractor-0.1.0.tar.gz:

Publisher: publish-pypi.yml on LeoMurilloDev/document-image-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: document_image_extractor-0.1.0.tar.gz
- Subject digest: d2848d874e86b3b029224eabb440827945cf491efd2570cf74634aab57c53662
- Sigstore transparency entry: 1676096211
- Sigstore integration time: May 30, 2026
Source repository:
- Permalink: LeoMurilloDev/document-image-extractor@a04ba851be97cc365fb6ada15edab42bdbb99812
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/LeoMurilloDev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@a04ba851be97cc365fb6ada15edab42bdbb99812
- Trigger Event: release

File details

Details for the file document_image_extractor-0.1.0-py3-none-any.whl.

File metadata

Download URL: document_image_extractor-0.1.0-py3-none-any.whl
Upload date: May 30, 2026
Size: 17.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for document_image_extractor-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2855c58a55aadb86293f08ecd76d593b9fb6d01ffc5ee394284d92b4e32317e7`
MD5	`13b95c93fd48aa80c0d22712df665081`
BLAKE2b-256	`4da029c28335f7f2c9648511e983aa77c3293feb0cb0ae75c0313ccd075cd14f`

Provenance

The following attestation bundles were made for document_image_extractor-0.1.0-py3-none-any.whl:

Publisher: publish-pypi.yml on LeoMurilloDev/document-image-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: document_image_extractor-0.1.0-py3-none-any.whl
- Subject digest: 2855c58a55aadb86293f08ecd76d593b9fb6d01ffc5ee394284d92b4e32317e7
- Sigstore transparency entry: 1676096218
- Sigstore integration time: May 30, 2026
Source repository:
- Permalink: LeoMurilloDev/document-image-extractor@a04ba851be97cc365fb6ada15edab42bdbb99812
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/LeoMurilloDev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@a04ba851be97cc365fb6ada15edab42bdbb99812
- Trigger Event: release
