CLI tool to extract embedded images from PDF, DOCX, PPTX and XLSX files.
Document-Image-Extractor
CLI tool to extract embedded images from DOCX, PDF, PPTX and XLSX files, with deduplication, size filtering, and batch export to ZIPs.
Features:
Extract images from:
- DOCX (Word documents)
- PDF (documents)
- PPTX (PowerPoint documents)
- XLSX (Excel documents)
Outputs:
- Creates a ZIP per input file with extracted images
Built-in helpers:
- Deduplication (skips repeated images within the same document)
- Size filter (min_kb default is 5 KB)
- Handles “no images” and corrupt files gracefully
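The two helpers above can be sketched in a few lines. This is only an illustration, not the tool's actual code: the function name filter_images is hypothetical, SHA-256 is an assumed hash choice, and 1 KB is assumed to mean 1024 bytes.

```python
import hashlib


def filter_images(images, min_kb=5, dedup=True):
    """Return the images that pass the size filter and are not repeats.

    `images` is an iterable of (name, data) pairs, where data is bytes.
    Hypothetical helper: the real tool may organise this differently.
    """
    seen = set()  # SHA-256 digests of images already kept
    kept = []
    for name, data in images:
        if len(data) < min_kb * 1024:
            continue  # too small, for example an icon
        digest = hashlib.sha256(data).hexdigest()
        if dedup and digest in seen:
            continue  # identical bytes were already kept for this document
        seen.add(digest)
        kept.append((name, data))
    return kept
```

Hashing the raw bytes means two images count as duplicates only when their contents are byte-for-byte identical, regardless of file name.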
Project status
This repository is being improved phase by phase.
Requirements
- Python 3.12+ (recommended)
Dependencies (install from 'requirements.txt'):
- python-docx
- PyMuPDF
- pillow
Installation
1. Clone the repository
git clone https://github.com/LeoMurilloDev/document-image-extractor.git
cd document-image-extractor
2. Create and activate a virtual environment
Windows
python -m venv .venv
.\.venv\Scripts\activate
macOS / Linux
python3 -m venv .venv
source .venv/bin/activate
3. Install dependencies
pip install -r requirements.txt
Usage
Folder structure expected by the script
The script creates these folders automatically if they don't exist:
- Entradas_archivos/ -> place your input files (.docx, .pdf, .pptx, .xlsx) here
- Salidas_archivos/ -> output ZIPs will be generated here
- temp/ -> temporary extraction folder (auto-cleaned)
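As a rough sketch of how such folders can be created on demand, assuming they live under the directory the script is run from (ensure_folders is a hypothetical name, not taken from the project):

```python
from pathlib import Path

# Folder names as listed in this README.
FOLDERS = ("Entradas_archivos", "Salidas_archivos", "temp")


def ensure_folders(root="."):
    """Create the working folders under `root` if they are missing."""
    created = []
    for name in FOLDERS:
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat on every run.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```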
Configuration
You can customize filters without editing the code using config.json (repo root).
Example:
{
"filters": {
"min_kb": 5,
"min_width": 0,
"min_height": 0
}
}
- min_kb: minimum file size in KB (default: 5)
- min_width / min_height: optional dimension filter (0 disables it)
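One way such a file could be read is sketched below. The names load_filters and passes_dimensions are hypothetical, and falling back to the defaults when config.json is missing is an assumption rather than documented behaviour.

```python
import json
from pathlib import Path

# Defaults taken from the example above.
DEFAULT_FILTERS = {"min_kb": 5, "min_width": 0, "min_height": 0}


def load_filters(config_path="config.json"):
    """Merge the "filters" section of config.json over the defaults."""
    filters = dict(DEFAULT_FILTERS)
    path = Path(config_path)
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            user = json.load(fh).get("filters", {})
        # Ignore keys this sketch does not know about.
        filters.update({k: v for k, v in user.items() if k in DEFAULT_FILTERS})
    return filters


def passes_dimensions(width, height, filters):
    """A minimum of 0 always passes, which is how 0 disables the filter."""
    return width >= filters["min_width"] and height >= filters["min_height"]
```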
Run
python main.py
CLI usage
The tool can be run with the default folders and config, or with command-line options that override them:
python main.py
python main.py --input Entradas_archivos --output Salidas_archivos
python main.py --input example.pptx --output Salidas_archivos
python main.py --input Entradas_archivos --recursive
python main.py --input Entradas_archivos --min-kb 1 --min-width 100 --min-height 100
python main.py --input Entradas_archivos --no-dedup
python main.py --input Entradas_archivos --format folder
python main.py --input Entradas_archivos --log-level DEBUG --log-file logs/debug.log
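For readers curious how these flags might map onto a parser, here is a minimal argparse sketch. The flag names come from the commands above; the defaults, the help texts, and the "zip" value for --format are assumptions and may differ from the real main.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract embedded images from documents.",
    )
    parser.add_argument("--input", default="Entradas_archivos",
                        help="input file or folder")
    parser.add_argument("--output", default="Salidas_archivos",
                        help="folder for the exported images")
    parser.add_argument("--recursive", action="store_true",
                        help="also scan subfolders of the input folder")
    # None means "not given on the command line", so config.json can apply.
    parser.add_argument("--min-kb", type=float, default=None)
    parser.add_argument("--min-width", type=int, default=None)
    parser.add_argument("--min-height", type=int, default=None)
    parser.add_argument("--no-dedup", dest="dedup", action="store_false",
                        help="keep repeated images")
    # Only "folder" appears in this README; "zip" is an assumed default.
    parser.add_argument("--format", choices=("zip", "folder"), default="zip")
    parser.add_argument("--log-level", default="INFO")
    parser.add_argument("--log-file", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```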
Output
- For each input file, a ZIP is created in Salidas_archivos/
- Example:
- Input: Entradas_archivos/report.pdf
- Output: Salidas_archivos/report.zip
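The naming rule in this example (same base name, .zip extension) can be sketched as follows. Both function names are hypothetical, and storing the images at the top level of the archive is an assumption.

```python
import zipfile
from pathlib import Path


def zip_path_for(input_file, output_dir="Salidas_archivos"):
    """Map Entradas_archivos/report.pdf to Salidas_archivos/report.zip."""
    return Path(output_dir) / (Path(input_file).stem + ".zip")


def write_zip(image_dir, zip_path):
    """Pack every file found directly in image_dir into zip_path."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for image in sorted(Path(image_dir).iterdir()):
            if image.is_file():
                # arcname drops the folder part so the ZIP stays flat.
                archive.write(image, arcname=image.name)
    return zip_path
```

Because only the base name is used, two inputs such as report.pdf and report.docx would map to the same ZIP in this sketch; how the real tool handles that case is not covered here.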
What to expect
When you run the script, it prints a summary per file:
- guardadas -> images saved successfully
- duplicadas -> images skipped due to hash duplication
- pequeñas -> images filtered out by size
- encontradas -> images found inside the document
Important notes
- In DOCX, images are saved using the real extension (.jpg, .png, .gif, etc.)
- temp/ is cleaned even when a file fails
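Both notes can be illustrated with the standard library alone. A DOCX file is a ZIP package whose embedded images sit under word/media/ with their original extensions, and a try/finally block guarantees cleanup. The sketch below uses zipfile instead of python-docx, which the project lists as a dependency, and the function names are hypothetical.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def docx_images(docx_path):
    """Yield (file name, bytes) for each image stored in a DOCX package."""
    with zipfile.ZipFile(docx_path) as archive:
        for member in archive.namelist():
            if member.startswith("word/media/") and not member.endswith("/"):
                # The stored name already carries the real extension.
                yield PurePosixPath(member).name, archive.read(member)


def extract_to_temp(docx_path, temp_dir="temp"):
    """Write the images to temp_dir and return their names.

    temp_dir is emptied afterwards, whether or not the file was readable.
    """
    temp = Path(temp_dir)
    temp.mkdir(parents=True, exist_ok=True)
    saved = []
    try:
        for name, data in docx_images(docx_path):
            (temp / name).write_bytes(data)
            saved.append(name)
        # A real run would filter and zip the files at this point.
        return saved
    except zipfile.BadZipFile:
        return []  # corrupt or non-DOCX input: report nothing, do not crash
    finally:
        shutil.rmtree(temp, ignore_errors=True)
        temp.mkdir(parents=True, exist_ok=True)
```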
Test suites
We use small test suites to validate the tool.
Documents to try
Includes:
- Mixed formats (JPG/PNG/GIF)
- Duplicates
- Small icon filtered out by size
- Corrupt files (error handling)
Manual validation steps:
- Copy test files into Entradas_archivos/
- Run python main.py
- Verify:
  - Output ZIPs exist in Salidas_archivos/
  - Extensions are correct in DOCX results (.jpg, .png, .gif)
  - Duplicates are removed
  - temp/ is empty at the end
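Some of these checks can be automated with a short script. The sketch below is not part of the project: validate_run is a hypothetical helper, duplicates are detected by SHA-256 over the ZIP members, and corrupt inputs or inputs with no images may legitimately have no ZIP, so a "missing ZIP" message is a prompt to look by hand rather than a definite failure.

```python
import hashlib
import zipfile
from pathlib import Path

SUPPORTED = {".docx", ".pdf", ".pptx", ".xlsx"}


def validate_run(input_dir="Entradas_archivos",
                 output_dir="Salidas_archivos",
                 temp_dir="temp"):
    """Return a list of human-readable problems found after a run."""
    problems = []
    for source in sorted(Path(input_dir).iterdir()):
        if source.suffix.lower() not in SUPPORTED:
            continue
        archive = Path(output_dir) / (source.stem + ".zip")
        if not archive.is_file():
            problems.append(f"missing ZIP for {source.name}")
            continue
        with zipfile.ZipFile(archive) as zf:
            digests = [hashlib.sha256(zf.read(name)).hexdigest()
                       for name in zf.namelist()]
        # Two members with the same digest are byte-identical images.
        if len(digests) != len(set(digests)):
            problems.append(f"duplicate images in {archive.name}")
    temp = Path(temp_dir)
    if temp.is_dir() and any(temp.iterdir()):
        problems.append("temp/ is not empty")
    return problems
```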
Contributing
If you want to propose changes:
- Fork the repo
- Create a branch
- Open a PR with a clear description
File details
Details for the file document_image_extractor-0.1.0.tar.gz.
File metadata
- Download URL: document_image_extractor-0.1.0.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d2848d874e86b3b029224eabb440827945cf491efd2570cf74634aab57c53662 |
| MD5 | 1bb6aba9fe0a7c1c88c5f08f1ca94d0f |
| BLAKE2b-256 | 9b90c21b022131f86bd95d9896944342712f70d9e12c3b1f6cbd0d9703219ea9 |
Provenance
The following attestation bundles were made for document_image_extractor-0.1.0.tar.gz:
Publisher: publish-pypi.yml on LeoMurilloDev/document-image-extractor
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: document_image_extractor-0.1.0.tar.gz
  - Subject digest: d2848d874e86b3b029224eabb440827945cf491efd2570cf74634aab57c53662
- Sigstore transparency entry: 1676096211
- Sigstore integration time:
- Permalink: LeoMurilloDev/document-image-extractor@a04ba851be97cc365fb6ada15edab42bdbb99812
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/LeoMurilloDev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@a04ba851be97cc365fb6ada15edab42bdbb99812
- Trigger Event: release
File details
Details for the file document_image_extractor-0.1.0-py3-none-any.whl.
File metadata
- Download URL: document_image_extractor-0.1.0-py3-none-any.whl
- Upload date:
- Size: 17.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2855c58a55aadb86293f08ecd76d593b9fb6d01ffc5ee394284d92b4e32317e7 |
| MD5 | 13b95c93fd48aa80c0d22712df665081 |
| BLAKE2b-256 | 4da029c28335f7f2c9648511e983aa77c3293feb0cb0ae75c0313ccd075cd14f |
Provenance
The following attestation bundles were made for document_image_extractor-0.1.0-py3-none-any.whl:
Publisher: publish-pypi.yml on LeoMurilloDev/document-image-extractor
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: document_image_extractor-0.1.0-py3-none-any.whl
  - Subject digest: 2855c58a55aadb86293f08ecd76d593b9fb6d01ffc5ee394284d92b4e32317e7
- Sigstore transparency entry: 1676096218
- Sigstore integration time:
- Permalink: LeoMurilloDev/document-image-extractor@a04ba851be97cc365fb6ada15edab42bdbb99812
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/LeoMurilloDev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@a04ba851be97cc365fb6ada15edab42bdbb99812
- Trigger Event: release