High-performance image dataset exploratory data analysis CLI tool
imgeda
High-performance CLI tool for exploratory data analysis of image datasets.
Scan folders of images, generate JSONL manifests with metadata and pixel statistics, detect quality issues, find duplicates, and produce publication-ready visualizations — all from the command line.
Installation
pip install imgeda
Or with uv:
uv tool install imgeda
Quick Start
# Scan a directory of images
imgeda scan ./images -o manifest.jsonl
# View dataset summary
imgeda info -m manifest.jsonl
# Check for quality issues
imgeda check all -m manifest.jsonl
# Generate all plots
imgeda plot all -m manifest.jsonl
# Generate an HTML report
imgeda report -m manifest.jsonl
Or just run imgeda with no arguments for an interactive wizard that walks you through everything.
Features
- Fast parallel scanning with multi-core ProcessPoolExecutor and Rich progress bars
- Resumable: press Ctrl+C anytime and progress is saved. Re-run and it picks up where it left off
- JSONL manifest — append-only, crash-tolerant, one record per image
- Per-image analysis: dimensions, file size, pixel statistics (mean/std per channel), brightness, perceptual hashing (phash + dhash), border artifact detection
- Quality checks: corrupt files, dark/overexposed images, border artifacts, exact and near-duplicate detection
- 7 plot types with automatic large-dataset adaptations
- Single-page HTML report with embedded plots and summary tables
- Interactive configurator for guided setup
- Lambda-compatible core — the analysis functions have zero CLI dependencies, ready for serverless deployment
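Near-duplicate detection on perceptual hashes usually comes down to comparing hashes by Hamming distance. The sketch below illustrates that general technique and is not imgeda's actual implementation: the function names and the `max_distance` value are hypothetical, and the hashes are assumed to be equal-length hex strings like the `phash` value in the manifest example.

```python
from itertools import combinations


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def near_duplicate_pairs(hashes: dict[str, str], max_distance: int = 4):
    """Return (path_a, path_b, distance) for pairs within max_distance bits.

    Brute-force O(n^2) comparison: fine for a sketch, too slow for very
    large datasets. max_distance is an illustrative value, not a tool default.
    """
    pairs = []
    for (path_a, h_a), (path_b, h_b) in combinations(sorted(hashes.items()), 2):
        distance = hamming_distance(h_a, h_b)
        if distance <= max_distance:
            pairs.append((path_a, path_b, distance))
    return pairs


hashes = {
    "cat.jpg": "a1b2c3d4",
    "cat_copy.jpg": "a1b2c3d5",  # differs from cat.jpg in a single bit
    "dog.jpg": "0f0f0f0f",
}
print(near_duplicate_pairs(hashes))
```

Exact duplicates are the special case where the distance is zero, so the same comparison covers both checks.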
Example Output
All examples below were generated from the Oxford-IIIT Pet Dataset (3,680 images).
Dimensions
Width vs. height scatter plot with reference lines for 720p, 1080p, and 4K resolutions.
Brightness Distribution
Histogram of mean brightness per image, with shaded regions for dark (<40) and overexposed (>220) images.
File Size Distribution
Log-scale histogram with annotated median, P95, and P99 percentile lines.
Aspect Ratio Distribution
Histogram with reference lines at common ratios (1:1, 4:3, 3:2, 16:9).
Channel Distributions
Box plots of mean R/G/B channel values across the dataset.
Border Artifact Analysis
Corner-to-center brightness delta histogram with configurable threshold line.
Duplicate Analysis
Duplicate group sizes and unique vs. duplicate breakdown.
CLI Reference
imgeda scan <DIR>
Scan a directory of images and produce a JSONL manifest.
Options:
-o, --output PATH Output manifest path [default: imgeda_manifest.jsonl]
--workers INTEGER Parallel workers [default: CPU count]
--checkpoint-every INTEGER Flush interval [default: 500]
--resume / --no-resume Auto-resume from existing manifest [default: resume]
--force Force full rescan (ignore existing manifest)
--skip-pixel-stats Metadata-only scan (faster)
--no-hashes Skip perceptual hashing
--extensions TEXT Comma-separated extensions to include
--dark-threshold FLOAT Dark image threshold [default: 40.0]
--overexposed-threshold FLOAT Overexposed threshold [default: 220.0]
--artifact-threshold FLOAT Border artifact threshold [default: 50.0]
--max-image-dim INTEGER Downsample threshold for pixel stats [default: 2048]
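The two exposure thresholds can be read as simple cutoffs on an image's mean brightness (0 to 255). The sketch below assumes strict comparisons, matching the "dark (<40)" and "overexposed (>220)" wording used for the brightness plot. `classify_exposure` is a hypothetical helper, not part of imgeda's API.

```python
def classify_exposure(
    mean_brightness: float,
    dark_threshold: float = 40.0,
    overexposed_threshold: float = 220.0,
) -> str:
    """Label an image by mean brightness using the documented default cutoffs."""
    if mean_brightness < dark_threshold:
        return "dark"
    if mean_brightness > overexposed_threshold:
        return "overexposed"
    return "ok"


for value in (12.5, 128.0, 240.0):
    print(value, classify_exposure(value))
```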
imgeda info -m <MANIFEST>
Print a Rich-formatted dataset summary.
imgeda check <SUBCOMMAND> -m <MANIFEST>
Subcommands: corrupt, exposure, artifacts, duplicates, all
imgeda plot <SUBCOMMAND> -m <MANIFEST>
Subcommands: dimensions, file-size, aspect-ratio, brightness, channels, artifacts, duplicates, all
Common options:
-o, --output PATH Output directory [default: ./plots]
--format TEXT Output format: png, pdf, svg [default: png]
--dpi INTEGER DPI for output [default: 150]
--sample INTEGER Sample N records for large datasets
imgeda report -m <MANIFEST>
Generate a single-page HTML report with embedded plots and statistics.
Manifest Format
The manifest is a JSONL file (one JSON object per line):
- Line 1: Metadata header (input directory, scan settings, schema version)
- Lines 2+: One ImageRecord per image with all computed fields
{"__manifest_meta__": true, "input_dir": "./images", "created_at": "2026-02-17T12:00:00", ...}
{"path": "./images/cat.jpg", "width": 500, "height": 375, "format": "JPEG", "phash": "a1b2c3d4", ...}
The manifest is append-only and crash-tolerant. Resume is keyed on (path, file_size, mtime) — modified files are automatically re-analyzed.
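Because the manifest is plain JSONL, it can be read with the standard library alone. The sketch below separates the metadata header from the image records and builds the set of resume keys. It rests on a few assumptions: the header is recognized by the `__manifest_meta__` key shown above, the record fields behind the resume key are literally named `file_size` and `mtime`, and a truncated trailing line can simply be skipped. `read_manifest` and `resume_keys` are hypothetical helpers, not functions exported by imgeda.

```python
import json
import tempfile
from pathlib import Path


def read_manifest(manifest_path):
    """Split a JSONL manifest into (meta, records)."""
    meta, records = None, []
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written line after a crash
            if not isinstance(obj, dict):
                continue
            if obj.get("__manifest_meta__"):
                meta = obj
            else:
                records.append(obj)
    return meta, records


def resume_keys(records):
    """Build (path, file_size, mtime) keys; field names are assumed."""
    return {(r["path"], r.get("file_size"), r.get("mtime")) for r in records}


sample = (
    '{"__manifest_meta__": true, "input_dir": "./images"}\n'
    '{"path": "./images/cat.jpg", "width": 500, "height": 375, '
    '"file_size": 1024, "mtime": 1.0}\n'
    '{"path": "./images/dog.jpg", "wid'  # truncated, as after an interruption
)
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.jsonl"
    manifest.write_text(sample, encoding="utf-8")
    meta, records = read_manifest(manifest)

print(meta["input_dir"], len(records))
```

A file whose size or modification time has changed produces a key that is not in the set, which is one way the "modified files are re-analyzed" behavior could be realized.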
Performance
Tested on a 10-core Apple M1 Pro with SSD:
| Operation | 3,680 images |
|---|---|
| Full scan (metadata + pixels + hashes) | ~8s |
| Plot generation | ~3s |
| HTML report | ~4s |
The tool is designed to handle 100K+ image datasets with batched processing, memory-bounded futures, and automatic plot adaptations for large datasets.
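One common way to keep futures memory-bounded is to cap how many are in flight at once, instead of submitting the whole dataset up front. The sketch below shows that pattern with `concurrent.futures`; it is an illustration under stated assumptions, not imgeda's scanner. `bounded_map`, `fake_analyze`, and the `max_in_flight` value are hypothetical, and a thread pool stands in for the process pool so the snippet runs unchanged on any platform.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def bounded_map(executor, fn, items, max_in_flight=64):
    """Yield fn(item) results with at most max_in_flight futures pending.

    Results arrive in completion order, not input order.
    """
    pending = set()
    for item in items:
        pending.add(executor.submit(fn, item))
        if len(pending) >= max_in_flight:
            # Block until at least one future finishes, then hand back results.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
    # Drain whatever is still running once the input is exhausted.
    for future in wait(pending).done:
        yield future.result()


def fake_analyze(path):
    # Stand-in for per-image work; a real scanner would decode the image here.
    return path, len(path)


paths = [f"img_{i}.jpg" for i in range(1000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(bounded_map(pool, fake_analyze, paths, max_in_flight=16))
print(len(results))
```

The same generator works with a `ProcessPoolExecutor`, provided the worker function is defined at module top level and the driver code sits behind an `if __name__ == "__main__":` guard.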
Development
git clone https://github.com/caylent/imgeda.git
cd imgeda
uv sync --all-extras
uv run pytest
uv run ruff check src/ tests/
License
MIT