High-performance image dataset exploratory data analysis CLI tool

imgeda

High-performance CLI tool for exploratory data analysis of image datasets.

Scan folders of images, generate JSONL manifests with metadata and pixel statistics, detect quality issues, find duplicates, and produce publication-ready visualizations — all from the command line.

Installation

pip install imgeda

Or with uv:

uv tool install imgeda

Quick Start

# Scan a directory of images
imgeda scan ./images -o manifest.jsonl

# View dataset summary
imgeda info -m manifest.jsonl

# Check for quality issues
imgeda check all -m manifest.jsonl

# Generate all plots
imgeda plot all -m manifest.jsonl

# Generate an HTML report
imgeda report -m manifest.jsonl

# Compare two manifests
imgeda diff --old v1.jsonl --new v2.jsonl

# Run quality gate (exit code 2 on failure — CI-friendly)
imgeda gate -m manifest.jsonl -p policy.yml

# Export to Parquet (requires: pip install imgeda[parquet])
imgeda export parquet -m manifest.jsonl -o manifest.parquet

Or just run imgeda with no arguments for an interactive wizard that walks you through everything:

# Interactive mode — auto-detects dataset format (YOLO, COCO, VOC, classification, flat)
imgeda

The wizard detects your dataset structure, shows a summary panel with image counts, splits, and class info, then lets you pick which splits and analyses to run.

Features

  • Fast parallel scanning with multi-core ProcessPoolExecutor and Rich progress bars
  • Resumable — Ctrl+C anytime, progress is saved. Re-run and it picks up where it left off
  • JSONL manifest — append-only, crash-tolerant, one record per image
  • Per-image analysis: dimensions, file size, pixel statistics (mean/std per channel), brightness, perceptual hashing (phash + dhash), border artifact detection, EXIF metadata (camera, lens, focal length, exposure, GPS flagging, distortion risk)
  • Quality checks: corrupt files, dark/overexposed images, border artifacts, exact and near-duplicate detection
  • 7 plot types with automatic large-dataset adaptations
  • Single-page HTML report with embedded plots and summary tables
  • Dataset format detection — auto-detects YOLO, COCO, Pascal VOC, classification, and flat image directories with split-aware scanning
  • Interactive configurator with Rich panels, split selection, and smart defaults
  • Lambda-compatible core — the analysis functions have zero CLI dependencies, ready for serverless deployment
  • Manifest diff — compare two manifests to track dataset changes over time
  • Quality gate — policy-as-code YAML rules with CI-friendly exit codes
  • Parquet export — streaming JSONL-to-Parquet conversion with flattened nested fields
  • AWS serverless deployment — CDK + Step Functions + Lambda for S3-scale analysis
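The near-duplicate detection listed above builds on perceptual hashes. As a rough sketch of the general idea, two hex-encoded hashes can be compared by Hamming distance (the number of differing bits). The function names and the `max_bits` threshold below are illustrative assumptions, not imgeda's actual API or defaults.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_near_duplicate(hash_a: str, hash_b: str, max_bits: int = 8) -> bool:
    """Treat two hashes as near-duplicates when few bits differ.

    max_bits is a hypothetical threshold chosen only for illustration.
    """
    return hamming_distance(hash_a, hash_b) <= max_bits
```

Exact duplicates are the special case where the distance is zero; a small non-zero distance usually indicates a resized or re-encoded copy.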

Example Output

All examples below were generated from the Food-101 dataset (2,000 images).

Dimensions

Width vs. height scatter plot with reference lines for 720p, 1080p, and 4K resolutions.

Brightness Distribution

Histogram of mean brightness per image, with shaded regions for dark (<40) and overexposed (>220) images.
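The shaded regions correspond to the default `--dark-threshold` (40.0) and `--overexposed-threshold` (220.0) values. A minimal sketch of that classification, assuming mean brightness is on a 0-255 scale; the function name and the returned labels are hypothetical, not taken from imgeda:

```python
def classify_exposure(mean_brightness: float,
                      dark_threshold: float = 40.0,
                      overexposed_threshold: float = 220.0) -> str:
    """Label an image by its mean brightness (assumed 0-255 scale)."""
    if mean_brightness < dark_threshold:
        return "dark"
    if mean_brightness > overexposed_threshold:
        return "overexposed"
    # Values exactly on a threshold fall in the normal range here,
    # matching the strict "<40" and ">220" bounds in the caption.
    return "ok"
```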

File Size Distribution

Log-scale histogram with annotated median, P95, and P99 percentile lines.

Aspect Ratio Distribution

Histogram with reference lines at common ratios (1:1, 4:3, 3:2, 16:9).

Channel Distributions

Violin plots of mean R/G/B channel values across the dataset.

Border Artifact Analysis

Corner-to-center brightness delta histogram with configurable threshold line.
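One plausible way to compute a corner-to-center delta is to compare the average brightness of small patches at the four corners with a patch at the center. The sketch below works on a plain 2D grayscale grid (a list of rows). The patch size, the use of an absolute difference, and both function names are assumptions made for illustration; imgeda's actual method may differ.

```python
from statistics import mean


def patch_mean(gray, top, left, size):
    """Mean of a size x size patch of a 2D grayscale grid (list of rows)."""
    return mean(gray[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size))


def corner_center_delta(gray, patch_frac=0.1):
    """Absolute difference between mean corner and center brightness."""
    height, width = len(gray), len(gray[0])
    # patch_frac is a hypothetical parameter: patch side as a fraction
    # of the shorter image dimension, with a minimum of one pixel.
    size = max(1, int(min(height, width) * patch_frac))
    corners = [
        patch_mean(gray, 0, 0, size),
        patch_mean(gray, 0, width - size, size),
        patch_mean(gray, height - size, 0, size),
        patch_mean(gray, height - size, width - size, size),
    ]
    center = patch_mean(gray, (height - size) // 2, (width - size) // 2, size)
    return abs(mean(corners) - center)
```

An image whose delta exceeds the configured threshold (`--artifact-threshold`, default 50.0) would then be flagged as having a likely border artifact.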

Duplicate Analysis

Duplicate group sizes and unique vs. duplicate breakdown.

CLI Reference

imgeda scan <DIR>

Scan a directory of images and produce a JSONL manifest.

Options:
  -o, --output PATH           Output manifest path [default: imgeda_manifest.jsonl]
  --workers INTEGER           Parallel workers [default: CPU count]
  --checkpoint-every INTEGER  Flush interval [default: 500]
  --resume / --no-resume      Auto-resume from existing manifest [default: resume]
  --force                     Force full rescan (ignore existing manifest)
  --skip-pixel-stats          Metadata-only scan (faster)
  --skip-exif                 Skip EXIF metadata extraction
  --no-hashes                 Skip perceptual hashing
  --extensions TEXT           Comma-separated extensions to include
  --dark-threshold FLOAT      Dark image threshold [default: 40.0]
  --overexposed-threshold FLOAT  Overexposed threshold [default: 220.0]
  --artifact-threshold FLOAT  Border artifact threshold [default: 50.0]
  --max-image-dim INTEGER     Downsample threshold for pixel stats [default: 2048]

imgeda info -m <MANIFEST>

Print a Rich-formatted dataset summary.

imgeda check <SUBCOMMAND> -m <MANIFEST>

Subcommands: corrupt, exposure, artifacts, duplicates, all

imgeda plot <SUBCOMMAND> -m <MANIFEST>

Subcommands: dimensions, file-size, aspect-ratio, brightness, channels, artifacts, duplicates, all

Common options:
  -o, --output PATH     Output directory [default: ./plots]
  --format TEXT         Output format: png, pdf, svg [default: png]
  --dpi INTEGER         DPI for output [default: 150]
  --sample INTEGER      Sample N records for large datasets

imgeda report -m <MANIFEST>

Generate a single-page HTML report with embedded plots and statistics.

imgeda diff --old <MANIFEST> --new <MANIFEST>

Compare two manifests and show added, removed, and changed images with field-level diffs.

Options:
  -o, --out PATH    Output JSON path (optional)
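To illustrate what a field-level diff between two manifests might look like, here is a minimal sketch that matches records on their `path` field. The function name and the shape of the returned dictionary are assumptions for illustration, not the format `imgeda diff` actually writes.

```python
def diff_manifests(old_records, new_records):
    """Compare two lists of record dicts, matched on their "path" field."""
    old = {r["path"]: r for r in old_records}
    new = {r["path"]: r for r in new_records}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for path in sorted(old.keys() & new.keys()):
        # Map each differing field to its (old, new) pair of values.
        fields = {
            key: (old[path].get(key), new[path].get(key))
            for key in sorted(old[path].keys() | new[path].keys())
            if old[path].get(key) != new[path].get(key)
        }
        if fields:
            changed[path] = fields
    return {"added": added, "removed": removed, "changed": changed}
```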

imgeda gate -m <MANIFEST> -p <POLICY>

Evaluate a manifest against a YAML quality policy. Exit code 0 = pass, 2 = fail.

Options:
  -o, --out PATH    Output JSON path (optional)

Example policy (policy.yml):

max_corrupt_pct: 1.0
max_overexposed_pct: 5.0
max_underexposed_pct: 5.0
max_duplicate_pct: 10.0
min_images_total: 100
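A policy like the one above can be read as a set of percentage ceilings plus a minimum image count. The sketch below shows one way such rules might be evaluated. The policy keys come from the example, but the `counts` summary structure and the `evaluate_policy` function are hypothetical, and the policy is written as a Python dict to keep the example dependency-free.

```python
def evaluate_policy(policy: dict, counts: dict) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    total = counts["total"]
    violations = []
    if total < policy.get("min_images_total", 0):
        violations.append(
            f"min_images_total: {total} < {policy['min_images_total']}")
    for name in ("corrupt", "overexposed", "underexposed", "duplicate"):
        limit = policy.get(f"max_{name}_pct")
        if limit is None or total == 0:
            continue
        pct = 100.0 * counts.get(name, 0) / total
        if pct > limit:
            violations.append(f"max_{name}_pct: {pct:.2f} > {limit}")
    return violations


policy = {
    "max_corrupt_pct": 1.0,
    "max_overexposed_pct": 5.0,
    "max_underexposed_pct": 5.0,
    "max_duplicate_pct": 10.0,
    "min_images_total": 100,
}
# Hypothetical per-dataset counts, invented for this example.
counts = {"total": 200, "corrupt": 1, "overexposed": 4,
          "underexposed": 30, "duplicate": 10}

violations = evaluate_policy(policy, counts)
# Mirror the documented convention: 0 = pass, 2 = fail.
exit_code = 2 if violations else 0
```

In CI, a non-zero exit code fails the pipeline step, which is what makes the gate usable as a merge check.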

imgeda export parquet -m <MANIFEST> -o <OUTPUT>

Export manifest to Parquet format with flattened nested fields. Requires pip install imgeda[parquet].
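Flattening matters because Parquet is columnar, so nested objects are easier to query once each leaf value has its own column. A minimal sketch of the idea follows. The underscore separator and the `flatten` function name are assumptions, as is the nested `pixel_stats` record used in the usage note; the column names imgeda actually produces may differ.

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into a single level with joined key names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse so arbitrarily deep nesting ends up as one column per leaf.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat
```

For example, a hypothetical record `{"path": "a.jpg", "pixel_stats": {"mean": {"r": 120.5}}}` would flatten to the columns `path` and `pixel_stats_mean_r`.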

Architecture

See docs/architecture.md for detailed system diagrams including the local CLI flow, AWS serverless flow, CI/CD quality gate flow, and full module dependency graph.

Manifest Format

The manifest is a JSONL file (one JSON object per line):

  • Line 1: Metadata header (input directory, scan settings, schema version)
  • Lines 2+: One ImageRecord per image with all computed fields

{"__manifest_meta__": true, "input_dir": "./images", "created_at": "2026-02-17T12:00:00", ...}
{"path": "./images/cat.jpg", "width": 500, "height": 375, "format": "JPEG", "camera_make": "Canon", "focal_length_35mm": 50, "distortion_risk": "low", "has_gps_data": false, "phash": "a1b2c3d4", ...}
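Because the manifest is plain JSONL, it can be loaded with the standard library alone. The sketch below separates the metadata header from the image records using the `__manifest_meta__` marker shown above. The `read_manifest` helper is hypothetical, and skipping unparseable lines is an assumption about how a partially written final line might reasonably be handled.

```python
import json


def read_manifest(path):
    """Return (meta, records) from a JSONL manifest file."""
    meta, records = None, []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: tolerate a truncated line left by an interrupted write.
                continue
            if obj.get("__manifest_meta__"):
                meta = obj
            else:
                records.append(obj)
    return meta, records
```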

The manifest is append-only and crash-tolerant. Resume is keyed on (path, file_size, mtime) — modified files are automatically re-analyzed.
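The resume rule can be pictured as a set-membership test on the `(path, file_size, mtime)` tuple. This is a sketch under assumptions: the helper names are hypothetical, and it assumes each record stores the size and modification time under the keys `file_size` and `mtime`, which the text above suggests but does not confirm.

```python
import os


def build_seen_keys(records):
    """Collect the resume key of every record already in the manifest."""
    # Assumes "file_size" and "mtime" are the record field names.
    return {(r["path"], r["file_size"], r["mtime"]) for r in records}


def resume_key(path):
    """Build the (path, file_size, mtime) key for a file on disk."""
    st = os.stat(path)
    return (path, st.st_size, st.st_mtime)


def needs_scan(path, seen_keys):
    """A file is (re-)analyzed when its current key is not in the manifest."""
    return resume_key(path) not in seen_keys
```

Since a modified file changes its size or modification time, its key no longer matches and it is picked up again on the next run, while untouched files are skipped.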

Performance

Tested on a 10-core Apple M1 Pro with SSD:

Operation                                3,680 images
Full scan (metadata + pixels + hashes)   ~8s
Plot generation                          ~3s
HTML report                              ~4s

The tool is designed to handle 100K+ image datasets with batched processing, memory-bounded futures, and automatic plot adaptations for large datasets.

Development

git clone https://github.com/caylent/imgeda.git
cd imgeda
uv sync --all-extras
uv run pytest
uv run ruff check src/ tests/

License

MIT
