Audit and clean image datasets before training, labeling, or sharing.

🧹 imgclean

Find duplicates, blur, corruption, leakage, and quality issues in image datasets before they ship.


Most image datasets have hidden problems. imgclean makes them obvious in one pass, with a CLI that is fast to try and reports that are easy to review with a team.

$ imgclean scan ./dataset --workers 8 --report-dir ./reports
         Scan Summary
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric              ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total files         │ 12438 │
│ Scanned OK          │ 12397 │
│ Corrupted           │    41 │
│ Total findings      │  1525 │
│   ↳ near duplicate  │  1083 │
│   ↳ exact duplicate │   214 │
└─────────────────────┴───────┘

[Image: imgclean HTML report preview]

Highlights

  • One command to scan a dataset and export HTML, JSON, and CSV reports.
  • Built-in checks for corruption, blur, exposure, resolution, aspect ratio, duplicates, and split leakage.
  • Parallel scan path with --workers and config-based parallel.max_workers.
  • Works as both a CLI tool and a Python API for pipelines and notebooks.
  • Safe cleanup workflow with clean, quarantine, and representative-keep actions.
  • Test-backed core with 50 automated test cases and GitHub Actions CI.

Try it in 60 seconds

pip install imgclean
imgclean clean ./dataset --workers 8 --report-dir ./reports

The command writes a shareable HTML report plus machine-readable JSON and CSV outputs in ./reports, then previews the quarantine plan without moving anything unless you add --execute.


🤔 Why imgclean

| Problem | What goes wrong |
| --- | --- |
| Exact duplicates in training data | Model memorises samples, inflated accuracy |
| Near-duplicates crossing train/val | Evaluation metrics are meaningless |
| Blurry or tiny images | Wasted annotation budget, noisy gradients |
| Corrupted files | Silent crashes in your data loader at 3 AM |
| Overexposed / underexposed frames | Class imbalance in lighting conditions |
| Mislabeled split assignments | You think your model generalises; it does not |

imgclean makes these problems visible in seconds and gives you tools to fix them.


🥊 Compared with other workflows

| Workflow | Duplicate + leakage checks | Cleanup actions | Shareable reports | Best fit |
| --- | --- | --- | --- | --- |
| imgclean | ✅ built in | ✅ clean / quarantine | ✅ HTML + JSON + CSV | Pre-training dataset QA |
| cleanvision | ✅ focused on image issues | ❌ review-only | ⚠️ notebook/report oriented | Exploratory dataset analysis |
| FiftyOne | ⚠️ possible with app workflows | ⚠️ manual curation flows | ✅ interactive app views | Large visual review workflows |
| Manual scripts | ⚠️ custom only | ⚠️ custom only | ❌ usually none | One-off internal jobs |

📦 Installation

pip install imgclean

Optional (CLIP-based near-duplicate detection and outlier analysis):

pip install "imgclean[embeddings]"   # torch + open_clip + faiss-cpu

Development install:

git clone https://github.com/Weiykong/imgclean.git
cd imgclean
python3 -m pip install --user uv
make install
make test

Supported formats: JPEG · PNG · BMP · GIF · TIFF · WebP


🚀 Quick start

CLI

# Full audit: produces HTML, JSON, and CSV reports
imgclean scan ./dataset --workers 8 --report-dir ./reports --open

# Duplicates only, strict threshold
imgclean dedup ./dataset --threshold 4 --workers 8

# Check train/val/test splits for data leakage
imgclean leakage ./train ./val ./test

# Quality checks (blur, exposure, resolution)
imgclean quality ./dataset --workers 8

# Scan and preview a cleanup plan in one step
imgclean clean ./dataset --issues corrupted,blurry --report-dir ./reports

# Preview what would be quarantined, then do it
imgclean quarantine ./dataset --issues corrupted,blurry
imgclean quarantine ./dataset --issues corrupted,blurry --execute

Python API

from imgclean import scan_dataset

report = scan_dataset("./dataset")
print(f"{report.summary.findings_count} issues found in {report.summary.duration_seconds:.1f}s")

# Specific checks only
report = scan_dataset(
    "./dataset",
    checks=["blur", "corruption", "duplicates"],
    thresholds={"blur_laplacian_min": 80.0, "min_width": 128},
)

# Split-aware scan (enables leakage detection)
report = scan_dataset(
    "./dataset",
    splits={"train": "./train", "val": "./val", "test": "./test"},
)

# Iterate findings
for f in report.findings:
    print(f"[{f.severity.value}] {f.issue_type.value}: {f.file_path.name}")

🖥️ CLI reference

imgclean scan: full dataset audit

imgclean scan <path> [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --config, -c | none | YAML or JSON config file |
| --report-dir, -o | . | Output directory for reports |
| --no-html | false | Skip HTML report |
| --no-json | false | Skip JSON report |
| --no-csv | false | Skip CSV report |
| --open | false | Open HTML in browser after scan |
| --no-cache | false | Disable feature cache |
| --workers, -w | auto | Max worker threads for image scanning |
| --verbose, -v | false | Debug logging |

imgclean scan ./dataset --workers 8 --report-dir ./audit --open --config imgclean.yaml

imgclean dedup: duplicate detection

imgclean dedup <path> [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --threshold, -t | 8 | Max Hamming distance (0 = exact byte matches only) |
| --report-dir, -o | . | Output directory |
| --workers, -w | auto | Max worker threads for image scanning |

imgclean dedup ./dataset --threshold 6 --workers 8
imgclean dedup ./dataset --threshold 0   # exact duplicates only

imgclean leakage: split contamination check

imgclean leakage <train> [val] [test] [OPTIONS]

Detects images (exact or perceptually similar) that appear in more than one split.

imgclean leakage ./train ./val ./test --report-dir ./leakage_report

imgclean quality: quality checks only

imgclean quality <path> [OPTIONS]

| Option | Description |
| --- | --- |
| --blur/--no-blur | Check for blur (default on) |
| --exposure/--no-exposure | Check over/underexposure (default on) |
| --resolution/--no-resolution | Check resolution (default on) |
| --workers, -w | Max worker threads for image scanning |

imgclean quality ./dataset --workers 8 --no-exposure

imgclean clean: scan then quarantine

imgclean clean <path> [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --issues, -i | all errors | Comma-separated issue types to quarantine |
| --out, -o | ./quarantine | Destination folder |
| --execute | false | Actually move files (default is dry-run) |
| --report-dir | . | Output directory for HTML, JSON, and CSV reports |
| --workers, -w | auto | Max worker threads for image scanning |

# Preview cleanup + write reports
imgclean clean ./dataset --issues corrupted,blurry --workers 8 --report-dir ./reports

# Then execute
imgclean clean ./dataset --issues corrupted --out ./review --execute

imgclean quarantine: move flagged files

imgclean quarantine <path> [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --issues, -i | all errors | Comma-separated issue types |
| --out, -o | ./quarantine | Destination folder |
| --execute | false | Actually move files (default is dry-run) |

# Preview first
imgclean quarantine ./dataset --issues corrupted,blurry

# Then execute
imgclean quarantine ./dataset --issues corrupted,blurry --out ./review --execute

Valid issue types: corrupted · low_resolution · aspect_ratio · blurry · underexposed · overexposed · exact_duplicate · near_duplicate · split_leakage · outlier


imgclean report: re-render HTML from JSON

imgclean report imgclean_report.json --open
imgclean report results.json --html report_v2.html

🐍 Python API

scan_dataset()

from imgclean import scan_dataset

report = scan_dataset(
    path,                  # str | Path: dataset root
    config_file=None,      # str | Path: YAML/JSON config
    checks=None,           # list[str]: checks to run (None = all enabled)
    thresholds=None,       # dict: threshold overrides
    splits=None,           # dict[str, Path]: split directories
    cache=True,            # bool: disk feature cache
    verbose=False,         # bool: debug logging
)

Working with results

# Summary
s = report.summary
print(s.total_files, s.findings_count, s.issue_counts)

# All findings
for f in report.findings:
    print(f.issue_type.value, f.severity.value, f.file_path, f.score)

# Grouped by type
by_type = report.findings_by_type()
blurry  = by_type.get("blurry", [])
dupes   = by_type.get("exact_duplicate", [])

# Duplicate clusters
groups = {}
for f in dupes:
    groups.setdefault(f.group_id, []).append(f.file_path)

Post-scan actions

from imgclean.actions import quarantine_findings, get_removal_candidates
from imgclean.reports import write_html, write_json
from pathlib import Path

# Write reports manually (API does not write files by default)
write_json(report, Path("report.json"))
write_html(report, Path("report.html"), open_browser=True)

# Quarantine problematic files (dry_run defaults to True, which only previews)
quarantine_findings(
    findings=report.findings,
    quarantine_dir=Path("./quarantine"),
    issue_filter=["corrupted", "blurry"],
    root=Path("./dataset"),
    dry_run=False,   # actually move files; leave as True to preview
)

# Files to remove to deduplicate (keeps one representative per cluster)
to_remove = get_removal_candidates(report.findings)

Finding fields

| Field | Type | Description |
| --- | --- | --- |
| issue_type | IssueType | Enum: corrupted, blurry, exact_duplicate, … |
| severity | Severity | error · warning · info |
| file_path | Path | Absolute path to the affected file |
| message | str | Human-readable explanation |
| score | float \| None | Measured value (e.g. Laplacian variance, Hamming distance) |
| threshold | float \| None | Threshold that triggered the finding |
| related_files | list[Path] | Duplicate partners, leakage matches |
| group_id | str \| None | Cluster ID for grouped issues |
| metadata | dict | Extra context (brightness, width/height, …) |

⚙️ Configuration

imgclean scan ./dataset --config imgclean.yaml

Full annotated imgclean.yaml:

dataset:
  path: ./dataset
  recursive: true

checks:
  corruption: true
  resolution: true
  aspect_ratio: true
  blur: true
  exposure: true
  exact_duplicates: true
  perceptual_duplicates: true
  embedding_duplicates: false   # requires imgclean[embeddings]
  split_leakage: true
  outliers: false               # requires imgclean[embeddings]

thresholds:
  # Resolution
  min_width: 256
  min_height: 256

  # Aspect ratio  (width / height)
  aspect_ratio_min: 0.1         # flag very tall images
  aspect_ratio_max: 10.0        # flag very wide images

  # Blur  (Laplacian variance; higher = sharper)
  blur_laplacian_min: 60.0

  # Exposure  (mean pixel brightness 0–255)
  exposure_dark_max: 25.0
  exposure_bright_min: 230.0

  # Perceptual duplicates  (pHash Hamming distance)
  phash_hamming_max: 8

  # Embedding duplicates  (cosine similarity 0–1)
  embedding_similarity_min: 0.95

  # Outliers  (kNN on embedding space)
  outlier_knn_k: 5
  outlier_distance_percentile: 95.0

report:
  html: true
  json_report: true
  csv_report: true
  output_dir: ./reports
  open_browser: false

actions:
  quarantine: false
  quarantine_dir: ./quarantine
  dry_run: true          # always preview before executing

cache:
  enabled: true
  dir_name: .imgclean_cache

parallel:
  max_workers: null       # null = ThreadPoolExecutor default

Merge priority (highest wins): CLI flags → config file → built-in defaults
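The precedence rule can be pictured as a recursive dictionary merge applied from the lowest-priority layer upward. The sketch below is illustrative only: `deep_merge` and the three sample dictionaries are hypothetical and are not imgclean's loader; only the key names are taken from the annotated YAML.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical layers; key names follow the annotated imgclean.yaml.
defaults = {
    "thresholds": {"min_width": 256, "blur_laplacian_min": 60.0},
    "parallel": {"max_workers": None},
}
config_file = {"thresholds": {"blur_laplacian_min": 80.0}}
cli_flags = {"parallel": {"max_workers": 8}}

# Apply the lowest priority first so that later layers win.
effective = deep_merge(deep_merge(defaults, config_file), cli_flags)
print(effective)
```

Merging nested sections key by key means a config file that sets one threshold does not wipe out the defaults for the others.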


🔍 Checks

File integrity

| Check | Issue | Severity | How |
| --- | --- | --- | --- |
| corruption | corrupted | 🔴 error | PIL two-pass: verify() (header/checksum) + load() (pixel decode) |

Quality

| Check | Issue | Severity | How |
| --- | --- | --- | --- |
| blur | blurry | 🟡 warning | Variance of the Laplacian; low variance = uniform = blurry |
| exposure | underexposed | 🟡 warning | Mean brightness < exposure_dark_max (default 25) |
| exposure | overexposed | 🟡 warning | Mean brightness > exposure_bright_min (default 230) |
| resolution | low_resolution | 🟡 warning | Width or height below min_width / min_height |
| aspect_ratio | aspect_ratio | 🟡 warning | Ratio outside [aspect_ratio_min, aspect_ratio_max] |
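To make the blur and exposure rows concrete, here is a dependency-free sketch of both scores on a tiny greyscale image stored as a list of rows. It assumes the common 4-neighbour Laplacian kernel and skips border pixels, so the numbers will not match what imgclean gets from OpenCV. The function names `brightness_score` and `laplacian_var` are hypothetical.

```python
from statistics import mean, pvariance


def brightness_score(gray):
    """Mean pixel intensity of a greyscale image given as rows of 0-255 ints."""
    return mean(p for row in gray for p in row)


def laplacian_var(gray):
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


# Two synthetic 5x5 images: a uniform dark patch and a high-contrast checkerboard.
flat = [[10] * 5 for _ in range(5)]
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]

# Thresholds below are the documented defaults (exposure_dark_max, blur_laplacian_min).
print("flat underexposed:", brightness_score(flat) < 25.0)
print("flat blurry:", laplacian_var(flat) < 60.0)
print("checker blurry:", laplacian_var(checker) < 60.0)
```

A uniform patch has no edges, so its Laplacian response is zero everywhere and the variance collapses to zero, which is why low variance is read as blur.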

Duplicates

| Check | Issue | Severity | How |
| --- | --- | --- | --- |
| exact_duplicates | exact_duplicate | 🟡 warning | SHA-256 hash grouping |
| perceptual_duplicates | near_duplicate | 🟡 warning | pHash + Hamming distance ≤ threshold; union-find clustering |
| embedding_duplicates ✨ | embedding_duplicate | 🟡 warning | CLIP cosine similarity ≥ threshold |
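The perceptual row combines two ideas: a Hamming-distance test between hashes, and union-find to turn pairwise matches into clusters. A minimal sketch follows, assuming hashes are plain integers; the helper names and the 16-bit sample hashes are made up for illustration and do not come from imgclean.

```python
from itertools import combinations


def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(h1 ^ h2).count("1")


def cluster_near_duplicates(hashes: dict, threshold: int = 8) -> list:
    """Group names whose hashes are linked by a chain of distances <= threshold."""
    parent = {name: name for name in hashes}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(hashes, 2):
        if hamming(hashes[a], hashes[b]) <= threshold:
            parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for name in hashes:
        clusters.setdefault(find(name), []).append(name)
    # Only groups with more than one member count as duplicates.
    return sorted(sorted(group) for group in clusters.values() if len(group) > 1)


# Made-up 16-bit hashes; real perceptual hashes are usually longer.
hashes = {
    "a.jpg": 0b1111000011110000,
    "b.jpg": 0b1111000011110001,  # 1 bit from a.jpg
    "c.jpg": 0b1111000011110011,  # 1 bit from b.jpg, 2 bits from a.jpg
    "d.jpg": 0b0000111100001111,  # far from everything else
}
print(cluster_near_duplicates(hashes, threshold=1))
```

With a threshold of 1, a.jpg and c.jpg are not direct matches, yet they land in the same cluster because both match b.jpg. That transitive grouping is what union-find adds over a plain pairwise comparison. The all-pairs loop is quadratic and is only meant for a small illustration.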

Split integrity

| Check | Issue | Severity | How |
| --- | --- | --- | --- |
| split_leakage (exact) | split_leakage | 🔴 error | Same SHA-256 across splits |
| split_leakage (perceptual) | split_leakage | 🟡 warning | pHash Hamming distance ≤ threshold across splits |
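The exact-leakage rule amounts to hashing every file and reporting digests that show up in more than one split. Here is a minimal sketch using in-memory byte strings in place of files; `find_exact_leakage` and the sample data are hypothetical, and the real check works on scanned records rather than raw bytes.

```python
import hashlib
from collections import defaultdict


def find_exact_leakage(splits: dict) -> dict:
    """Map each SHA-256 digest seen in 2+ splits to the (split, name) pairs sharing it."""
    seen = defaultdict(list)
    for split_name, files in splits.items():
        for file_name, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            seen[digest].append((split_name, file_name))
    return {
        digest: members
        for digest, members in seen.items()
        if len({split for split, _ in members}) > 1
    }


# Byte strings stand in for file contents to keep the sketch self-contained.
splits = {
    "train": {"cat_001.jpg": b"cat-bytes", "dog_001.jpg": b"dog-bytes"},
    "val": {"cat_copy.jpg": b"cat-bytes", "bird_001.jpg": b"bird-bytes"},
}
for digest, members in find_exact_leakage(splits).items():
    print(digest[:8], members)
```

Note that two identical files inside the same split are ordinary duplicates, not leakage, which is why the filter counts distinct split names rather than members.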

Outliers

| Check | Issue | Severity | How |
| --- | --- | --- | --- |
| outliers ✨ | outlier | 🔵 info | Mean kNN cosine distance above the Nth percentile |

✨ Requires pip install "imgclean[embeddings]"
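The outlier rule can be sketched in pure Python: score each embedding by its mean cosine distance to its k nearest neighbours, then flag scores above a percentile cutoff. This sketch assumes a nearest-rank percentile and non-zero vectors; the function names and 2-D toy vectors are hypothetical, and imgclean's own percentile method is not documented here.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def knn_outliers(vectors: dict, k: int = 5, percentile: float = 95.0) -> list:
    """Names whose mean distance to their k nearest neighbours exceeds the cutoff."""
    scores = {}
    for name, vec in vectors.items():
        dists = sorted(
            cosine_distance(vec, other)
            for other_name, other in vectors.items()
            if other_name != name
        )
        nearest = dists[:k]
        scores[name] = sum(nearest) / len(nearest)
    ranked = sorted(scores.values())
    # Nearest-rank percentile (an assumption): value at rank ceil(p/100 * n).
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    cutoff = ranked[rank - 1]
    return sorted(name for name, score in scores.items() if score > cutoff)


# Toy 2-D "embeddings": five point roughly the same way, one points elsewhere.
vectors = {
    "a": (1.0, 0.0), "b": (0.9, 0.1), "c": (1.0, 0.1),
    "d": (0.95, 0.05), "e": (1.0, 0.05), "odd": (0.0, 1.0),
}
print(knn_outliers(vectors, k=2, percentile=80.0))
```

Using the mean over several neighbours rather than the single nearest one keeps a pair of near-identical oddballs from hiding each other.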


📄 Outputs

HTML report

A self-contained HTML file (no external dependencies):

  • Summary cards: total files, scanned OK, corrupted, findings by type
  • Per-issue tables: file path · severity · score · threshold · message
  • Cluster view: duplicate and leakage groups, representative highlighted

JSON report

{
  "summary": {
    "total_files": 1000,
    "scanned_files": 997,
    "corrupted_files": 3,
    "findings_count": 142,
    "issue_counts": { "blurry": 31, "exact_duplicate": 44, "corrupted": 3 },
    "duration_seconds": 4.2
  },
  "findings": [
    {
      "issue_type": "blurry",
      "severity": "warning",
      "file_path": "dataset/train/img_042.jpg",
      "score": 12.3,
      "threshold": 60.0,
      "message": "Image appears blurry (Laplacian variance 12.3 < threshold 60.0)."
    }
  ]
}
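Because the JSON report is plain data, it can be post-processed with the standard library alone. The sketch below parses a made-up report in the same shape and ranks blurry files; only the field names come from the example, and the null score on the corrupted entry is an assumption.

```python
import json
from collections import Counter

# Made-up report text; in practice this would be read from the written JSON file.
report_text = """
{
  "summary": {"total_files": 4, "findings_count": 3},
  "findings": [
    {"issue_type": "blurry", "severity": "warning",
     "file_path": "train/a.jpg", "score": 41.0, "threshold": 60.0},
    {"issue_type": "blurry", "severity": "warning",
     "file_path": "train/b.jpg", "score": 12.3, "threshold": 60.0},
    {"issue_type": "corrupted", "severity": "error",
     "file_path": "val/c.jpg", "score": null, "threshold": null}
  ]
}
"""

report = json.loads(report_text)
findings = report["findings"]

# Count findings per severity, e.g. to fail a CI job when any error is present.
by_severity = Counter(f["severity"] for f in findings)

# Rank blurry images from worst (lowest Laplacian variance) upward.
blurry = sorted(
    (f for f in findings if f["issue_type"] == "blurry"),
    key=lambda f: f["score"],
)

print(dict(by_severity))
print([f["file_path"] for f in blurry])
```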

CSV report

One row per finding, ready for spreadsheet review or programmatic filtering:

issue_type,severity,file_path,score,threshold,group_id,related_files,message
blurry,warning,train/img_042.jpg,12.3,60.0,,,Image appears blurry...
exact_duplicate,warning,train/cat_001.jpg,,,a3b1c9,val/cat_001.jpg,Exact duplicate...
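As a sketch of the programmatic filtering mentioned above, the two sample rows can be read with the standard csv module. The text is inlined so the example runs on its own; for a real report you would open the CSV file written to the report directory instead.

```python
import csv
import io

csv_text = """issue_type,severity,file_path,score,threshold,group_id,related_files,message
blurry,warning,train/img_042.jpg,12.3,60.0,,,Image appears blurry...
exact_duplicate,warning,train/cat_001.jpg,,,a3b1c9,val/cat_001.jpg,Exact duplicate...
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
duplicates = [r for r in rows if r["issue_type"] == "exact_duplicate"]

for row in duplicates:
    # related_files holds partner paths joined with "|"; an empty cell means none.
    partners = row["related_files"].split("|") if row["related_files"] else []
    print(row["group_id"], row["file_path"], "->", partners)
```

Every value comes back as a string, so numeric columns such as score need an explicit float() conversion, with empty cells handled first.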

🏗️ Architecture

imgclean follows a strict layered design: each layer has a single responsibility and depends only on layers below it.

┌─────────────────────────────────────────────────────────────┐
│  cli/        Command-line interface (Typer + Rich)          │
│  api/        Public Python API  scan_dataset()              │
├─────────────────────────────────────────────────────────────┤
│  core/       Orchestration: scanner · pipeline · registry   │
├────────────────────────┬────────────────────────────────────┤
│  reports/              │  actions/                          │
│  HTML · JSON · CSV     │  quarantine · move · dedup         │
├─────────────────────────────────────────────────────────────┤
│  checks/     10 independent checks (BaseCheck subclasses)   │
├─────────────────────────────────────────────────────────────┤
│  features/   Laplacian · brightness · pHash · CLIP embeds   │
│  io/         filesystem · image loader · hashing · cache    │
├─────────────────────────────────────────────────────────────┤
│  models/     ImageRecord · Finding · Dataset · ScanReport   │
│  config/     Pydantic schema · YAML/JSON loader             │
│  utils/      logging · timing · parallel_map · thresholds   │
└─────────────────────────────────────────────────────────────┘

Layer-by-layer breakdown

models/: pure data structures

| File | Class | Description |
| --- | --- | --- |
| image_record.py | ImageRecord | One image: path, size, format, sha256, phash, corruption flag |
| finding.py | Finding | One issue: type, severity, score, threshold, related files, cluster id |
| issue_types.py | IssueType, Severity | Enums for all issue and severity types |
| dataset.py | Dataset | List of ImageRecords with helpers (valid(), by_split(), corrupted()) |
| report.py | ReportSummary, ScanReport | Aggregated results: summary stats + all findings |
| actions.py | ActionType, ActionPlan | Describes a planned file operation |

config/: typed configuration

| File | Purpose |
| --- | --- |
| defaults.py | Module-level constants for every threshold and setting |
| schema.py | Pydantic v2 models with validation (Config, ChecksConfig, ThresholdsConfig, …) |
| loader.py | load_config(path, overrides): loads YAML/JSON and deep-merges CLI overrides |

io/: all file access

| File | Key function(s) |
| --- | --- |
| filesystem.py | discover_images(root, recursive): glob with extension filtering |
| image_loader.py | load_image(path) → LoadResult; two-pass: verify() then load() |
| hashing.py | sha256(path), phash(image), dhash(image), hamming_distance(h1, h2) |
| cache.py | FeatureCache: JSON disk cache keyed by file path, invalidated on mtime change |
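The cache row describes a simple invalidation rule: an entry is valid only while the file's modification time matches the one recorded with it. A minimal sketch of that idea follows; `TinyFeatureCache` is a hypothetical stand-in and does not reflect the real FeatureCache interface or file layout.

```python
import json
import os
import tempfile
from pathlib import Path


class TinyFeatureCache:
    """Hypothetical JSON cache: entries keyed by path and tagged with the file's mtime."""

    def __init__(self, cache_file: Path):
        self.cache_file = cache_file
        self.entries = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    def get(self, path: Path):
        entry = self.entries.get(str(path))
        # A changed mtime means the file was modified, so the entry is stale.
        if entry and entry["mtime"] == path.stat().st_mtime:
            return entry["features"]
        return None

    def put(self, path: Path, features: dict) -> None:
        self.entries[str(path)] = {"mtime": path.stat().st_mtime, "features": features}
        self.cache_file.write_text(json.dumps(self.entries))


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "img.jpg"
    image.write_bytes(b"fake image bytes")
    cache = TinyFeatureCache(Path(tmp) / "cache.json")

    cache.put(image, {"sha256": "abc"})
    hit = cache.get(image)       # mtime unchanged: cached features are returned

    stat = image.stat()
    os.utime(image, (stat.st_atime, stat.st_mtime + 10))  # simulate an edit
    miss = cache.get(image)      # mtime changed: treated as a cache miss

print(hit, miss)
```

Comparing mtimes is much cheaper than re-hashing, at the cost of missing edits that preserve the timestamp.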

Why two-pass image loading? PIL's verify() checks headers and checksums without decoding pixel data, and the image object cannot be used again afterwards. load() forces full pixel decoding and catches truncated files. The two calls therefore run in separate with Image.open() blocks, verify() first.


features/: shared computation

| File | Functions | What |
| --- | --- | --- |
| quality.py | laplacian_variance(img) | Blur score via OpenCV Laplacian |
| quality.py | mean_brightness(img) | Mean pixel intensity (greyscale, 0–255) |
| perceptual.py | compute_phash(img), compute_dhash(img) | Perceptual hashes via imagehash |
| metadata.py | file_metadata(path), exif_metadata(img) | File size, mtime, EXIF tags |
| embeddings.py | embed_image(img), cosine_similarity(a, b) | CLIP embeddings (lazy-loaded, optional) |

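checks/: analysis logic

Every check inherits BaseCheck and implements one method, run():

from __future__ import annotations  # keeps Dataset / Finding annotations unevaluated
from abc import ABC, abstractmethod

class BaseCheck(ABC):
    name: str           # used in config keys and reports
    description: str

    @abstractmethod
    def run(self, dataset: Dataset) -> list[Finding]: ...

    def is_enabled(self) -> bool: ...   # reads config.checks.<name>

Checks are stateless, independent, and testable in isolation. Most never read from disk, because the scanner pre-populates all fields on ImageRecord; the exceptions are BlurCheck and ExposureCheck, which re-load the image to compute their scores.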
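To show what a check built on pre-populated record fields might look like, here is a self-contained sketch. `Record`, `Issue`, and `Check` are simplified stand-ins for ImageRecord, Finding, and BaseCheck, and `MinWidthCheck` is a hypothetical example, not one of imgclean's checks.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Record:            # stand-in for ImageRecord
    path: str
    width: int
    height: int


@dataclass
class Issue:             # stand-in for Finding
    issue_type: str
    file_path: str
    score: float
    threshold: float


class Check(ABC):        # stand-in for BaseCheck
    name: str

    @abstractmethod
    def run(self, records: list) -> list: ...


class MinWidthCheck(Check):
    """Flags records narrower than min_width, using only pre-populated fields."""

    name = "resolution"

    def __init__(self, min_width: int = 256):
        self.min_width = min_width

    def run(self, records):
        return [
            Issue("low_resolution", r.path, float(r.width), float(self.min_width))
            for r in records
            if r.width < self.min_width
        ]


records = [Record("a.jpg", 1024, 768), Record("b.jpg", 120, 90)]
findings = MinWidthCheck(min_width=256).run(records)
print(findings)
```

Because the check touches nothing but its input records, it can be unit-tested with a handful of hand-built objects and no image files.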

| Class | name | Notes |
| --- | --- | --- |
| CorruptionCheck | corruption | Reads record.is_corrupted set by scanner |
| ResolutionCheck | resolution | Compares record.width/height to thresholds |
| AspectRatioCheck | aspect_ratio | Uses record.aspect_ratio property |
| BlurCheck | blur | Re-loads image, calls laplacian_variance() |
| ExposureCheck | exposure | Re-loads image, calls mean_brightness() |
| ExactDuplicatesCheck | exact_duplicates | Groups by record.sha256 |
| PerceptualDuplicatesCheck | perceptual_duplicates | Union-find on pHash Hamming distances |
| EmbeddingDuplicatesCheck | embedding_duplicates | CLIP cosine similarity (optional) |
| SplitLeakageCheck | split_leakage | SHA-256 and pHash cross-split comparison |
| OutliersCheck | outliers | kNN distance on CLIP embedding matrix (optional) |

core/: orchestration

| File | Key function | What |
| --- | --- | --- |
| registry.py | build_checks(config) | Instantiate enabled checks in execution order |
| scanner.py | scan_directory(), scan_splits() | Build Dataset from disk, populate ImageRecords |
| pipeline.py | run_pipeline(checks, dataset) | Run each check, collect findings, log timing |
| orchestrator.py | run_scan(paths, config, split_map) | Top-level entry point |

Execution order (cheap per-file checks first, expensive group checks last):

Corruption → Resolution → AspectRatio → Blur → Exposure
→ ExactDuplicates → PerceptualDuplicates → EmbeddingDuplicates
→ SplitLeakage → Outliers

reports/: output generation

| File | Output |
| --- | --- |
| html.py | Self-contained HTML via Jinja2 (templates/report.html.j2) |
| json.py | Full JSON (summary + all findings as dicts) |
| csv.py | One row per finding; related_files joined with \| |

actions/: file operations

All functions accept dry_run=True so you can always preview before committing.

| File | Function | What |
| --- | --- | --- |
| quarantine.py | quarantine_findings(...) | Move flagged files to a review folder |
| move.py | move_files(paths, dest, root, dry_run) | Move, preserving relative structure |
| copy.py | copy_files(paths, dest, root, dry_run) | Copy to destination |
| keep_representative.py | select_representatives(findings) | Pick one file per duplicate cluster |
| keep_representative.py | get_removal_candidates(findings) | Flat list of non-representative files |
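The dry-run pattern and the "preserving relative structure" behaviour can be sketched together: build a plan of source and target paths first, and touch the filesystem only when dry_run is off. `move_preserving_layout` is a hypothetical helper written for this illustration; it is not imgclean's move_files.

```python
import shutil
import tempfile
from pathlib import Path


def move_preserving_layout(paths, dest: Path, root: Path, dry_run: bool = True):
    """Plan (and optionally perform) moves that keep each path relative to root."""
    plan = [(p, dest / p.relative_to(root)) for p in paths]
    if not dry_run:
        for source, target in plan:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target))
    return plan


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    flagged = root / "train" / "img_042.jpg"
    flagged.parent.mkdir(parents=True)
    flagged.write_bytes(b"fake image bytes")
    quarantine = Path(tmp) / "quarantine"

    preview = move_preserving_layout([flagged], quarantine, root)       # dry run
    still_there = flagged.exists()

    move_preserving_layout([flagged], quarantine, root, dry_run=False)  # real move
    moved = (quarantine / "train" / "img_042.jpg").exists() and not flagged.exists()

print(preview[0][1].relative_to(quarantine), still_there, moved)
```

Returning the plan in both modes lets the preview and the real run share one code path, so what you review is what gets executed.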

Data flow

images/
  ↓  filesystem.py       discover paths
  ↓  scanner.py          build ImageRecords (load · hash · cache)
  ↓
Dataset[ImageRecord]
  ↓  registry.py         build enabled checks
  ↓  pipeline.py         run each check in order
  ↓
list[Finding]
  ↓  orchestrator.py     build ScanReport + ReportSummary
  ↓
reports/   →  HTML · JSON · CSV
actions/   →  quarantine · dedup cleanup   (optional)

✨ Optional: embedding-based features

pip install "imgclean[embeddings]"

Enables two checks that use CLIP (ViT-B/32):

| Check | What it finds |
| --- | --- |
| embedding_duplicates | Visually similar images even when pHash disagrees, such as cropped, colour-shifted, or resized variants |
| outliers | Images that are visually isolated from the rest of the dataset |

# imgclean.yaml
checks:
  embedding_duplicates: true
  outliers: true

thresholds:
  embedding_similarity_min: 0.95
  outlier_knn_k: 5
  outlier_distance_percentile: 95.0

from imgclean import scan_dataset

report = scan_dataset(
    "./dataset",
    checks=["embedding_duplicates", "outliers"],
)

A GPU is used automatically when available; otherwise imgclean falls back to the CPU.


🧪 Test suite

The repo currently ships with 50 automated tests covering configuration, hashing, duplicate detection, parallel scan plumbing, CLI cleanup flows, reporting, and a synthetic end-to-end scan pipeline.

make test
make lint   # C901 complexity gate

CI runs on Python 3.10, 3.11, and 3.12 for pushes and pull requests.


🗺️ Roadmap

| Version | Features |
| --- | --- |
| v1.1 | Thumbnail galleries in HTML report · Faster SQLite cache |
| v1.2 | Class-aware analysis · Per-class outliers · Imbalance summary |
| v1.3 | Bounding box sanity checks · Segmentation mask QA |
| v2 | Interactive web UI · Dataset version comparison |

Contributing

git clone https://github.com/Weiykong/imgclean.git
cd imgclean
python3 -m pip install --user uv
make install
make check

See CONTRIBUTING.md for the local setup, command reference, and PR checklist.


License

MIT © Wei Yuan Kong

If imgclean saves you dataset cleanup time, consider starring the repo.
