A pipeline for curating and sanitizing large-scale image datasets.

Vision Data Curation (VDC)

A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.

Status

This project is in early development. Most features are functional, but APIs may still change.

  • Implemented Features: Input validation, Example-based filtering, Aesthetic filtering, NSFW filtering, Hierarchical sampling
  • Features in Progress: Duplicate removal, Rotation correction

Feedback and contributions are welcome.

Features

VDC provides modular tools for dataset cleanup:

  • Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
  • Example-based filtering - remove images similar to a set of unwanted examples
  • Image quality filtering - remove images based on aesthetic score or NSFW classification
  • Duplicate removal (in progress) - identify and remove near-duplicate images from your dataset
  • Hierarchical K-Means sampling - select diverse, representative subsets from large datasets
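
As an illustration of the resolution and aspect-ratio checks mentioned under input validation, here is a minimal sketch. It is not VDC's implementation: the class, the function, and both thresholds are hypothetical, and detecting corrupt files would additionally require an image decoder, which is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationLimits:
    # Hypothetical thresholds for illustration; not VDC's defaults.
    min_side: int = 64
    max_aspect_ratio: float = 4.0


def check_dimensions(
    width: int, height: int, limits: ValidationLimits = ValidationLimits()
) -> list[str]:
    """Return the reasons an image would be rejected (empty list if it passes)."""
    if width <= 0 or height <= 0:
        return ["invalid dimensions"]

    problems = []
    short_side, long_side = min(width, height), max(width, height)
    if short_side < limits.min_side:
        problems.append("low resolution")
    if long_side / short_side > limits.max_aspect_ratio:
        problems.append("extreme aspect ratio")
    return problems


if __name__ == "__main__":
    for size in [(640, 480), (32, 480), (2000, 100)]:
        print(size, check_dimensions(*size))
```

Returning a list of reasons rather than a boolean makes it easier to report why an image was dropped.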

Coming soon:

  • Rotation correction (correct 90°/180°/270° orientation errors)

The Curation Pipeline

flowchart LR
    A[Raw<br/>Dataset] --> V[Validation]
    V --> R[Rotation*]
    R --> D[Dedup*]
    D --> E[Example<br/>Filter]
    E --> Q[Quality Filter<br/>Aesthetic/NSFW]
    Q --> S[Cluster-based<br/>Sampling]
    S --> F[Curated<br/>Dataset]

    U[Unwanted<br/>Examples] --> E

Note: * = WIP
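
The flowchart describes a chain of stages, each taking a set of images and passing a reduced set to the next. A minimal sketch of that idea follows. The stage functions are placeholders invented for illustration and do not correspond to VDC's scripts or API.

```python
from typing import Callable, Iterable

# A stage takes a list of image paths and returns the paths it keeps.
Stage = Callable[[list[str]], list[str]]


def keep_images(paths: list[str]) -> list[str]:
    """Placeholder validation stage: keep only files with an image extension."""
    return [p for p in paths if p.lower().endswith((".jpg", ".jpeg", ".png"))]


def drop_exact_duplicates(paths: list[str]) -> list[str]:
    """Placeholder dedup stage: drop repeated paths, preserving order."""
    return list(dict.fromkeys(paths))


def run_pipeline(items: Iterable[str], stages: list[Stage]) -> list[str]:
    """Apply each stage in order, feeding its output to the next stage."""
    current = list(items)
    for stage in stages:
        current = stage(current)
    return current


if __name__ == "__main__":
    raw = ["a.jpg", "notes.txt", "b.png", "a.jpg"]
    print(run_pipeline(raw, [keep_images, drop_exact_duplicates]))
```

Because every stage shares the same signature, stages can be reordered or skipped without changing the others.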

Installation

From PyPI

pip install vision-data-curation

From Source

git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .

When you work from the project root, scripts and configuration files can be run as if the package were fully installed.

Usage

Each step is a script under vdc.scripts.

Examples:

# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv

  • Run python -m vdc.scripts to see available scripts
  • Run python -m vdc.scripts.<script> --help for options
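
To give a feel for what example-based filtering does with embeddings, here is a minimal sketch that drops any image whose embedding is close to an unwanted example. This is an assumption about the general technique, not a description of how filter_by_examples works internally: the cosine-similarity measure, the function names, and the 0.9 threshold are all hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_similar_to_examples(
    embeddings: dict[str, list[float]],
    unwanted: list[list[float]],
    threshold: float = 0.9,  # hypothetical cutoff, chosen for illustration
) -> list[str]:
    """Return the names whose embedding is not close to any unwanted example."""
    kept = []
    for name, vector in embeddings.items():
        closest = max(
            (cosine_similarity(vector, example) for example in unwanted),
            default=0.0,
        )
        if closest < threshold:
            kept.append(name)
    return kept


if __name__ == "__main__":
    embeddings = {
        "cat.jpg": [1.0, 0.0],
        "logo.jpg": [0.0, 1.0],
        "near_logo.jpg": [0.1, 0.99],
    }
    print(drop_similar_to_examples(embeddings, unwanted=[[0.0, 1.0]]))
```

In practice the embeddings would come from a vision model and the comparison would be vectorized; the pure-Python loop here only shows the logic.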

Configuration:

  • Default settings live in vdc/conf/config.json
  • A config.json in your project root takes precedence over the defaults; alternatively, pass --config to any script
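
The precedence rule above can be sketched as follows. This is a guess at the behavior rather than VDC's actual loader: the function name is hypothetical, and it assumes that an explicit --config file wins over a project-root config.json and that overrides are merged shallowly on top of the defaults.

```python
import json
from pathlib import Path
from typing import Optional


def load_config(
    default_path: Path,
    project_root: Path,
    explicit_path: Optional[Path] = None,
) -> dict:
    """Load the defaults, then apply at most one override file on top of them."""
    config = json.loads(default_path.read_text())

    if explicit_path is not None:
        # Assumption: a file passed via --config beats the project-root file.
        override_path: Optional[Path] = explicit_path
    elif (project_root / "config.json").is_file():
        override_path = project_root / "config.json"
    else:
        override_path = None

    if override_path is not None:
        # Assumption: shallow merge, so top-level keys in the override replace defaults.
        config.update(json.loads(override_path.read_text()))
    return config
```

For example, load_config(Path("vdc/conf/config.json"), Path.cwd()) would return the defaults with any top-level keys from a local config.json applied on top.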

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
