A pipeline for curating and sanitizing large-scale image datasets.

Vision Data Curation (VDC)

A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.

Status

This project is in early development. Most features are functional, but APIs may still change.

  • Implemented features: input validation, example-based filtering, aesthetic filtering, NSFW filtering, hierarchical sampling
  • Features in progress: duplicate removal, rotation correction

Feedback and contributions are welcome.

Features

VDC provides modular tools for dataset cleanup:

  • Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
  • Example-based filtering - remove images similar to a set of unwanted examples
  • Image quality filtering - remove images based on aesthetic score or NSFW classification
  • Hierarchical K-Means sampling - select diverse, representative subsets from large datasets

Coming soon:

  • Duplicate removal
  • Rotation correction (correct 90°/180°/270° orientation errors)
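
To make the input-validation checks listed above more concrete, here is a minimal sketch of the resolution and aspect-ratio rules. It is not VDC's implementation: the names `ValidationLimits` and `validate_dimensions` and the threshold values are hypothetical, chosen only for illustration. It assumes the image has already been decoded far enough to know its width and height.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationLimits:
    # Hypothetical thresholds for illustration; these are not VDC defaults.
    min_side: int = 64
    max_aspect_ratio: float = 4.0


def validate_dimensions(width: int, height: int,
                        limits: ValidationLimits = ValidationLimits()) -> list:
    """Return the reasons an image should be rejected (empty list if it passes)."""
    if width <= 0 or height <= 0:
        return ["invalid dimensions"]
    problems = []
    # Reject images whose shorter side is below the minimum.
    if min(width, height) < limits.min_side:
        problems.append("low resolution")
    # Reject very long or very tall images.
    if max(width, height) / min(width, height) > limits.max_aspect_ratio:
        problems.append("extreme aspect ratio")
    return problems
```

Returning a list of reasons rather than a boolean makes it easy to log why a file was dropped.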

The Curation Pipeline

flowchart LR
    A[Raw<br/>Dataset] --> V[Validation]
    V --> R[Rotation*]
    R --> D[Dedup*]
    D --> E[Example<br/>Filter]
    E --> Q[Quality Filter<br/>Aesthetic/NSFW/Skip]
    Q --> S[Cluster-based<br/>Sampling]
    S --> F[Curated<br/>Dataset]

    U[Unwanted<br/>Examples] --> E

Note: * = WIP
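
The final sampling stage in the flow above can be illustrated with a simplified sketch. VDC describes its sampler as hierarchical K-Means; the code below shows only a single level of plain k-means followed by equal-sized draws from each cluster, which is the basic idea behind cluster-balanced sampling. The function names and parameters are hypothetical and do not come from VDC.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments


def balanced_sample(points, k, per_cluster, seed=0):
    """Pick up to `per_cluster` point indices from each k-means cluster."""
    rng = random.Random(seed)
    assignments = kmeans(points, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignments) if a == c]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(chosen)
```

Drawing the same number of items from every cluster keeps rare groups from being swamped by common ones, which is why a small cluster contributes as many samples as a large one here.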

Installation

From PyPI

pip install vision-data-curation

From Source

git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .

When working from the project root, you can run the scripts and use the configuration files as if the package were fully installed.

Usage

Each step is a script under vdc.scripts.

Examples:

# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv

  • Run python -m vdc.scripts to see available scripts
  • Run python -m vdc.scripts.<script> --help for options
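
As a rough illustration of what filtering by unwanted examples means, the sketch below drops any image whose embedding is too similar to one of the unwanted embeddings. This is an assumption about the general technique, not a description of what `filter_by_examples` does internally: the cosine-similarity metric, the 0.9 threshold, the in-memory dictionary format, and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_examples(embeddings, unwanted, threshold=0.9):
    """Keep the names whose embedding is not too close to any unwanted example.

    `threshold` is a hypothetical value, not a VDC default.
    """
    kept = []
    for name, vector in embeddings.items():
        # Similarity to the closest unwanted example (0.0 if there are none).
        closest = max((cosine_similarity(vector, u) for u in unwanted), default=0.0)
        if closest < threshold:
            kept.append(name)
    return kept


# Toy 2-D embeddings; real embeddings would come from a vision model.
embeddings = {"cat.jpg": [1.0, 0.0], "meme.jpg": [0.0, 1.0], "dog.jpg": [0.9, 0.1]}
unwanted = [[0.0, 1.0]]
print(filter_by_examples(embeddings, unwanted))  # prints ['cat.jpg', 'dog.jpg']
```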

Configuration:

  • Default settings live in vdc/conf/config.json
  • A config.json in your project root takes precedence over the defaults (or pass --config to any script)
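
The lookup order described above could be sketched as follows. This is an illustration under assumptions, not VDC's code: it assumes an explicit `--config` path wins over a project-root `config.json`, which in turn wins over the packaged default, and the names `resolve_config_path` and `load_config` are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_config_path(explicit: Optional[Path], project_root: Path,
                        package_default: Path) -> Path:
    """Pick the config file to load.

    Assumed order: explicit --config path, then <project_root>/config.json,
    then the packaged default (e.g. vdc/conf/config.json).
    """
    if explicit is not None:
        return explicit
    local = project_root / "config.json"
    if local.is_file():
        return local
    return package_default


def load_config(path: Path) -> dict:
    """Read a JSON config file into a dictionary."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```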

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
