A pipeline for curating and sanitizing large-scale image datasets.

Vision Data Curation (VDC)

A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.

Status

This project is in early development. Most features are functional, but APIs may still change.

  • Implemented Features: Input validation, Duplicate removal, Example-based filtering, Aesthetic filtering, NSFW filtering, Hierarchical sampling
  • Features in Progress: Rotation correction

Feedback and contributions are welcome.

Features

VDC provides modular tools for dataset cleanup:

  • Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
  • Example-based filtering - remove images similar to a set of unwanted examples
  • Image quality filtering - remove images based on aesthetic score or NSFW classification
  • Duplicate removal - identify and remove near-duplicate images from your dataset
  • Hierarchical K-Means sampling - select diverse, representative subsets from large datasets
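As an illustration of the kind of check the input validation step performs, here is a minimal sketch of the resolution and aspect-ratio tests. It is not VDC's implementation: the `ValidationLimits` class, the `validate_dimensions` function, and the threshold values are all hypothetical. Detecting corrupt files would additionally require an image decoder and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationLimits:
    # Hypothetical thresholds chosen for illustration, not VDC defaults.
    min_side: int = 64
    max_aspect_ratio: float = 4.0


def validate_dimensions(
    width: int, height: int, limits: ValidationLimits = ValidationLimits()
) -> list[str]:
    """Return the reasons an image should be rejected (empty list if it passes)."""
    if width <= 0 or height <= 0:
        return ["invalid dimensions"]

    problems = []
    # Low resolution: the shorter side is below the minimum.
    if min(width, height) < limits.min_side:
        problems.append("low resolution")
    # Extreme aspect ratio: long side divided by short side exceeds the limit.
    if max(width, height) / min(width, height) > limits.max_aspect_ratio:
        problems.append("extreme aspect ratio")
    return problems


print(validate_dimensions(640, 480))
print(validate_dimensions(2000, 100))
```

Returning a list of reasons rather than a boolean makes it easy to report why each file was dropped.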

Coming soon:

  • Rotation correction (correct 90°/180°/270° orientation errors)

The Curation Pipeline

flowchart LR
    A[Raw<br/>Dataset] --> V[Validation]
    V --> R[Rotation*]
    R --> D[Dedup]
    D --> E[Example<br/>Filter]
    E --> Q[Quality Filter<br/>Aesthetic/NSFW]
    Q --> S[Cluster-based<br/>Sampling]
    S --> F[Curated<br/>Dataset]

    U[Unwanted<br/>Examples] --> E

Note: steps marked with * are work in progress (WIP).
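The flow above can be thought of as a chain of stages, each taking a list of images and returning the ones it keeps. The sketch below shows that idea only. VDC itself runs each step as a separate script, and `run_pipeline` and the two toy stages are hypothetical stand-ins.

```python
from typing import Callable, Iterable

# A stage takes a list of image paths and returns the paths it keeps.
Stage = Callable[[list[str]], list[str]]


def run_pipeline(
    paths: Iterable[str], stages: list[tuple[str, Stage]]
) -> tuple[list[str], dict[str, int]]:
    """Apply stages in order and record how many items each stage removed."""
    kept = list(paths)
    removed: dict[str, int] = {}
    for name, stage in stages:
        before = len(kept)
        kept = stage(kept)
        removed[name] = before - len(kept)
    return kept, removed


# Toy stand-ins for real stages (hypothetical, for illustration only).
def drop_non_images(paths: list[str]) -> list[str]:
    return [p for p in paths if p.lower().endswith((".jpg", ".png"))]


def drop_exact_repeats(paths: list[str]) -> list[str]:
    # dict.fromkeys keeps the first occurrence of each path, in order.
    return list(dict.fromkeys(paths))


kept, removed = run_pipeline(
    ["a.jpg", "b.txt", "a.jpg", "c.png"],
    [("validation", drop_non_images), ("dedup", drop_exact_repeats)],
)
print(kept, removed)
```

Running cheap checks such as validation before the more expensive ones means later stages see fewer images.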

Installation

From PyPI

pip install vision-data-curation

From Source

git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .

When you work directly from the project root, scripts and configuration files run as if the package were fully installed.

Usage

Each step is a script under vdc.scripts.

Examples:

# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv

  • Run python -m vdc.scripts to see available scripts
  • Run python -m vdc.scripts.<script> --help for options
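To give a feel for what example-based filtering does with embeddings, here is a minimal sketch that removes items whose embedding is close to any unwanted example. It assumes cosine similarity and a fixed threshold; the actual metric, threshold, and file handling used by `vdc.scripts.filter_by_examples` may differ. The function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def split_by_unwanted_similarity(
    embeddings: dict[str, list[float]],
    unwanted: list[list[float]],
    threshold: float = 0.9,  # assumed cutoff, for illustration only
) -> tuple[list[str], list[str]]:
    """Return (kept, removed) based on the highest similarity to any unwanted example."""
    kept, removed = [], []
    for name, vector in embeddings.items():
        best = max((cosine_similarity(vector, u) for u in unwanted), default=0.0)
        (removed if best >= threshold else kept).append(name)
    return kept, removed


# Made-up 3-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {
    "cat.jpg": [1.0, 0.0, 0.0],
    "logo.jpg": [0.0, 1.0, 0.1],
    "dog.jpg": [0.9, 0.1, 0.0],
}
unwanted = [[0.0, 1.0, 0.0]]  # e.g. the embedding of a logo image

kept, removed = split_by_unwanted_similarity(embeddings, unwanted)
print(kept, removed)
```

Comparing against the maximum similarity means a single close match to any unwanted example is enough to drop an image.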

Configuration:

  • Default settings live in vdc/conf/config.json
  • A config.json in your project root takes precedence over the defaults; alternatively, pass --config to any script
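The precedence rule can be sketched as a layered load. This is an assumption-heavy illustration, not VDC's loader. It assumes that override values are merged key by key into the defaults rather than replacing the whole file, and that an explicitly passed path wins over the project-root file. `deep_merge`, `load_config`, and the `dedup` keys are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(
    default_path: Path, project_path: Path, explicit_path: Optional[Path] = None
) -> dict:
    """Load defaults, then layer the explicit or project-root config on top."""
    config = json.loads(default_path.read_text(encoding="utf-8"))
    override_path = explicit_path if explicit_path is not None else project_path
    if override_path.is_file():
        override = json.loads(override_path.read_text(encoding="utf-8"))
        config = deep_merge(config, override)
    return config


# Demonstration with temporary files and made-up settings.
with tempfile.TemporaryDirectory() as tmp:
    defaults = Path(tmp) / "defaults.json"
    project = Path(tmp) / "config.json"
    defaults.write_text(json.dumps({"dedup": {"threshold": 0.9, "batch": 64}}))
    project.write_text(json.dumps({"dedup": {"threshold": 0.8}}))
    config = load_config(defaults, project)

print(config)
```

With a key-by-key merge, a project config only needs to list the settings it changes.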

Documentation

For detailed walkthroughs, examples, and in-depth guides, please refer to our full documentation.

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
