A pipeline for curating and sanitizing large-scale image datasets.

Vision Data Curation (VDC)

A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.

VDC follows a data-centric AI approach: it aims to improve the quality and diversity of a dataset through a structured, iterative pipeline, so that models are trained on better data from the start.

Features

VDC provides modular tools for dataset cleanup:

  • Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
  • Rotation correction - correct 90°/180°/270° orientation errors
  • Example-based filtering - remove images similar to a set of unwanted examples
  • Image quality filtering - remove images based on aesthetic score or NSFW classification
  • Duplicate removal - identify and remove near-duplicate images from your dataset
  • Hierarchical K-Means sampling - select diverse, representative subsets from large datasets
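
To make the input validation step more concrete, here is a minimal sketch of resolution and aspect-ratio checks. It is not VDC's implementation: the names ValidationLimits and validation_issues and both threshold values are hypothetical, chosen only for illustration, and the function works on width/height values that are assumed to have been read from the image already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationLimits:
    # Hypothetical thresholds for illustration; not VDC's defaults.
    min_side: int = 64
    max_aspect_ratio: float = 4.0


def validation_issues(
    width: int, height: int, limits: ValidationLimits = ValidationLimits()
) -> list[str]:
    """Return the list of problems found for an image of the given size."""
    if width <= 0 or height <= 0:
        return ["invalid dimensions"]

    issues = []
    short_side, long_side = min(width, height), max(width, height)
    if short_side < limits.min_side:
        issues.append("low resolution")
    if long_side / short_side > limits.max_aspect_ratio:
        issues.append("extreme aspect ratio")
    return issues
```

An empty list means the image passes both checks. Returning every issue, rather than stopping at the first, makes it easier to report why an image was rejected.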

The Curation Pipeline

flowchart LR
    A[Raw<br/>Dataset] --> V[Validation]
    V --> R[Rotation]
    R --> D[Dedup]
    D --> E[Example<br/>Filter]
    E --> Q[Quality Filter<br/>Aesthetic/NSFW]
    Q --> S[Cluster-based<br/>Sampling]
    S --> F[Curated<br/>Dataset]

    U[Unwanted<br/>Examples] --> E

Installation

From PyPI

pip install vision-data-curation

From Source

git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .

When you work directly from the project root, scripts and configuration can be run as if the package were fully installed.

Usage

Each step is a script under vdc.scripts.

Examples:

# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv

  • Run python -m vdc.scripts to see available scripts
  • Run python -m vdc.scripts.<script> --help for options
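
The idea behind example-based filtering can be sketched in a few lines of Python. This is an illustration under assumptions, not the logic of vdc.scripts.filter_by_examples: it assumes embeddings are plain lists of floats, uses cosine similarity as the distance measure, and uses a hypothetical threshold of 0.9. The function names are also hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_by_examples(
    embeddings: dict[str, list[float]],
    unwanted: list[list[float]],
    threshold: float = 0.9,  # hypothetical value, for illustration only
) -> tuple[list[str], list[str]]:
    """Split image names into (kept, removed).

    An image is removed when its embedding is at least `threshold`
    similar to any of the unwanted example embeddings.
    """
    kept, removed = [], []
    for name, vector in embeddings.items():
        best = max((cosine_similarity(vector, u) for u in unwanted), default=0.0)
        (removed if best >= threshold else kept).append(name)
    return kept, removed
```

For example, with a single unwanted example [1.0, 0.0], an image embedded at [0.99, 0.1] would be removed, while one embedded at [0.0, 1.0] would be kept.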

Model downloads: models required by certain features (e.g., SSCD, PE, CLIP, aesthetic/NSFW predictors) are downloaded automatically and stored in the models/ directory (defined by vdc.conf.settings.MODELS_DIR).

Configuration

  • Default settings live in vdc/conf/config.json
  • A config.json in your project root will take precedence (or pass --config to any script)
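
The lookup order described above could be sketched as follows. This is an assumption about the behavior, not VDC's code: resolve_config_path is a hypothetical helper, and the sketch assumes that an explicit --config path wins over a project-root config.json, which in turn wins over the packaged default.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_config_path(
    default_path: PathLike,
    project_root: PathLike,
    cli_config: Optional[PathLike] = None,
) -> Path:
    """Pick which config file to load.

    Assumed order: explicit --config value, then config.json in the
    project root (if it exists), then the packaged default.
    """
    if cli_config is not None:
        return Path(cli_config)

    local_config = Path(project_root) / "config.json"
    if local_config.is_file():
        return local_config

    return Path(default_path)
```

Whether the selected file replaces the defaults entirely or is merged over them is not covered by this sketch.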

Visualization & Exploration

VDC ships with R Markdown notebooks (in the notebooks/ directory) for visually inspecting the effect of each curation step.

Although these notebooks use the .Rmd extension, they are written entirely in Python. The text-based format was chosen because it produces cleaner, more meaningful Git diffs than .ipynb files.

To use the notebooks interactively in Jupyter, we recommend installing jupytext (pip install jupytext). With jupytext installed, .Rmd files can be opened and run in Jupyter as if they were .ipynb notebooks.

Documentation

For detailed walkthroughs, examples, and in-depth guides, please refer to our full documentation.

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
