A pipeline for curating and sanitizing large-scale image datasets.
Vision Data Curation (VDC)
A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.
Status
This project is in early development. Most features are functional, but APIs may still change.
- Implemented Features: Input validation, Example-based filtering, Aesthetic filtering, NSFW filtering, Hierarchical sampling
- Features in Progress: Duplicate removal, Rotation correction
Feedback and contributions are welcome.
Features
VDC provides modular tools for dataset cleanup:
- Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
- Example-based filtering - remove images similar to a set of unwanted examples
- Image quality filtering - remove images based on aesthetic score or NSFW classification
- Hierarchical K-Means sampling - select diverse, representative subsets from large datasets
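To make the input validation checks listed above more concrete, here is a minimal sketch of the resolution and aspect-ratio rules. It is not VDC's implementation: the `ValidationLimits` class, the `validate_dimensions` function, and the threshold values are all hypothetical, and the real defaults are whatever the project's configuration defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationLimits:
    # Hypothetical thresholds for illustration; VDC's actual defaults may differ.
    min_side: int = 64
    max_aspect_ratio: float = 4.0


def validate_dimensions(
    width: int, height: int, limits: ValidationLimits = ValidationLimits()
) -> list[str]:
    """Return the reasons an image should be rejected (empty list if it passes)."""
    if width <= 0 or height <= 0:
        return ["invalid dimensions"]
    problems = []
    # Low resolution: the shorter side is below the minimum.
    if min(width, height) < limits.min_side:
        problems.append("low resolution")
    # Extreme aspect ratio: long side divided by short side exceeds the limit.
    if max(width, height) / min(width, height) > limits.max_aspect_ratio:
        problems.append("extreme aspect ratio")
    return problems
```

Returning a list of reasons rather than a boolean makes it easy to log why each file was rejected. Detecting corrupt files or invalid formats would additionally require decoding the image, which this sketch leaves out.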
Coming soon:
- Duplicate removal
- Rotation correction (correct 90°/180°/270° orientation errors)
The Curation Pipeline
flowchart LR
A[Raw<br/>Dataset] --> V[Validation]
V --> R[Rotation*]
R --> D[Dedup*]
D --> E[Example<br/>Filter]
E --> Q[Quality Filter<br/>Aesthetic/NSFW/Skip]
Q --> S[Cluster-based<br/>Sampling]
S --> F[Curated<br/>Dataset]
U[Unwanted<br/>Examples] --> E
Note: * = WIP
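The final cluster-based sampling stage in the flowchart can be illustrated with a small, dependency-free sketch. It runs k-means twice (once on the points, once on the resulting centroids) and then draws the same number of items from each top-level cluster, so dense regions do not dominate the sample. This is only an assumption about how such a stage could work, not VDC's code: `kmeans`, `hierarchical_sample`, and all parameters are hypothetical names, and a real run would use image embeddings and an optimized k-means.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        # Move each centroid to the mean of its members; keep empty clusters as-is.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


def hierarchical_sample(points, k_fine, k_coarse, per_coarse, seed=0):
    """Return sorted indices of points, sampled evenly across coarse clusters."""
    rng = random.Random(seed)
    fine_centroids, fine_labels = kmeans(points, k_fine, seed=seed)
    # Second level: cluster the fine centroids themselves.
    _, coarse_of_fine = kmeans(fine_centroids, k_coarse, seed=seed)
    groups = {}
    for index, fine in enumerate(fine_labels):
        groups.setdefault(coarse_of_fine[fine], []).append(index)
    selected = []
    for coarse in sorted(groups):
        members = groups[coarse]
        selected.extend(rng.sample(members, min(per_coarse, len(members))))
    return sorted(selected)
```

For example, `hierarchical_sample(points, k_fine=4, k_coarse=2, per_coarse=3)` returns at most six indices, regardless of how unevenly the points are distributed.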
Installation
From PyPI
pip install vision-data-curation
From Source
git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .
Working directly from the project root lets you run the scripts and use the configuration as if the package were fully installed.
Usage
Each step is a script under vdc.scripts.
Examples:
# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/
# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv
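As a rough idea of what filtering against unwanted examples involves, the sketch below compares each image embedding to the unwanted embeddings and removes anything that is too similar. It assumes cosine similarity and a fixed threshold; the metric, the threshold, the CSV layout, and the `split_by_similarity` name are all assumptions for illustration and may not match what `vdc.scripts.filter_by_examples` actually does.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_by_similarity(embeddings, unwanted, threshold=0.9):
    """Split image names into (kept, removed).

    embeddings: dict mapping image name -> embedding vector
    unwanted:   list of embedding vectors for the unwanted examples
    threshold:  hypothetical cutoff; an image is removed when its best
                similarity to any unwanted example reaches this value
    """
    kept, removed = [], []
    for name, vector in embeddings.items():
        best = max((cosine_similarity(vector, u) for u in unwanted), default=0.0)
        (removed if best >= threshold else kept).append(name)
    return kept, removed


embeddings = {
    "cat.jpg": [1.0, 0.0],
    "meme.jpg": [0.0, 1.0],
    "near_meme.jpg": [0.1, 0.99],
}
kept, removed = split_by_similarity(embeddings, unwanted=[[0.0, 1.0]])
```

Here only `cat.jpg` is kept, because the other two vectors point in nearly the same direction as the unwanted example.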
- Run python -m vdc.scripts to see available scripts
- Run python -m vdc.scripts.<script> --help for options
Configuration:
- Default settings live in vdc/conf/config.json
- A config.json in your project root will take precedence (or pass --config to any script)
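The lookup order described above could be sketched as follows. This is an illustration only: `resolve_config_path` and `load_config` are hypothetical helpers, and the sketch assumes that an explicit --config path wins over a project-root config.json, which the text above does not state. Whether VDC merges a local file with the defaults or replaces them entirely is also not specified, so the sketch simply picks one file.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_config_path(
    default_path: Path, project_root: Path, cli_path: Optional[Path] = None
) -> Path:
    """Pick the config file to use.

    Assumed order: explicit --config path, then config.json in the
    project root, then the packaged default.
    """
    if cli_path is not None:
        return cli_path
    local = project_root / "config.json"
    if local.is_file():
        return local
    return default_path


def load_config(path: Path) -> dict:
    """Read a JSON config file into a dictionary."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A script would call `load_config(resolve_config_path(default, Path.cwd(), args.config))`, where `default` points at vdc/conf/config.json and `args.config` is the parsed --config value (both names assumed here).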
License
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.