Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
[Paper | BibTex | 🤗Dataset | 📂Logs]
Official Implementation for "Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images".
Lingao Xiao, Songhua Liu, Yang He*, Xinchao Wang
Abstract: Dataset distillation and dataset pruning aim to compress datasets to improve computational and storage efficiency. However, due to differing applications, they are typically not compared directly, creating uncertainty about their relative performance. Additionally, inconsistencies in evaluation settings among dataset distillation studies prevent fair comparisons and hinder reproducibility. Therefore, there is an urgent need for a benchmark that can equitably evaluate methodologies across both distillation and pruning literature. Notably, our benchmark has demonstrated the effectiveness of soft labels in evaluations, even for randomly selected subsets. This advantage has shifted researchers' focus away from the images themselves, but soft labels are cumbersome to store and use. To address these concerns, we propose a framework, Prune, Combine, and Augment (PCA), which prioritizes image data and relies solely on hard labels for evaluation. Our benchmark and framework aim to refocus attention on image data in dataset compression research, paving the way for more balanced and accessible techniques.
TODOs
- release large-scale benchmark
- release SOTA datasets
- release PCA framework
- release PCA datasets
*Note: for the soft-label benchmark, we use fast evaluation code without relabeling.
Datasets (🤗Hugging Face)
SOTA datasets used in our experiments are available at 🤗Hugging Face. We have preprocessed all images into a fixed 224x224 resolution and created the datasets for a fair storage comparison.
| Method | Type | Venue | Dataset Key | Available IPCs |
|---|---|---|---|---|
| random | - | - | he-yang/2025-rethinkdc-imagenet-random-ipc-[IPC] | [1,10,20,50,100,200] |
| SRe2L | D | NeurIPS'23 | he-yang/2025-rethinkdc-imagenet-sre2l-ipc-[IPC] | [10,50,100] |
| CDA | D | TMLR'24 | he-yang/2025-rethinkdc-imagenet-cda-ipc-[IPC] | [10,50,100] |
| G-VBSM | D | CVPR'24 | he-yang/2025-rethinkdc-imagenet-gvbsm-ipc-[IPC] | [10,50,100] |
| LPLD | D | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-lpld-ipc-[IPC] | [10,50,100] |
| RDED | D | CVPR'24 | he-yang/2025-rethinkdc-imagenet-rded-ipc-[IPC] | [10,50,100] |
| DWA | D | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-dwa-ipc-[IPC] | [10,50,100] |
| Forgetting | P | ICLR'19 | he-yang/2025-rethinkdc-imagenet-forgetting-ipc-[IPC] | [10,50,100] |
| EL2N | P | NeurIPS'21 | he-yang/2025-rethinkdc-imagenet-el2n-ipc-[IPC] | [10,50,100] |
| AUM | P | NeurIPS'20 | he-yang/2025-rethinkdc-imagenet-aum-ipc-[IPC] | [10,50,100] |
| CCS | P | ICLR'23 | he-yang/2025-rethinkdc-imagenet-ccs-ipc-[IPC] | [10,50,100] |
*D* denotes the dataset distillation literature, and *P* denotes dataset pruning.
To use a dataset, you do NOT need to download it manually:
```python
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("[Dataset-Key]")
```
Or, simply install our package and put the dataset-key as training directory. For more details, please follow Package Usage.
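All dataset keys in the table above follow one naming pattern, so they can also be constructed programmatically. A minimal sketch (the `dataset_key` helper below is hypothetical, not part of the package):

```python
# Hypothetical helper: build a Hugging Face dataset key from the
# method name and IPC value listed in the table above.
def dataset_key(method: str, ipc: int) -> str:
    return f"he-yang/2025-rethinkdc-imagenet-{method}-ipc-{ipc}"

# e.g. the random subset with 10 images per class:
print(dataset_key("random", 10))
# he-yang/2025-rethinkdc-imagenet-random-ipc-10
```

The resulting key can be passed to `load_dataset` as shown above, or used directly as the training directory for the CLI.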
Installation
Install from pip (tested on python=3.12)
```shell
pip install rethinkdc
```
Install from source
Step 1: Clone Repo,
```shell
git clone https://github.com/ArmandXiao/Rethinking-Dataset-Compression.git
cd Rethinking-Dataset-Compression
```
Step 2: Create Environment,
```shell
conda env create -f environment.yml
conda activate rethinkdc
```
Step 3: Install Benchmark,
```shell
make build
make install
```
Usage
Prepare:
```shell
export IMAGENET_VAL_DIR=[YOUR_PATH_TO_IMAGENET_VALIDATION_DIR]
# example
export IMAGENET_VAL_DIR="./imagenet/val"
```
General Usage:
```shell
rethinkdc --help
```
Example Usage:
```shell
# use the default soft/hard standard evaluation setting
rethinkdc [YOUR_PATH_TO_DATASET] [*ARGS]

# change the training dataset (test the random dataset)
rethinkdc he-yang/2025-rethinkdc-imagenet-random-ipc-10 --soft --ipc 10 --output-dir ./random_ipc10_soft
```
- More examples can be found in the `script` folder.
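To evaluate several IPC settings in one go, the command lines can be generated in a loop. A sketch under the assumption that the `--soft`, `--ipc`, and `--output-dir` flags behave as in the example above (the `eval_command` helper is hypothetical):

```python
# Hypothetical helper: assemble a rethinkdc command for one run.
# Flags mirror the example usage above; adjust paths to your setup.
def eval_command(method: str, ipc: int) -> str:
    key = f"he-yang/2025-rethinkdc-imagenet-{method}-ipc-{ipc}"
    return (f"rethinkdc {key} --soft --ipc {ipc} "
            f"--output-dir ./{method}_ipc{ipc}_soft")

# Print the commands for a soft-label sweep over the random subsets:
for ipc in (10, 50, 100):
    print(eval_command("random", ipc))
```

Each printed line can then be run in a shell (or passed to `subprocess.run`) once the package is installed.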
Main Table Result (📂Google Drive)
Logs for the main tables are provided in Google Drive for reference.
| Table | Explanation |
|---|---|
| Table 3 | Random baselines in soft label setting with standard evaluation |
| Table 4 & Table 18 | SOTA methods in soft label setting with standard evaluation |
| Table 5 & Table 19 | SOTA methods in hard label setting with standard evaluation |
| Table 6 | SOTA Pruning Rules |
| Table 7 | Ablation Study of PCA |
| Table 8 | Cross-architecture Performance of PCA |
| Table 12 & Table 22 | Regularization-based Data Augmentation |
| Table 20 | Pure Noise as Input |
| Table 24 | PCA using Different Pruning Methods |
Related Repos
Our repo is built upon the following repos:
- https://github.com/VILA-Lab/SRe2L
- https://github.com/he-y/soft-label-pruning-for-dataset-distillation
- https://github.com/haizhongzheng/Coverage-centric-coreset-selection
Similar Repos:
Citation
File details
Details for the file rethinkdc-0.3.2.tar.gz.
File metadata
- Download URL: rethinkdc-0.3.2.tar.gz
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b0e1a79d1a2d463035c4623fa9fb6b406e0a265823564a4d5736bd83e4f3796 |
| MD5 | 5f48f4bbfe621c2bb6a381ab7524ae65 |
| BLAKE2b-256 | 24ea8cd8fdb304b0b110daaac362e4be4a264caeb887e62a0351f693a42f051c |
File details
Details for the file rethinkdc-0.3.2-py3-none-any.whl.
File metadata
- Download URL: rethinkdc-0.3.2-py3-none-any.whl
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 55b19e217980b50a51474917fe94f899ffe60c99e94a876c2cf667fb61cc870a |
| MD5 | 59a37683ade2db722517af582b289023 |
| BLAKE2b-256 | 0c73a9a7fc9c1b45115e7628ab0cf4b7bdcf616b19d2c04c54417556c76cee4d |