Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
[Paper | BibTex | 🤗Dataset | 📂Logs]
Official Implementation for "Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images".
Lingao Xiao, Songhua Liu, Yang He*, Xinchao Wang
Abstract: Dataset distillation and dataset pruning aim to compress datasets to improve computational and storage efficiency. However, due to differing applications, they are typically not compared directly, creating uncertainty about their relative performance. Additionally, inconsistencies in evaluation settings among dataset distillation studies prevent fair comparisons and hinder reproducibility. Therefore, there is an urgent need for a benchmark that can equitably evaluate methodologies across both distillation and pruning literature. Notably, our benchmark has demonstrated the effectiveness of soft labels in evaluations, even for randomly selected subsets. This advantage has shifted researchers' focus away from the images themselves, but soft labels are cumbersome to store and use. To address these concerns, we propose a framework, Prune, Combine, and Augment (PCA), which prioritizes image data and relies solely on hard labels for evaluation. Our benchmark and framework aim to refocus attention on image data in dataset compression research, paving the way for more balanced and accessible techniques.
TODOs
- release large-scale benchmark
- release SOTA datasets
- release PCA framework
- release PCA datasets
*Note: for the soft-label benchmark, we use fast evaluation code without relabeling.
Datasets (🤗Hugging Face)
SOTA datasets used in our experiments are available at 🤗Hugging Face. We have preprocessed all images into a fixed 224x224 resolution and created the datasets for a fair storage comparison.
| Method | Type | Venue | Dataset Key | Available IPCs |
|---|---|---|---|---|
| random | - | - | he-yang/2025-rethinkdc-imagenet-random-ipc-[IPC] | [1,10,20,50,100,200] |
| SRe2L | D | NeurIPS'23 | he-yang/2025-rethinkdc-imagenet-sre2l-ipc-[IPC] | [10,50,100] |
| CDA | D | TMLR'24 | he-yang/2025-rethinkdc-imagenet-cda-ipc-[IPC] | [10,50,100] |
| G-VBSM | D | CVPR'24 | he-yang/2025-rethinkdc-imagenet-gvbsm-ipc-[IPC] | [10,50,100] |
| LPLD | D | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-lpld-ipc-[IPC] | [10,50,100] |
| RDED | D | CVPR'24 | he-yang/2025-rethinkdc-imagenet-rded-ipc-[IPC] | [10,50,100] |
| DWA | D | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-dwa-ipc-[IPC] | [10,50,100] |
| Forgetting | P | ICLR'19 | he-yang/2025-rethinkdc-imagenet-forgetting-ipc-[IPC] | [10,50,100] |
| EL2N | P | NeurIPS'21 | he-yang/2025-rethinkdc-imagenet-el2n-ipc-[IPC] | [10,50,100] |
| AUM | P | NeurIPS'20 | he-yang/2025-rethinkdc-imagenet-aum-ipc-[IPC] | [10,50,100] |
| CCS | P | ICLR'23 | he-yang/2025-rethinkdc-imagenet-ccs-ipc-[IPC] | [10,50,100] |
*D* denotes the dataset distillation literature, and *P* denotes dataset pruning.
To use a dataset, you do NOT need to download it manually:
```python
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("[Dataset-Key]")
```
Or, simply install our package and put the dataset-key as training directory. For more details, please follow Package Usage.
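All dataset keys in the table above follow one naming pattern, so they can also be constructed programmatically. A minimal sketch (the `dataset_key` helper below is hypothetical, not part of the package):

```python
# Hypothetical helper: build a Hugging Face dataset key from the
# method name and IPC value listed in the table above.
def dataset_key(method: str, ipc: int) -> str:
    return f"he-yang/2025-rethinkdc-imagenet-{method}-ipc-{ipc}"

# e.g. the random subset with 10 images per class:
print(dataset_key("random", 10))
# he-yang/2025-rethinkdc-imagenet-random-ipc-10
```

The resulting key can be passed to `load_dataset` as shown above, or used directly as the training directory for the CLI.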
Installation
Install from pip (tested on python=3.12)
```shell
pip install rethinkdc
```
Install from source
Step 1: Clone Repo,
```shell
git clone https://github.com/ArmandXiao/Rethinking-Dataset-Compression.git
cd Rethinking-Dataset-Compression
```
Step 2: Create Environment,
```shell
conda env create -f environment.yml
conda activate rethinkdc
```
Step 3: Install Benchmark,
```shell
make build
make install
```
Usage
Prepare:
```shell
export IMAGENET_VAL_DIR=[YOUR_PATH_TO_IMAGENET_VALIDATION_DIR]
# example
export IMAGENET_VAL_DIR="./imagenet/val"
```
General Usage:
```shell
rethinkdc --help
```
Example Usage:
```shell
# use the default soft/hard standard evaluation setting
rethinkdc [YOUR_PATH_TO_DATASET] [*ARGS]

# change the training dataset (test the random dataset)
rethinkdc he-yang/2025-rethinkdc-imagenet-random-ipc-10 --soft --ipc 10 --output-dir ./random_ipc10_soft
```
- More examples can be found in the `script` folder.
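To evaluate several IPC settings in one go, the command lines can be generated in a loop. A sketch under the assumption that the `--soft`, `--ipc`, and `--output-dir` flags behave as in the example above (the `eval_command` helper is hypothetical):

```python
# Hypothetical helper: assemble a rethinkdc command for one run.
# Flags mirror the example usage above; adjust paths to your setup.
def eval_command(method: str, ipc: int) -> str:
    key = f"he-yang/2025-rethinkdc-imagenet-{method}-ipc-{ipc}"
    return (f"rethinkdc {key} --soft --ipc {ipc} "
            f"--output-dir ./{method}_ipc{ipc}_soft")

# Print the commands for a soft-label sweep over the random subsets:
for ipc in (10, 50, 100):
    print(eval_command("random", ipc))
```

Each printed line can then be run in a shell (or passed to `subprocess.run`) once the package is installed.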
Main Table Result (📂Google Drive)
Logs for the main tables are provided in Google Drive for reference.
| Table | Explanation |
|---|---|
| Table 3 | Random baselines in soft label setting with standard evaluation |
| Table 4 & Table 18 | SOTA methods in soft label setting with standard evaluation |
| Table 5 & Table 19 | SOTA methods in hard label setting with standard evaluation |
| Table 6 | SOTA Pruning Rules |
| Table 7 | Ablation Study of PCA |
| Table 8 | Cross-architecture Performance of PCA |
| Table 12 & Table 22 | Regularization-based Data Augmentation |
| Table 20 | Pure Noise as Input |
| Table 24 | PCA using Different Pruning Methods |
Related Repos
Our repo is built upon the following repos:
- https://github.com/VILA-Lab/SRe2L
- https://github.com/he-y/soft-label-pruning-for-dataset-distillation
- https://github.com/haizhongzheng/Coverage-centric-coreset-selection
Similar Repos:
Citation
File details
Details for the file rethinkdc-0.3.2.tar.gz.
File metadata
- Download URL: rethinkdc-0.3.2.tar.gz
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b0e1a79d1a2d463035c4623fa9fb6b406e0a265823564a4d5736bd83e4f3796 |
| MD5 | 5f48f4bbfe621c2bb6a381ab7524ae65 |
| BLAKE2b-256 | 24ea8cd8fdb304b0b110daaac362e4be4a264caeb887e62a0351f693a42f051c |
File details
Details for the file rethinkdc-0.3.2-py3-none-any.whl.
File metadata
- Download URL: rethinkdc-0.3.2-py3-none-any.whl
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 55b19e217980b50a51474917fe94f899ffe60c99e94a876c2cf667fb61cc870a |
| MD5 | 59a37683ade2db722517af582b289023 |
| BLAKE2b-256 | 0c73a9a7fc9c1b45115e7628ab0cf4b7bdcf616b19d2c04c54417556c76cee4d |