
Rethinking Large-scale Dataset Compression

Project description

Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images

[Paper | BibTex | 🤗Dataset | 📂Logs]


Official Implementation for "Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images".

Lingao Xiao, Songhua Liu, Yang He*, Xinchao Wang

Abstract: Dataset distillation and dataset pruning aim to compress datasets to improve computational and storage efficiency. However, due to differing applications, they are typically not compared directly, creating uncertainty about their relative performance. Additionally, inconsistencies in evaluation settings among dataset distillation studies prevent fair comparisons and hinder reproducibility. Therefore, there is an urgent need for a benchmark that can equitably evaluate methodologies across both distillation and pruning literature. Notably, our benchmark has demonstrated the effectiveness of soft labels in evaluations, even for randomly selected subsets. This advantage has shifted researchers' focus away from the images themselves, but soft labels are cumbersome to store and use. To address these concerns, we propose a framework, Prune, Combine, and Augment (PCA), which prioritizes image data and relies solely on hard labels for evaluation. Our benchmark and framework aim to refocus attention on image data in dataset compression research, paving the way for more balanced and accessible techniques.

TODOs

  • release large-scale benchmark
  • release SOTA datasets
  • release PCA framework
  • release PCA datasets

*Note: for the soft-label benchmark, we use fast evaluation code without relabeling.

Datasets (🤗Hugging Face)

SOTA datasets used in our experiments are available at 🤗Hugging Face. We have preprocessed all images to a fixed 224x224 resolution and created the datasets for a fair storage comparison.

| Method | Type | Venue | Dataset Key | Available IPCs |
| --- | --- | --- | --- | --- |
| random | - | - | he-yang/2025-rethinkdc-imagenet-random-ipc-[IPC] | [1,10,20,50,100,200] |
| SRe2L | D | NeurIPS'23 | he-yang/2025-rethinkdc-imagenet-sre2l-ipc-[IPC] | [10,50,100] |
| CDA | D | TMLR'24 | he-yang/2025-rethinkdc-imagenet-cda-ipc-[IPC] | [10,50,100] |
| G-VBSM | D | CVPR'24 | he-yang/2025-rethinkdc-imagenet-gvbsm-ipc-[IPC] | [10,50,100] |
| LPLD | D | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-lpld-ipc-[IPC] | [10,50,100] |
| RDED | D | CVPR'24 | he-yang/2025-rethinkdc-imagenet-rded-ipc-[IPC] | [10,50,100] |
| DWA | D | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-dwa-ipc-[IPC] | [10,50,100] |
| Forgetting | P | ICLR'19 | he-yang/2025-rethinkdc-imagenet-forgetting-ipc-[IPC] | [10,50,100] |
| EL2N | P | NeurIPS'21 | he-yang/2025-rethinkdc-imagenet-el2n-ipc-[IPC] | [10,50,100] |
| AUM | P | NeurIPS'20 | he-yang/2025-rethinkdc-imagenet-aum-ipc-[IPC] | [10,50,100] |
| CCS | P | ICLR'23 | he-yang/2025-rethinkdc-imagenet-ccs-ipc-[IPC] | [10,50,100] |
  • D denotes the dataset distillation literature, and P the dataset pruning literature.

To use these datasets, you do NOT need to download them manually:

from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("[Dataset-Key]")
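For instance, a minimal sketch that loads and inspects one of the random subsets from the table above (the inspected split and field names are assumptions based on typical Hugging Face image datasets, not guarantees about these particular datasets):

from datasets import load_dataset

# Load the IPC-10 random subset listed in the table above
ds = load_dataset("he-yang/2025-rethinkdc-imagenet-random-ipc-10")

# Print the available splits and features; field names such as
# "image" and "label" are assumptions and may differ here
print(ds)
first_split = list(ds.keys())[0]
print(ds[first_split][0].keys())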

Or, simply install our package and pass the dataset key as the training directory. For more details, please follow Package Usage.

Installation

Install from pip (tested on python=3.12)

pip install rethinkdc
Install from source

Step 1: Clone the repo:

git clone https://github.com/ArmandXiao/Rethinking-Dataset-Compression.git
cd Rethinking-Dataset-Compression

Step 2: Create the environment:

conda env create -f environment.yml
conda activate rethinkdc

Step 3: Install the benchmark:

make build
make install

Usage

Prepare:

export IMAGENET_VAL_DIR=[YOUR_PATH_TO_IMAGENET_VALIDATION_DIR]

# example
export IMAGENET_VAL_DIR="./imagenet/val"

General Usage:

rethinkdc --help

Example Usage:

# use default soft/hard standard evaluation setting
rethinkdc [YOUR_PATH_TO_DATASET] [*ARGS]

# change training dataset (test random dataset)
rethinkdc he-yang/2025-rethinkdc-imagenet-random-ipc-10 --soft --ipc 10 --output-dir ./random_ipc10_soft
  • More examples can be found in the script folder; see also the end-to-end sketch below.
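Putting the preparation and evaluation steps together, a minimal end-to-end sketch (the dataset path ./my_subset and the output directory are placeholders; only the flags documented above are used):

# Point the benchmark at your ImageNet validation set
export IMAGENET_VAL_DIR="./imagenet/val"

# Evaluate a locally stored subset with soft labels at IPC 10
rethinkdc ./my_subset --soft --ipc 10 --output-dir ./my_subset_soft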

Main Table Result (📂Google Drive)

Logs for the main tables are provided in Google Drive for reference.

| Table | Explanation |
| --- | --- |
| Table 3 | Random baselines in the soft-label setting with standard evaluation |
| Table 4 & Table 18 | SOTA methods in the soft-label setting with standard evaluation |
| Table 5 & Table 19 | SOTA methods in the hard-label setting with standard evaluation |
| Table 6 | SOTA pruning rules |
| Table 7 | Ablation study of PCA |
| Table 8 | Cross-architecture performance of PCA |
| Table 12 & Table 22 | Regularization-based data augmentation |
| Table 20 | Pure noise as input |
| Table 24 | PCA using different pruning methods |

Related Repos

Our repo is built upon the following repos:

Similar Repos:

Citation



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rethinkdc-0.3.2.tar.gz (19.5 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rethinkdc-0.3.2-py3-none-any.whl (18.2 kB)

Uploaded Python 3

File details

Details for the file rethinkdc-0.3.2.tar.gz.

File metadata

  • Download URL: rethinkdc-0.3.2.tar.gz
  • Upload date:
  • Size: 19.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for rethinkdc-0.3.2.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 9b0e1a79d1a2d463035c4623fa9fb6b406e0a265823564a4d5736bd83e4f3796 |
| MD5 | 5f48f4bbfe621c2bb6a381ab7524ae65 |
| BLAKE2b-256 | 24ea8cd8fdb304b0b110daaac362e4be4a264caeb887e62a0351f693a42f051c |

See more details on using hashes here.
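For a quick local check, a minimal sketch that verifies a downloaded copy of the sdist against the published SHA256 (the file path is a placeholder for wherever you saved it):

import hashlib

# Compare the SHA256 of a locally downloaded sdist with the published digest
expected = "9b0e1a79d1a2d463035c4623fa9fb6b406e0a265823564a4d5736bd83e4f3796"
with open("rethinkdc-0.3.2.tar.gz", "rb") as f:  # placeholder path
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "MISMATCH")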

File details

Details for the file rethinkdc-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: rethinkdc-0.3.2-py3-none-any.whl
  • Upload date:
  • Size: 18.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for rethinkdc-0.3.2-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 55b19e217980b50a51474917fe94f899ffe60c99e94a876c2cf667fb61cc870a |
| MD5 | 59a37683ade2db722517af582b289023 |
| BLAKE2b-256 | 0c73a9a7fc9c1b45115e7628ab0cf4b7bdcf616b19d2c04c54417556c76cee4d |

See more details on using hashes here.
