
Density–Diversity Coresets (DDC) for dataset summarization and distributional approximation


dd-coresets

Density–Diversity Coresets (DDC): a small weighted set of real data points that approximates the empirical distribution of a large dataset.

This library exposes a simple API (in the spirit of scikit-learn) to:

  • build an unsupervised density–diversity coreset (fit_ddc_coreset);
  • compare against random and stratified baselines (fit_random_coreset, fit_stratified_coreset).

The goal is pragmatic: help data scientists work with large datasets using small, distribution-preserving subsets that are easy to simulate, visualise, and explain.


Motivation

In practice, we rarely work on the full dataset for everything:

  • Exploratory plots and dashboards need small, interpretable samples.
  • Scenario analysis and simulations need a few representative points with weights.
  • Prototyping models and ideas is faster on coresets than on full data.

Common approaches:

  • Random sampling: simple, but can miss important modes or tails.
  • Stratified sampling: good when you already know the right strata (segments, classes, products), but needs domain knowledge and alignment with stakeholders.
  • Cluster centroids (e.g. k-means): minimise reconstruction error, but centroids are not real data points and are not explicitly distributional.

DDC sits in between:

  • Unsupervised, geometry-aware.
  • Selects real points (medoids) that cover dense regions and diverse modes.
  • Learns weights via soft assignments, approximating the empirical distribution.

Installation

git clone https://github.com/crbazevedo/dd-coresets.git
cd dd-coresets

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r requirements.txt

Dependencies (minimal):

  • numpy
  • pandas
  • scikit-learn
  • matplotlib (for examples/plots)

Quickstart

1. Fit a DDC coreset (unsupervised default)

import numpy as np
from dd_coresets.ddc import fit_ddc_coreset

# X: (n, d) preprocessed features (e.g. scaled, encoded, etc.)
X = ...  # load your data here

S, w, info = fit_ddc_coreset(
    X,
    k=200,           # number of representatives
    n0=20000,        # working sample size (None = use all)
    m_neighbors=32,  # kNN for density
    alpha=0.3,       # density–diversity trade-off
    gamma=1.0,       # kernel scale
    refine_iters=1,  # medoid refinement iters
    reweight_full=True,
    random_state=42,
)

# S: (k, d) real data points
# w: (k,) non-negative, sum to 1
# info: metadata (indices, parameters, etc.)
print(S.shape, w.shape)
print(info.method, info.k, info.n, info.n0)

You can now use (S, w) for:

  • simulation / scenario analysis,
  • plotting weighted histograms or KDEs,
  • approximate distributional comparisons.
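As a minimal sketch of these uses, the snippet below works with a stand-in coreset (random `S` and `w` generated only for illustration; in practice both come from `fit_ddc_coreset`). It relies only on NumPy calls and makes no assumption about the library internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in coreset for illustration only; replace with the output of fit_ddc_coreset.
k, d = 200, 3
S = rng.normal(size=(k, d))
w = rng.random(k)
w /= w.sum()  # non-negative weights that sum to 1

# Weighted moments: coreset estimates of the full-data mean and covariance.
mean_hat = np.average(S, axis=0, weights=w)
cov_hat = np.cov(S, rowvar=False, aweights=w)

# Weighted histogram of the first feature (density=True normalises the area to 1).
hist, edges = np.histogram(S[:, 0], bins=20, weights=w, density=True)

# Simulation: draw scenarios by resampling representatives in proportion to their weights.
idx = rng.choice(k, size=10_000, replace=True, p=w)
scenarios = S[idx]

print(mean_hat.shape, cov_hat.shape, scenarios.shape)
```

The same `weights=w` argument is accepted by `matplotlib.pyplot.hist`, so weighted plots need no resampling.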

2. Random and stratified baselines

from dd_coresets.ddc import (
    fit_random_coreset,
    fit_stratified_coreset,
)

# Random coreset (no domain knowledge)
S_rnd, w_rnd, info_rnd = fit_random_coreset(
    X,
    k=200,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=0,
)

# Stratified coreset (when you have strata)
# strata: 1D array, same length as X, e.g. segment, class, product line
strata = ...  # e.g. y labels or business segments

S_strat, w_strat, info_strat = fit_stratified_coreset(
    X,
    strata=strata,
    k=200,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=0,
)

Use these baselines to benchmark DDC on your data (moment errors, Wasserstein distances, etc.).
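One possible way to compute such metrics is sketched below. The helper names (`mean_error`, `cov_error`, `w1_marginals`) are hypothetical and not part of dd-coresets, and the sketch assumes SciPy is available (it is installed alongside scikit-learn). The stand-in coreset is a uniform random subset with equal weights; swap in any `(S, w)` pair returned by the fit functions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mean_error(X, S, w):
    """Euclidean norm of (full-data mean - weighted coreset mean)."""
    return float(np.linalg.norm(X.mean(axis=0) - np.average(S, axis=0, weights=w)))

def cov_error(X, S, w):
    """Frobenius norm of the difference between the two covariance matrices."""
    cov_full = np.cov(X, rowvar=False, bias=True)
    cov_core = np.cov(S, rowvar=False, aweights=w, bias=True)
    return float(np.linalg.norm(cov_full - cov_core))

def w1_marginals(X, S, w):
    """Wasserstein-1 distance per feature between full data and weighted coreset."""
    return np.array([
        wasserstein_distance(X[:, j], S[:, j], v_weights=w)
        for j in range(X.shape[1])
    ])

# Illustration on synthetic data with a uniform random subset as the "coreset".
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
idx = rng.choice(len(X), size=200, replace=False)
S, w = X[idx], np.full(200, 1 / 200)

print(mean_error(X, S, w), cov_error(X, S, w), w1_marginals(X, S, w))
```

Running the three helpers on the DDC, random, and stratified outputs with the same `k` gives a like-for-like comparison.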


API Overview

All functions assume X is a NumPy array of shape (n, d) with preprocessed numerical features (e.g. scaled, encoded, etc.).
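For readers starting from a raw table, here is one way to produce such an array with scikit-learn. This is a sketch, not a required pipeline: the DataFrame and its column names are made up for illustration, and the choice of `StandardScaler` plus `OneHotEncoder` is an assumption about what "scaled, encoded" might mean for your data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw table; the column names are placeholders.
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 38],
    "income": [32_000.0, 81_000.0, 54_000.0, 97_000.0, 61_000.0],
    "segment": ["a", "b", "a", "c", "b"],
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = np.asarray(preprocess.fit_transform(df), dtype=float)
print(X.shape)  # (5, 5): 2 scaled numeric columns + 3 one-hot columns
```

Because the coreset functions work with distances, features left on very different scales would dominate the geometry, which is why scaling is done first here.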

fit_ddc_coreset

S, w, info = fit_ddc_coreset(
    X,
    k,
    n0=20000,
    m_neighbors=32,
    alpha=0.3,
    gamma=1.0,
    refine_iters=1,
    reweight_full=True,
    random_state=None,
)
  • Parameters

    • X: (n, d) array-like, preprocessed data.
    • k: number of representatives.
    • n0: working sample size. If None or >= n, uses all data.
    • m_neighbors: kNN parameter for local density.
    • alpha: density–diversity trade-off (0 ≈ diversity, 1 ≈ density).
    • gamma: kernel scale multiplier (used in soft assignment).
    • refine_iters: medoid refinement iterations (usually 1 is enough).
    • reweight_full: if True, reweights using the full dataset; else uses only the working sample.
    • random_state: RNG seed.
  • Returns

    • S: (k, d) representatives (real data points).
    • w: (k,) weights (w >= 0, sum(w) = 1).
    • info: CoresetInfo with metadata (method name, n, n0, indices, params).

Recommended use:
Default choice when you do not yet know which strata or labels matter. Good for EDA, exploratory simulation, and early-stage modelling.
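To build intuition for `m_neighbors` and `alpha`, the following is an illustrative greedy selection that trades off a kNN density proxy against distance to already-chosen points. It is a sketch under stated assumptions, not the library's implementation: the function name `density_diversity_select`, the inverse-mean-distance density proxy, and the multiplicative score are all choices made here for illustration.

```python
import numpy as np

def density_diversity_select(X, k, m_neighbors=32, alpha=0.3):
    """Greedy sketch: balance kNN density against distance to chosen points."""
    n = X.shape[0]
    # Pairwise Euclidean distances; O(n^2) memory, so only for a small working sample.
    sq = np.sum(X**2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

    # Density proxy: inverse mean distance to the m nearest neighbours (self excluded).
    m = min(m_neighbors, n - 1)
    knn = np.sort(D, axis=1)[:, 1:m + 1]
    density = 1.0 / (knn.mean(axis=1) + 1e-12)
    density /= density.max()

    selected = [int(np.argmax(density))]  # start from the densest point
    min_dist = D[selected[0]].copy()      # distance to the nearest selected point
    while len(selected) < k:
        diversity = min_dist / (min_dist.max() + 1e-12)
        # alpha near 1 favours dense regions, alpha near 0 favours far-away points.
        score = density**alpha * diversity**(1.0 - alpha)
        score[selected] = -np.inf         # never pick the same point twice
        j = int(np.argmax(score))
        selected.append(j)
        min_dist = np.minimum(min_dist, D[j])
    return np.array(selected)

# Two well-separated clusters: the selection should place representatives in both.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(250, 2)),
    rng.normal(20.0, 1.0, size=(250, 2)),
])
idx = density_diversity_select(X, k=10)
S = X[idx]  # representatives are rows of X, i.e. real data points
print(S.shape)
```

With `alpha=0` this sketch reduces to farthest-point sampling, and with `alpha=1` it simply picks the `k` densest points, which matches the direction of the trade-off described for `alpha` above.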


fit_random_coreset

S, w, info = fit_random_coreset(
    X,
    k,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=None,
)
  • Samples k points uniformly from a working sample (size n0) and applies the same soft-weighting scheme as DDC.
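The sketch below shows one plausible reading of "uniform sampling plus soft weighting". It is not taken from the dd-coresets source: the helper names, the Gaussian kernel, and the bandwidth heuristic (`gamma` times the median nearest-representative distance) are assumptions made for illustration, and the working-sample step (`n0`) is omitted for brevity.

```python
import numpy as np

def soft_weights(X, S, gamma=1.0):
    """Weights from Gaussian-kernel soft assignments of rows of X to rows of S."""
    # Squared distances between every data point and every representative: (n, k).
    d2 = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(S**2, axis=1)[None, :]
        - 2.0 * X @ S.T
    )
    d2 = np.maximum(d2, 0.0)
    # Assumed bandwidth heuristic: gamma scales the median nearest-representative distance.
    sigma2 = (gamma**2) * np.median(d2.min(axis=1)) + 1e-12
    logits = -d2 / (2.0 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    R = np.exp(logits)
    R /= R.sum(axis=1, keepdims=True)            # each row sums to 1
    return R.mean(axis=0)                        # average responsibility per representative

def random_coreset_sketch(X, k, gamma=1.0, random_state=None):
    """Uniformly sample k rows of X, then weight them by soft assignment."""
    rng = np.random.default_rng(random_state)
    idx = rng.choice(X.shape[0], size=k, replace=False)
    S = X[idx]
    return S, soft_weights(X, S, gamma), idx

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
S, w, idx = random_coreset_sketch(X, k=50, random_state=0)
print(S.shape, w.sum())
```

Averaging the per-point responsibilities guarantees non-negative weights that sum to 1, which is the contract stated for `w` in the API overview.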

Use case:
Baseline to compare against DDC and stratified; reflects what many teams do today (simple downsampling).


fit_stratified_coreset

S, w, info = fit_stratified_coreset(
    X,
    strata,
    k,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=None,
)
  • Parameters

    • X: (n, d) data.
    • strata: 1D array of length n with stratum labels (e.g. product, region, class, risk band).
    • Other parameters analogous to fit_random_coreset.
  • Internally:

    • Computes stratum frequencies on the working sample.
    • Allocates k_g representatives to each stratum g, in proportion to its frequency.
    • Samples uniformly inside each stratum.
    • Applies the same soft-weighting scheme as DDC.
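The allocation and sampling steps above can be sketched as follows. The function names are hypothetical, and largest-remainder rounding is an assumption: the library may round `k_g` differently. The sketch also assumes `k` does not exceed the number of rows.

```python
import numpy as np

def allocate_proportional(strata, k):
    """Largest-remainder allocation of k representatives across strata."""
    labels, counts = np.unique(strata, return_counts=True)
    quotas = k * counts / counts.sum()
    alloc = np.floor(quotas).astype(int)
    # Hand the leftover slots to the strata with the largest fractional parts.
    leftover = k - alloc.sum()
    order = np.argsort(-(quotas - alloc), kind="stable")
    alloc[order[:leftover]] += 1
    return dict(zip(labels.tolist(), alloc.tolist()))

def stratified_indices(strata, k, random_state=None):
    """Sample k_g row indices uniformly without replacement inside each stratum."""
    rng = np.random.default_rng(random_state)
    strata = np.asarray(strata)
    chosen = []
    for label, k_g in allocate_proportional(strata, k).items():
        members = np.flatnonzero(strata == label)
        chosen.append(rng.choice(members, size=k_g, replace=False))
    return np.concatenate(chosen)

strata = np.array(["a"] * 600 + ["b"] * 300 + ["c"] * 100)
print(allocate_proportional(strata, 20))  # {'a': 12, 'b': 6, 'c': 2}
idx = stratified_indices(strata, 20, random_state=0)
```

Note that a very small stratum can receive zero representatives under proportional allocation, which is one reason a larger `k` helps in the stratified setting.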

Use case:
When you know the relevant strata and must preserve their proportions (regulatory reporting, risk/actuarial slices, business segments).


Experiments

The repo includes two example scripts under experiments/:

  • synthetic_ddc_vs_baselines.py
    5D Gaussian mixture with:

    • DDC vs Random vs Stratified,
    • metrics: mean/cov/corr errors, Wasserstein-1 marginals, KS,
    • basic plots.
  • multimodal_2d_ring_ddc.py
    2D example (3 Gaussians + ring) for visual intuition:

    • shows how DDC covers multiple modes and a ring structure with few reps.

Run:

python experiments/synthetic_ddc_vs_baselines.py
python experiments/multimodal_2d_ring_ddc.py
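
For readers who want to try a similar shape without cloning the repo, here is a minimal generator for a "three Gaussians plus ring" dataset. It is not the code from `multimodal_2d_ring_ddc.py`: the function name, centres, radius, noise levels, and equal component shares are all placeholder choices.

```python
import numpy as np

def make_gaussians_plus_ring(n=6000, random_state=0):
    """2D toy data: three Gaussian blobs and a noisy ring (placeholder parameters)."""
    rng = np.random.default_rng(random_state)
    n_blob = n // 4  # three blobs and one ring, roughly equal shares
    centres = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    blobs = [c + 0.5 * rng.normal(size=(n_blob, 2)) for c in centres]
    # Ring: uniform angle, radius concentrated around 8.
    n_ring = n - 3 * n_blob
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_ring)
    radius = 8.0 + 0.2 * rng.normal(size=n_ring)
    ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    X = np.vstack(blobs + [ring])
    labels = np.repeat(np.arange(4), [n_blob, n_blob, n_blob, n_ring])
    return X, labels

X, labels = make_gaussians_plus_ring()
print(X.shape, np.bincount(labels))
```

The returned `labels` are only for colouring plots or checking mode coverage; the unsupervised coreset functions need just `X`.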

When to use what?

  • DDC (fit_ddc_coreset):
    Default in low-knowledge regimes (no clear strata yet). Designed to cover modes and tails better than random sampling at the same k; check this on your own data against the baselines.

  • Stratified (fit_stratified_coreset):
    Preferred in high-knowledge regimes (well-defined strata aligned with the task, e.g. risk bands, products), especially when k is large enough for every stratum to receive representatives.

  • Random (fit_random_coreset):
    Baseline and sanity check; still useful when you want the simplest possible comparison.


