Density–Diversity Coresets (DDC) for dataset summarization and distributional approximation

dd-coresets

Density–Diversity Coresets (DDC): a small weighted set of real data points that approximates the empirical distribution of a large dataset.

This library exposes a simple API (in the spirit of scikit-learn) to:

build an unsupervised density–diversity coreset (fit_ddc_coreset);
compare against random and stratified baselines (fit_random_coreset, fit_stratified_coreset).

The goal is pragmatic: help data scientists work with large datasets using small, distribution-preserving subsets that are easy to simulate, visualise, and explain.

Motivation

In practice, we rarely work on the full dataset for everything:

Exploratory plots and dashboards need small, interpretable samples.
Scenario analysis and simulations need a few representative points with weights.
Prototyping models and ideas is faster on coresets than on full data.

Common approaches:

Random sampling: simple, but can miss important modes or tails.
Stratified sampling: good when you already know the right strata (segments, classes, products), but needs domain knowledge and alignment with stakeholders.
Cluster centroids (e.g. k-means): minimise reconstruction error, but centroids are generally not real data points, and the objective does not explicitly target the data distribution.

DDC sits in between:

Unsupervised, geometry-aware.
Selects real points (medoids) that cover dense regions and diverse modes.
Learns weights via soft assignments, approximating the empirical distribution.

Installation

git clone https://github.com/crbazevedo/dd-coresets.git
cd dd-coresets

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r requirements.txt

Dependencies (minimal):

numpy
pandas
scikit-learn
matplotlib (for examples/plots)

Quickstart

1. Fit a DDC coreset (unsupervised default)

import numpy as np
from dd_coresets.ddc import fit_ddc_coreset

# X: (n, d) preprocessed features (e.g. scaled, encoded, etc.)
X = ...  # load your data here

S, w, info = fit_ddc_coreset(
    X,
    k=200,           # number of representatives
    n0=20000,        # working sample size (None = use all)
    m_neighbors=32,  # kNN for density
    alpha=0.3,       # density–diversity trade-off
    gamma=1.0,       # kernel scale
    refine_iters=1,  # medoid refinement iters
    reweight_full=True,
    random_state=42,
)

# S: (k, d) real data points
# w: (k,) non-negative, sum to 1
# info: metadata (indices, parameters, etc.)
print(S.shape, w.shape)
print(info.method, info.k, info.n, info.n0)

You can now use (S, w) for:

simulation / scenario analysis,
plotting weighted histograms or KDEs,
approximate distributional comparisons.
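
The uses listed above can be sketched with NumPy alone. The snippet below is an illustration, not part of the library: S and w are small synthetic stand-ins for the arrays returned by fit_ddc_coreset, and the bin count and number of simulated scenarios are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a coreset: k=5 representatives in d=2, weights summing to 1.
S = rng.normal(size=(5, 2))
w = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

# Weighted mean and covariance of the coreset.
mean_w = np.average(S, axis=0, weights=w)
cov_w = np.cov(S, rowvar=False, aweights=w, ddof=0)

# Weighted histogram of the first feature (density=True makes it integrate to 1).
hist, edges = np.histogram(S[:, 0], bins=10, weights=w, density=True)

# Scenario simulation: draw representatives in proportion to their weights.
idx = rng.choice(len(S), size=1000, replace=True, p=w)
scenarios = S[idx]
```

Because w sums to 1, it can be passed directly as the probability vector p when resampling.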

2. Random and stratified baselines

from dd_coresets.ddc import (
    fit_random_coreset,
    fit_stratified_coreset,
)

# Random coreset (no domain knowledge)
S_rnd, w_rnd, info_rnd = fit_random_coreset(
    X,
    k=200,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=0,
)

# Stratified coreset (when you have strata)
# strata: 1D array, same length as X, e.g. segment, class, product line
strata = ...  # e.g. y labels or business segments

S_strat, w_strat, info_strat = fit_stratified_coreset(
    X,
    strata=strata,
    k=200,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=0,
)

Use these baselines to benchmark DDC on your data (moment errors, Wasserstein distances, etc.).
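
A minimal sketch of such a benchmark, assuming only NumPy. The helper names weighted_moment_errors and wasserstein1_1d are hypothetical and not part of dd-coresets; the (S, w) pair here is a uniform random subsample used as a stand-in for any coreset.

```python
import numpy as np

def weighted_moment_errors(X, S, w):
    """Return (mean error, covariance error) of coreset (S, w) against full data X."""
    mean_err = np.linalg.norm(X.mean(axis=0) - np.average(S, axis=0, weights=w))
    cov_full = np.cov(X, rowvar=False, ddof=0)
    cov_core = np.cov(S, rowvar=False, aweights=w, ddof=0)
    cov_err = np.linalg.norm(cov_full - cov_core, ord="fro")
    return mean_err, cov_err

def wasserstein1_1d(x, s, w):
    """W1 distance between the empirical law of x and the weighted points (s, w)."""
    x = np.sort(np.asarray(x, dtype=float))
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(s)
    s, w = s[order], w[order] / w.sum()
    grid = np.sort(np.concatenate([x, s]))
    # CDF values on each interval [grid[i], grid[i+1]).
    cdf_x = np.searchsorted(x, grid[:-1], side="right") / x.size
    cum_w = np.concatenate([[0.0], np.cumsum(w)])
    cdf_s = cum_w[np.searchsorted(s, grid[:-1], side="right")]
    # W1 in one dimension is the integral of |F_x - F_s|.
    return float(np.sum(np.abs(cdf_x - cdf_s) * np.diff(grid)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
idx = rng.choice(len(X), size=50, replace=False)
S, w = X[idx], np.full(50, 1.0 / 50)  # stand-in for any (S, w) pair

mean_err, cov_err = weighted_moment_errors(X, S, w)
w1_per_feature = [wasserstein1_1d(X[:, j], S[:, j], w) for j in range(X.shape[1])]
print(mean_err, cov_err, w1_per_feature)
```

Running the same two helpers on the DDC, random, and stratified outputs for the same k gives directly comparable numbers.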

API Overview

All functions assume X is a NumPy array of shape (n, d) with preprocessed numerical features (e.g. scaled, encoded, etc.).

`fit_ddc_coreset`

S, w, info = fit_ddc_coreset(
    X,
    k,
    n0=20000,
    m_neighbors=32,
    alpha=0.3,
    gamma=1.0,
    refine_iters=1,
    reweight_full=True,
    random_state=None,
)

Parameters
- X: (n, d) array-like, preprocessed data.
- k: number of representatives.
- n0: working sample size. If None or >= n, uses all data.
- m_neighbors: kNN parameter for local density.
- alpha: density–diversity trade-off (0 ≈ diversity, 1 ≈ density).
- gamma: kernel scale multiplier (used in soft assignment).
- refine_iters: medoid refinement iterations (usually 1 is enough).
- reweight_full: if True, reweights using the full dataset; else uses only the working sample.
- random_state: RNG seed.

Returns
- S: (k, d) representatives (real data points).
- w: (k,) weights (w >= 0, sum(w) = 1).
- info: CoresetInfo with metadata (method name, n, n0, indices, params).

Recommended use:
Default choice when you do not yet know which strata or labels matter. Good for EDA, exploratory simulation, and early-stage modelling.
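
To build intuition for m_neighbors and alpha, here is one plausible way to trade local density against diversity. It is an assumption made for illustration and is not taken from the library's source: the function name greedy_density_diversity, the inverse m-th-neighbour-distance density proxy, and the multiplicative score are all hypothetical choices.

```python
import numpy as np

def greedy_density_diversity(X, k, m_neighbors=8, alpha=0.3):
    """Pick k row indices of X, trading local density against spread."""
    # Pairwise Euclidean distances (only sensible for a small working sample).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # kNN density proxy: inverse distance to the m-th nearest neighbour.
    r_m = np.sort(D, axis=1)[:, m_neighbors]  # column 0 is the point itself
    density = 1.0 / (r_m + 1e-12)
    density /= density.max()

    selected = [int(np.argmax(density))]  # start at the densest point
    min_dist = D[selected[0]].copy()      # distance to nearest selected point
    for _ in range(k - 1):
        diversity = min_dist / (min_dist.max() + 1e-12)
        # alpha=1 ranks by density only, alpha=0 by distance to the selection only.
        score = density**alpha * diversity ** (1.0 - alpha)
        score[selected] = -np.inf         # never pick the same point twice
        j = int(np.argmax(score))
        selected.append(j)
        min_dist = np.minimum(min_dist, D[j])
    return np.array(selected)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
idx = greedy_density_diversity(X, k=10, m_neighbors=8, alpha=0.3)
S = X[idx]  # representatives are real rows of X
```

With alpha=0 this sketch reduces to farthest-point selection, and with alpha=1 it simply picks the k densest points, which mirrors the documented meaning of the two extremes.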

`fit_random_coreset`

S, w, info = fit_random_coreset(
    X,
    k,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=None,
)

Samples k points uniformly from a working sample (size n0) and applies the same soft-weighting scheme as DDC.

Use case:
Baseline to compare against DDC and stratified; reflects what many teams do today (simple downsampling).
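
The soft-weighting scheme is not spelled out in this README. One plausible reading, sketched below as an assumption, is a Gaussian-kernel soft assignment of every data point to the representatives, averaged over the data. The function name soft_assignment_weights and the bandwidth heuristic (gamma times the median nearest-representative distance) are hypothetical.

```python
import numpy as np

def soft_assignment_weights(X, S, gamma=1.0):
    """Weights for representatives S from Gaussian-kernel soft assignments of X."""
    # Squared distances from every data point to every representative: (n, k).
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    # Assumed bandwidth: gamma times the median distance to the nearest representative.
    sigma = gamma * np.median(np.sqrt(d2.min(axis=1))) + 1e-12
    logits = -d2 / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)      # each row sums to 1
    w = resp.mean(axis=0)                        # average responsibility per representative
    return w / w.sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(90, 2)),
               rng.normal(5.0, 0.1, size=(10, 2))])
S = np.array([X[0], X[95]])  # one representative per cluster
w = soft_assignment_weights(X, S, gamma=1.0)
print(w)
```

Under this reading, passing the full dataset as X would correspond to reweight_full=True, and passing only the working sample to reweight_full=False.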

`fit_stratified_coreset`

S, w, info = fit_stratified_coreset(
    X,
    strata,
    k,
    n0=20000,
    gamma=1.0,
    reweight_full=True,
    random_state=None,
)

Parameters
- X: (n, d) data.
- strata: 1D array of length n with stratum labels (e.g. product, region, class, risk band).
- Other parameters analogous to fit_random_coreset.

Internally:
- Computes stratum frequencies on the working sample.
- Allocates k_g representatives to each stratum g, in proportion to its frequency.
- Samples uniformly inside each stratum.
- Applies the same soft-weighting scheme as DDC.

Use case:
When you know the relevant strata and must preserve their proportions (regulatory reporting, risk/actuarial slices, business segments).
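
The allocation and sampling steps listed under "Internally" can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, and largest-remainder rounding is one possible way to turn proportional quotas into integer counts.

```python
import numpy as np

def allocate_per_stratum(strata, k):
    """Split k representatives across strata in proportion to frequency."""
    labels, counts = np.unique(strata, return_counts=True)
    quotas = k * counts / counts.sum()
    alloc = np.floor(quotas).astype(int)
    # Largest-remainder rounding: leftover slots go to the biggest fractional parts.
    remainder = k - alloc.sum()
    order = np.argsort(-(quotas - alloc), kind="stable")
    alloc[order[:remainder]] += 1
    return dict(zip(labels.tolist(), alloc.tolist()))

def stratified_indices(strata, k, rng):
    """Sample row indices uniformly within each stratum, k_g per stratum."""
    strata = np.asarray(strata)
    picks = []
    for label, k_g in allocate_per_stratum(strata, k).items():
        members = np.flatnonzero(strata == label)
        picks.append(rng.choice(members, size=min(k_g, members.size), replace=False))
    return np.concatenate(picks)

rng = np.random.default_rng(0)
strata = np.array(["a"] * 700 + ["b"] * 250 + ["c"] * 50)
alloc = allocate_per_stratum(strata, k=20)
idx = stratified_indices(strata, k=20, rng=rng)
print(alloc, idx.shape)
```

Note that with a small k, rare strata can receive zero representatives under proportional allocation, which is one reason stratification tends to work better when k is large enough.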

Experiments

The repo includes two example scripts under experiments/:

synthetic_ddc_vs_baselines.py
5D Gaussian mixture with:
- DDC vs Random vs Stratified,
- metrics: mean/covariance/correlation errors, Wasserstein-1 distance on marginals, Kolmogorov–Smirnov (KS) statistic,
- basic plots.

multimodal_2d_ring_ddc.py
2D example (3 Gaussians + ring) for visual intuition:
- shows how DDC covers multiple modes and a ring structure with few reps.

Run:

python experiments/synthetic_ddc_vs_baselines.py
python experiments/multimodal_2d_ring_ddc.py

When to use what?

DDC (fit_ddc_coreset):
Default in low-knowledge regimes (no clear strata yet). Aims to approximate the distribution better than random sampling at the same k; benchmark against the baselines on your own data to confirm.
Stratified (fit_stratified_coreset):
Preferred in high-knowledge regimes (well-defined strata aligned with the task, e.g. risk bands, products), especially when k is large enough.
Random (fit_random_coreset):
Baseline and sanity check; still useful when you want the simplest possible comparison.

Release history

0.2.0: Nov 15, 2025
0.1.3: Nov 12, 2025
0.1.2: Nov 12, 2025
0.1.1: Nov 12, 2025
0.1.0: Nov 11, 2025 (this version)

Download files

Source Distribution

dd_coresets-0.1.0.tar.gz (8.4 kB)

Uploaded Nov 11, 2025 Source

Built Distribution

dd_coresets-0.1.0-py3-none-any.whl (8.5 kB)

Uploaded Nov 11, 2025 Python 3

File details

Details for the file dd_coresets-0.1.0.tar.gz.

File metadata

Download URL: dd_coresets-0.1.0.tar.gz
Upload date: Nov 11, 2025
Size: 8.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for dd_coresets-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ff2b537759cf923b7216dec56010b1a030f2f715af73f8d855cb5fb296b5d634`
MD5	`5ec462d72ac5a6bc8bf901cf19c3d67b`
BLAKE2b-256	`7fc707ce6e63cdbfb4f24f02ec6e9947b5d452c556f8e3422f5cfc30afeb3049`

File details

Details for the file dd_coresets-0.1.0-py3-none-any.whl.

File metadata

Download URL: dd_coresets-0.1.0-py3-none-any.whl
Upload date: Nov 11, 2025
Size: 8.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for dd_coresets-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6177c6f7696cc55d778735eb475bf92fd5fc64d17e579e7da80e57830081145f`
MD5	`967bddd05ea8d89b69267c5aeecbbed6`
BLAKE2b-256	`768f7c939c4f49b59a38148a062a7c599ef284421b80cb8d5e0f46a1d56f3a8b`
