Quantify and analyze distribution shifts from samples.

DataShifts — A Toolkit for Quantifying Distribution Shifts


DataShifts is a Python package that makes it simple to measure and analyze distribution shifts from labeled samples. It works with tensor computation frameworks such as PyTorch, NumPy, and KeOps, and is designed for data science practitioners who need a principled way to answer questions such as:

  • How far has my production data shifted from the training set?

  • How do the model’s representations shift in a new domain, and are they robust to distribution shifts?

  • Are the distribution shifts mainly in the inputs (covariate shift) or in the labels (concept shift)?

  • How do these distribution shifts affect model performance?

In shift analysis, distribution shift is commonly decomposed into covariate shift ($X$ shift) and concept shift ($Y|X$ shift). The general theory below shows that the error bound grows linearly in these two shifts. With a single call, DataShifts estimates both shifts from labeled samples, providing a rigorous and general tool for quantifying and analyzing distribution shift.


Core Theory — General Learning Bound under Distribution Shifts

Let the covariate and label spaces be metric spaces $(\mathcal{X} ,\rho _{\mathcal{X}}),(\mathcal{Y} ,\rho _{\mathcal{Y}})$, and let $\mathcal{D} _{XY}^{A}, \mathcal{D} _{XY}^{B}$ be two joint distributions of covariates and labels on $\mathcal{X}\times\mathcal{Y}$. If the hypothesis $h:\mathcal{X} \rightarrow \mathcal{Y}'$ is $L _h$-Lipschitz continuous and the loss $\ell :\mathcal{Y} \times \mathcal{Y} '\rightarrow \mathbb{R}$ is separately $(L _{\ell},L _{\ell}')$-Lipschitz continuous, then:

$$ \LARGE \epsilon _B(h)\le \epsilon _A(h)+L _hL _{\ell}'S _{Cov}+L _{\ell}S _{Cpt}^{\gamma ^*} $$

where $\epsilon _A(h), \epsilon _B(h)$ are the errors of hypothesis $h$ under the distributions $\mathcal{D} _{XY}^{A}, \mathcal{D} _{XY}^{B}$, respectively. $S _{Cov}, S _{Cpt}^{\gamma ^*}$ are covariate shift (= $X$ shift, distribution shift of covariates) and concept shift (= $Y|X$ shift, distribution shift of labels conditioned on covariates) between $\mathcal{D} _{XY}^{A}, \mathcal{D} _{XY}^{B}$. Both shifts are defined in closed form via entropic optimal transport.
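
As a quick way to read the bound: for a 1-Lipschitz hypothesis ($L _h = 1$) and a loss that is separately $(1,1)$-Lipschitz, the bound specializes to

$$ \epsilon _B(h)\le \epsilon _A(h)+S _{Cov}+S _{Cpt}^{\gamma ^*} $$

so the error under $\mathcal{D} _{XY}^{B}$ exceeds the error under $\mathcal{D} _{XY}^{A}$ by at most the sum of the two shifts.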

This elegant theory shows how distribution shifts affect the error, and has the following advantages:

  • General: Because the theory assumes no particular loss or space, it applies broadly across losses and tasks, including regression, classification, and multi-label problems, as long as the covariate and label spaces admit metrics. Moreover, depending on whether the covariate space is taken to be the raw feature space or the model’s representation space, the theory can measure shifts in either the original data or the learned representations.

  • Estimable: Both covariate shift $S _{Cov}$ and concept shift $S _{Cpt}^{\gamma ^*}$ in the theory can be rigorously estimated from finite samples drawn from the two distributions—which is the core capability of this package.

For further theoretical details, please see our original paper.


Installation

Just use the following command to install the DataShifts package:

pip install datashifts

Quick Example

import torch
from datashifts import DataShifts

# Generate data from two different distributions
# (labels drawn from pure noise, as an example)
N = 10000        # Number of samples
x_dim = 200      # Feature dimensions
y_dim = 10       # Label dimensions
x_shift = 10.0   # True covariate shift
device = "cuda"  # Device

# Random unit vector scaled to the desired shift magnitude
random_directions = torch.randn(1, x_dim, device=device)
x_shift_vector = random_directions / random_directions.norm() * x_shift

# First distribution
x1 = torch.randn(N, x_dim, device=device)
y1 = torch.rand(N, y_dim, device=device)

# Second distribution: covariates translated by x_shift_vector
x2 = torch.randn(N, x_dim, device=device) + x_shift_vector
y2 = torch.rand(N, y_dim, device=device)

# Use DataShifts to quantify covariate and concept shifts
covariate_shift, concept_shift = DataShifts(x1, x2, y1, y2)
print("Covariate shift:", covariate_shift)
print("Concept shift:  ", concept_shift)

Typical output

The sample size of (x1,y1,w1) is larger than parameter 'N_max'=5000, sampling strategy is used.
The sample size of (x2,y2,w2) is larger than parameter 'N_max'=5000, sampling strategy is used.
Covariate shift: tensor(9.9608, device='cuda:0')
Concept shift:   tensor(1.2627, device='cuda:0')

Note that the estimated covariate shift (≈9.96) closely matches the true shift of 10.0 used to generate the data.

datashifts.DataShifts — Measure Covariate & Concept Shift between Distributions from Samples

datashifts.DataShifts is the core method of the DataShifts package. It estimates covariate shift and concept shift from finite labeled samples (x1, y1), (x2, y2) drawn from two distributions, with automatic sub‑sampling for scalability and GPU acceleration.

covariate_shift, concept_shift = DataShifts(
            x1, x2, y1, y2,                    # required
            weights1=None, weights2=None,      # optional importance weights
            eps=0.01,                          # entropic regularisation
            N_max=5000,                        # max points kept per distribution
            device=None,                       # "cpu", "cuda" or None (auto)
            seed=None,                         # random seed for reproducibility
            verbose=True                       # print progress messages
)

Note (temporary): For now, Euclidean distance is the only built-in metric. Custom metrics are planned.
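
For intuition about the eps (entropic regularisation) parameter, here is a minimal, self-contained sketch of Sinkhorn iterations for entropic optimal transport on a tiny cost matrix. This is illustrative pure Python under simplifying assumptions (uniform marginals, dense loops), not the package's implementation, and the name sinkhorn_plan is hypothetical:

```python
import math

def sinkhorn_plan(cost, eps, n_iter=200):
    """Entropic-OT transport plan between two uniform discrete
    distributions, computed with Sinkhorn iterations.
    cost[i][j] is the ground cost between sample i and sample j."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    a, b = [1.0 / n] * n, [1.0 / m] * m  # uniform marginals
    for _ in range(n_iter):
        # Alternate scaling so the plan's row/column sums match a and b
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Tiny example: two points per distribution, cheap pairings on the diagonal
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_plan(cost, eps=0.01)
```

With a small eps, nearly all mass stays on the zero-cost diagonal; increasing eps blurs the plan toward uniform. This is the sense in which a smaller eps is more precise but computationally harder.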

Parameters

  • x1, x2 (torch.Tensor or numpy.ndarray, required):
    Covariates of the samples drawn from the two distributions.
    Accepted shapes: (Batch_size, Num_samples, Dim_x) or (Num_samples, Dim_x).

  • y1, y2 (torch.Tensor or numpy.ndarray, required):
    Corresponding labels.
    Accepted shapes: (Batch_size, Num_samples, Dim_y) or (Num_samples, Dim_y). Must match x* in the Batch_size and Num_samples dimensions.

  • weights1, weights2 (torch.Tensor or numpy.ndarray, default None):
    Sample weights.
    Accepted shapes: (Batch_size, Num_samples) or (Num_samples). Must match x* in the Batch_size and Num_samples dimensions.

  • eps (float, default 0.01):
    Entropic regularisation for optimal transport. Smaller => more precise but slower.

  • N_max (int, default 5000):
    Upper bound on the number of samples per distribution kept for optimal transport. If N > N_max, the function resamples without replacement (weighted if weights* are provided) to speed up the solution. Larger => more precise but slower.

  • device (str, default None):
    Running device: "cpu", "cuda"/"gpu", or None (automatically use the GPU if available).

  • seed (int, default None):
    Random seed for shuffling and sampling.

  • verbose (bool, default True):
    Whether to print progress messages (sampling or automatic device choice).
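
The resampling behaviour controlled by N_max can be pictured with a small stand-alone sketch of (optionally weighted) sampling without replacement. This is illustrative pure Python, not the package's internal code; sample_without_replacement is a hypothetical name:

```python
import random

def sample_without_replacement(n, n_max, weights=None, seed=None):
    """Return at most n_max distinct indices out of range(n).
    If (positive) weights are given, higher-weight samples are
    more likely to be kept."""
    if n <= n_max:
        return list(range(n))
    rng = random.Random(seed)
    if weights is None:
        return rng.sample(range(n), n_max)
    # Weighted without-replacement via exponential keys
    # (Efraimidis-Spirakis): keep the n_max smallest keys.
    keys = [(rng.expovariate(1.0) / w, i) for i, w in enumerate(weights)]
    keys.sort()
    return [i for _, i in keys[:n_max]]

# Mirror the defaults above: keep 5000 of 10000 samples
idx = sample_without_replacement(10000, 5000, seed=0)
```

In the weighted branch, the probability of keeping a sample grows with its weight, which matches the documented behaviour of resampling "weighted if weights* provided".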

Returns

covariate_shift : torch.Tensor
concept_shift   : torch.Tensor

Returned objects are PyTorch tensors placed on the chosen device.


Licensing, Citation, Academic Use

This package is released under the MIT License. See the LICENSE file for full details.

If you use this package in a research paper, please cite our original paper:

@article{chen2025general,
  title={General and Estimable Learning Bound Unifying Covariate and Concept Shifts},
  author={Chen, Hongbo and Xia, Li Charlie},
  journal={arXiv preprint arXiv:2506.12829},
  year={2025}
}

Contributions & issues welcome at https://github.com/DataShifts/datashifts/issues

