Quantify and analyze distribution shifts from samples.




DataShifts — A Toolkit for Quantifying Distribution Shifts


DataShifts is a Python package that makes it simple to measure and analyze distribution shifts from labeled samples. It works with tensor computation frameworks such as PyTorch, NumPy, and KeOps, and is designed for data science practitioners who need a principled way to answer questions such as:

  • How far has my production data shifted from the training set?

  • How do the model’s representations shift in a new domain, and are they robust to distribution shifts?

  • Are the distribution shifts mainly in the inputs (covariate shift) or in the labels (concept shift)?

  • How do these distribution shifts affect model performance?

In analysis, distribution shift is commonly decomposed into covariate shift (shift in $X$) and concept shift (shift in $Y|X$). The general theory below shows that the error bound scales linearly in both shifts. With a single call, DataShifts estimates these two shifts from labeled samples, providing a rigorous and general tool for quantifying and analyzing distribution shift.
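To build intuition for the decomposition, consider a deliberately naive 1-D sketch. This is not the package's entropic-OT estimator, only crude proxies: the displacement of the input means stands in for covariate shift, and the change in the fitted slope of $y$ on $x$ stands in for concept shift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Distribution A (e.g. training data): x ~ N(0, 1), y = 2x + noise
xa = rng.normal(0.0, 1.0, 5000)
ya = 2.0 * xa + rng.normal(0.0, 0.1, 5000)

# Distribution B (e.g. production data): inputs shifted (covariate shift)
# and the x -> y relationship changed (concept shift)
xb = rng.normal(3.0, 1.0, 5000)
yb = 1.0 * xb + rng.normal(0.0, 0.1, 5000)

# Crude proxies for the two shifts
cov_proxy = abs(xb.mean() - xa.mean())          # input displacement
slope_a = np.polyfit(xa, ya, 1)[0]
slope_b = np.polyfit(xb, yb, 1)[0]
cpt_proxy = abs(slope_b - slope_a)              # change in y|x relationship

print(cov_proxy, cpt_proxy)  # roughly 3.0 and 1.0
```

DataShifts replaces both proxies with rigorous entropic-optimal-transport quantities that work in arbitrary metric spaces, but the two numbers answer the same two questions: how far did the inputs move, and how much did the labeling rule change.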


Core Theory — General Learning Bound under Distribution Shifts

Let the covariate and label spaces be metric spaces $(\mathcal{X},\rho_{\mathcal{X}})$ and $(\mathcal{Y},\rho_{\mathcal{Y}})$, and let $\mathcal{D}_{XY}^{A}, \mathcal{D}_{XY}^{B}$ be two joint distributions of covariates and labels on $\mathcal{X}\times\mathcal{Y}$. If the hypothesis $h:\mathcal{X}\rightarrow\mathcal{Y}'$ is $L_h$-Lipschitz continuous and the loss $\ell:\mathcal{Y}\times\mathcal{Y}'\rightarrow\mathbb{R}$ is separately $(L_{\ell},L_{\ell}')$-Lipschitz continuous, then: $$ \epsilon_B(h)\le \epsilon_A(h)+L_h L_{\ell}'\,S_{Cov}+L_{\ell}\,S_{Cpt}^{\gamma^*} $$

where $\epsilon_A(h), \epsilon_B(h)$ are the errors of hypothesis $h$ under the distributions $\mathcal{D}_{XY}^{A}$ and $\mathcal{D}_{XY}^{B}$, respectively, and $S_{Cov}, S_{Cpt}^{\gamma^*}$ are the covariate shift ($X$ shift, the distribution shift of covariates) and the concept shift ($Y|X$ shift, the distribution shift of labels conditioned on covariates) between $\mathcal{D}_{XY}^{A}$ and $\mathcal{D}_{XY}^{B}$. Both shifts are defined in closed form via entropic optimal transport.
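As background on the entropic-optimal-transport machinery behind these definitions, a minimal Sinkhorn iteration for the entropic OT plan between two empirical samples can be sketched as follows. This is an illustrative sketch only, not the package's internal solver:

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps=0.1, n_iter=1000):
    """Entropic-OT transport plan between weight vectors a, b for cost matrix C."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, (50, 2))
x2 = rng.normal(2.0, 1.0, (60, 2))
a = np.full(50, 1 / 50)                  # uniform sample weights
b = np.full(60, 1 / 60)
C = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
C = C / C.max()                          # normalise cost for numerical stability
P = sinkhorn_plan(a, b, C)

# After convergence, the plan's marginals match the sample weights
print(np.allclose(P.sum(1), a), np.allclose(P.sum(0), b, atol=1e-4))
```

The shifts in the theorem are then particular closed-form functionals of such entropic plans; see the paper for the exact definitions.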

This elegant theory shows how distribution shifts affect the error, and has the following advantages:

  • General: Because the theory assumes no particular loss or space, it applies broadly across losses and tasks, including regression, classification, and multi-label problems, as long as the covariate and label spaces admit metrics. Moreover, depending on whether the covariate space is the raw feature space or the model's representation space, the theory can measure shifts in either the original data or the learned representations.

  • Estimable: Both covariate shift $S_{Cov}$ and concept shift $S_{Cpt}^{\gamma ^*}$ in the theory can be rigorously estimated from finite samples drawn from the two distributions—which is the core capability of this package.

For further theoretical details, please see our original paper.
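To make the bound concrete, here is a worked example with illustrative numbers (the constants below are assumptions for the sake of arithmetic, not values from the paper). With $L_h = 1$, $L_{\ell} = L_{\ell}' = 1$, $S_{Cov} = 10$, and $S_{Cpt}^{\gamma^*} = 1.5$:

```latex
\epsilon_B(h) \le \epsilon_A(h) + L_h L_{\ell}' S_{Cov} + L_{\ell} S_{Cpt}^{\gamma^*}
             = \epsilon_A(h) + 1 \cdot 1 \cdot 10 + 1 \cdot 1.5
             = \epsilon_A(h) + 11.5
```

So a hypothesis with source error $0.05$ has target error at most $11.55$ under these constants; the bound is informative exactly when the measured shifts (and the Lipschitz constants) are small.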


Installation

Install the DataShifts package with:

pip install datashifts

Quick Example

import torch
from datashifts import DataShifts

# Generate data from two different distributions
# (labels drawn from pure noise, as an example)
N = 10000        # Number of samples
x_dim = 200      # Feature dimension
y_dim = 10       # Label dimension
x_shift = 10.0   # True covariate shift
device = "cuda"  # Device

random_directions = torch.randn(1, x_dim, device=device)
x_shift_vector = random_directions / (random_directions ** 2).sum().sqrt() * x_shift

# First distribution
x1 = torch.randn(N, x_dim, device=device)
y1 = torch.rand(N, y_dim, device=device)

# Second distribution
x2 = torch.randn(N, x_dim, device=device) + x_shift_vector
y2 = torch.rand(N, y_dim, device=device)

# Use DataShifts to quantify covariate and concept shifts
covariate_shift, concept_shift = DataShifts(x1, x2, y1, y2)
print("Covariate shift:", covariate_shift)
print("Concept shift:  ", concept_shift)

Typical output

The sample size of (x1,y1,w1) is larger than parameter 'N_max'=5000, sampling strategy is used.
The sample size of (x2,y2,w2) is larger than parameter 'N_max'=5000, sampling strategy is used.
Covariate shift: tensor(9.9608, device='cuda:0')
Concept shift:   tensor(1.2627, device='cuda:0')

datashifts.DataShifts — Measure Covariate & Concept Shift between Distributions from Samples

datashifts.DataShifts is the core method of the DataShifts package. It estimates covariate shift and concept shift from finite labeled samples (x1, y1), (x2, y2) drawn from two distributions, with automatic sub‑sampling for scalability and GPU acceleration.

covariate_shift, concept_shift = DataShifts(
            x1, x2, y1, y2,                    # required
            weights1=None, weights2=None,      # optional importance weights
            eps=0.01,                          # entropic regularisation
            N_max=5000,                        # max points kept per distribution
            device=None,                       # "cpu", "cuda" or None (auto)
            seed=None,                         # random seed for reproducibility
            verbose=True                       # print progress messages
)

Note (temporary): For now, Euclidean distance is the only built-in metric. Custom metrics are planned.
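Because distances are Euclidean, features on very different scales can dominate the shift estimate. One common workaround is to standardise both samples with statistics pooled across the two distributions before calling DataShifts. The helper below is a generic preprocessing sketch, not part of the package API:

```python
import numpy as np

def pooled_standardise(x1, x2):
    """Z-score both samples using mean/std pooled over the two samples."""
    pooled = np.concatenate([x1, x2], axis=0)
    mean = pooled.mean(axis=0)
    std = pooled.std(axis=0) + 1e-8   # guard against constant features
    return (x1 - mean) / std, (x2 - mean) / std

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, [1.0, 100.0], (1000, 2))   # second feature dominates
x2 = rng.normal(1.0, [1.0, 100.0], (1000, 2))
z1, z2 = pooled_standardise(x1, x2)
print(z1.shape, z2.shape)
```

Pooling the statistics (rather than standardising each sample separately) matters: per-sample standardisation would erase part of the very shift you are trying to measure.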

Parameters

  • x1, x2 (torch.Tensor or numpy.ndarray, required): Covariates of the samples drawn from the two distributions. Accepted shapes: (Batch_size, Num_samples, Dim_x) or (Num_samples, Dim_x).
  • y1, y2 (torch.Tensor or numpy.ndarray, required): Corresponding labels. Accepted shapes: (Batch_size, Num_samples, Dim_y) or (Num_samples, Dim_y). Must match x* in the Batch_size and Num_samples dimensions.
  • weights1, weights2 (torch.Tensor or numpy.ndarray, default None): Sample weights. Accepted shapes: (Batch_size, Num_samples) or (Num_samples). Must match x* in the Batch_size and Num_samples dimensions.
  • eps (float, default 0.01): Entropic regularisation for optimal transport. Smaller => more precise but slower.
  • N_max (int, default 5000): Upper bound on samples per distribution kept for optimal transport. If N > N_max, the function resamples without replacement (weighted if weights* are provided) to speed up the solve. Larger => more precise but slower.
  • device (str, default None): Running device: "cpu", "cuda"/"gpu", or None (automatically use GPU if available).
  • seed (int, default None): Random seed for shuffling and sampling.
  • verbose (bool, default True): Whether to print progress messages (sampling or automatic device choice).
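The N_max resampling described above can be pictured with a small NumPy sketch. This is a plausible illustration of weighted sampling without replacement, not the package's actual implementation:

```python
import numpy as np

def subsample(x, y, weights, n_max, seed=None):
    """Keep at most n_max points, sampling without replacement by weight."""
    n = x.shape[0]
    if n <= n_max:
        return x, y
    rng = np.random.default_rng(seed)
    p = weights / weights.sum()                       # normalise to probabilities
    idx = rng.choice(n, size=n_max, replace=False, p=p)
    return x[idx], y[idx]

rng = np.random.default_rng(0)
x = rng.normal(size=(10000, 5))
y = rng.normal(size=(10000, 2))
w = rng.random(10000)
xs, ys = subsample(x, y, w, n_max=5000, seed=0)
print(xs.shape, ys.shape)  # (5000, 5) (5000, 2)
```

Fixing `seed` makes the subsampling (and hence the shift estimates on large samples) reproducible across runs.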

Returns

covariate_shift : torch.Tensor
concept_shift   : torch.Tensor

Returned objects are PyTorch tensors placed on the chosen device.


Licensing, Citation, Academic Use

This package is released under the MIT License. See the LICENSE file for full details.

If you use this package in a research paper, please cite our original paper:

@article{chen2025general,
  title={General and Estimable Learning Bound Unifying Covariate and Concept Shifts},
  author={Chen, Hongbo and Xia, Li Charlie},
  journal={arXiv preprint arXiv:2506.12829},
  year={2025}
}

Contributions & issues welcome at https://github.com/DataShifts/datashifts/issues
