Vector Quantile Regression
Project description
vqr
: Fast Nonlinear Vector Quantile Regression
This package provides the first scalable implementation of Vector Quantile Regression (VQR), ready for large real-world datasets. In addition, it provides a powerful extension which makes VQR non-linear in the covariates, via a learnable transformation. The package is easy to use via a familiar sklearn
-style API.
Refer to our paper[^VQR2022] for further details about nonlinear VQR, and please cite our work if you use this package:
@article{rosenberg2022fast,
title={Fast Nonlinear Vector Quantile Regression},
author={Rosenberg, Aviv A and Vedula, Sanketh and Romano, Yaniv and Bronstein, Alex M},
journal={arXiv preprint arXiv:2205.14977},
year={2022}
}
Brief background and intuition
Quantile regression[^Koenker1978] (QR) is a well-known method which estimates a
conditional quantile of a target variable
Vector quantiles extend the notion of quantiles to high-dimensional variables [^Carlier2016].
Vector quantile regression (VQR) is the estimation of the conditional vector quantile function
VQR is a highly general approach, as it allows for assumption-free estimation of the conditional vector quantile function, which is a fundamental quantity that fully represents the distribuion of
Below is an illustration of vector quantiles of a
- Data is sampled uniformly from a 2d star-shaped region (middle, gray dots).
- Vector quantiles are overlaid on their data distribution (middle, colored dots).
- The vector quantile function (VQF)
is a mapping, which satisfies:- Strong representation:
where . - Co-monotonicity:
.
- Strong representation:
- Different colors correspond to
-contours, each containing percent of the data, a generalization of confidence intervals for vector-valued variables.- For example, for
, roughly 92% of the data is contained within the contour. - The shape of the distribution is correctly modelled, without any distributional assuptions.
- For example, for
- For
and , the components of the VQF are depicted as surfaces (left, right) with the corresponding vector quantiles overlaid.- On
, increasing for a fixed produces a monotonically increasing curve. - This corresponds to a quantile function for
given that is at a value corresponding to its -th quantile (and vice versa for ).
- On
Results and Comparisons
Non-linear VQR
Nonlinear VQR (NL-VQR) outperformes linear VQR and Conditional VAE (C-VAE)[^Feldman2021] on challenging distribution estimation tasks. The metric shown is KDE-L1 distribution distance (lower is better). Comparisons on two synthetic datasets are shown belows.
Conditional banana: In this dataset both the mean of the distribution and its shape change as a nonlinear function of the covariates
Rotating stars: Features a nonlinear relationship between the covariates and the quantile function (a rotation matrix), where the conditional mean remains the same for any
Non-linear Scalar QR
The Nonlinear VQR implementation in this package can be used for performing scalar, i.e.
Synthetic glasses: A bi-modal distribution in which the modes' distance depends on
Features
- Vector quantile estimation (VQE): Given samples of a vector-valued random variable
, estimate its vector quantile function . - Vector quantile regression (VQR): Given samples from a joint distribution of
where contains covariates ("feature vector") and is the target variable, estimate the conditional vector quantile function (CVQF). - Vector monotone rearrangement (VMR): an optional refinement procedure for estimated CVQFs which guarantees that the output is a valid quantile function, with no co-monotonicity violations.
- Support for arbitrary learnable non-linear functions of the covariates
, where the parameters are fitted jointly with the VQR model. Can provide anypytorch
model as the learnable transformation. - Sampling: After fitting VQR, new samples can be generated from the conditional distribution. Thus VQR can be used as a generative model which can be fitted on samples, without making any distributional assumptions.
- Calculating quantile
-contours: the equivalent of -confidence regions for high-dimensional data. - Works for any
. Specifically, for , provides an incredibly fast method for performing nonlinear scalar quantile regression which estimates multiple quantiles of the target variable simultaneously. - Multiple solvers supported as backends. The VQE/VQR API can work with different solver implementations which can provide different benefits and tradeoffs. Easy to integrate new solvers.
- GPU support.
- Coverage and area calculation: measures whether samples are within some
-contour of the fitted quantile function, and also the area of these contours. - Plotting: Basic capabilities for plotting 2d and 3d quantile functions.
Installation
Simply install the vqr
package via pip
:
pip install vqr
To run the example notebooks, please clone this repo and install the supplied conda
environment.
conda env update -f environment.yml -n vqr
conda activate vqr
Usage examples
Below is a minimal usage example for VQR, demonstrating fitting linear VQR, sampling from the conditional distribution, and calculating coverage at a specified
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from vqr import VectorQuantileRegressor
from vqr.solvers.regularized_lse import RegularizedDualVQRSolver
N, d, k, T = 5000, 2, 1, 20
N_test = N // 10
seed = 42
alpha = 0.05
# Generate some data (or load from elsewhere).
X, Y = make_regression(
n_samples=N, n_features=k, n_targets=d, noise=0.1, random_state=seed
)
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=N_test, shuffle=True, random_state=seed
)
# Create the VQR solver and regressor.
vqr_solver = RegularizedDualVQRSolver(
verbose=True, epsilon=1e-2, num_epochs=1000, lr=0.9
)
vqr = VectorQuantileRegressor(n_levels=T, solver=vqr_solver)
# Fit the model on the data.
vqr.fit(X_train, Y_train)
# Marginal coverage calculation: for each test point, calculate the
# conditional quantiles given x, and check whether the corresponding y is covered
# in the alpha-contour.
cov_test = np.mean(
[vqr.coverage(Y_test[[i]], X_test[[i]], alpha=alpha) for i in range(N_test)]
)
print(f"{cov_test=}")
# Sample from the fitted conditional distribution, given a specific x.
Y_sampled = vqr.sample(n=100, x=X_test[0])
# Calculate conditional coverage given a sample x.
cov_sampled = vqr.coverage(Y_sampled, x=X_test[0], alpha=alpha)
print(f"{cov_sampled=}")
For further examples, please fefer to the example notebooks in the notebooks/
folder of this repo.
References
[^Koenker1978]: Koenker, R. and Bassett Jr, G., 1978. Regression quantiles. Econometrica: journal of the Econometric Society, pp.33-50. [^Carlier2016]: Carlier, G., Chernozhukov, V. and Galichon, A., 2016. Vector quantile regression: an optimal transport approach. The Annals of Statistics, 44(3), pp.1165-1192. [^VQR2022]: Rosenberg, A.A., Vedula, S., Romano, Y. and Bronstein, A.M., 2022. Fast Nonlinear Vector Quantile Regression. arXiv preprint arXiv:2205.14977. [^Feldman2021]: Feldman, S., Bates, S. and Romano, Y., 2021. Calibrated multiple-output quantile regression with representation learning. arXiv preprint arXiv:2110.00816.