Quantification using Kernel Embedding and Random Fourier Features. On GPU and CPU.
KernelQuantification
uses Kernel Mean Embedding to estimate the proportions of a mixture of distributions.
Overview
KernelQuantification
is a method for quantification learning, i.e. estimating the proportions of labels in a target sample, using Kernel Mean Embedding and Random Fourier Features. This Python implementation can run on the CPU, or on the GPU if CUDA is installed.
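Concretely, the target distribution is modeled as a mixture of the class-conditional source distributions, and the mixture weights are recovered by matching kernel mean embeddings. A sketch of the idea in our own notation (the article's exact criterion may differ):

$$q = \sum_{i=1}^{c} \pi_i \, p_i, \qquad \hat{\pi} = \underset{\pi \in \Delta_c}{\arg\min} \; \Big\| \mu_q - \sum_{i=1}^{c} \pi_i \, \mu_{p_i} \Big\|_{\mathcal{H}}^2,$$

where $\mu_p$ denotes the kernel mean embedding of a distribution $p$ and $\Delta_c$ is the probability simplex.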
The methods implemented in this package are detailed in the following article:
ARTICLE NOT YET AVAILABLE
The code associated with the article can be found here:
CODE NOT YET AVAILABLE
Installation
Use the package manager pip to install KernelQuantifier from PyPI.
pip install kernelquantifier
You will need CUDA to run the algorithms on your GPU.
Note that KernelQuantifier can be used on a CPU; however, the Generative KernelQuantifier requires a GPU to be efficient.
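Before running the GPU algorithms, you can check that PyTorch sees a CUDA device:

import torch

# True if a CUDA-capable GPU is visible to PyTorch.
print(torch.cuda.is_available())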
Example
We test our methods on a toy dataset: the Iris flower dataset included in the scikit-learn package.
All the code can be found in ./tests/example.ipynb.
Package
import kernelquantifier as kq
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
Data
We load the data and apply a PCA to it for plotting. Code taken from the scikit-learn documentation.
iris = load_iris()
X = iris.data
y = iris.target
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
colors = ["navy", "turquoise", "darkorange"]
plt.figure(figsize=(8, 8))
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(
        X_pca[y == i, 0],
        X_pca[y == i, 1],
        color=color,
        lw=2,
        label=target_name,
    )
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.axis([-4, 4, -1.5, 1.5])
plt.savefig("figures/iris_pca.png")
plt.show()
We split the dataset into two parts: one will be the Source and the other the Target. We do not have access to the labels of the Target dataset.
np.random.seed(123)
idx = np.random.choice(np.arange(X.shape[0]), 75, replace=False)
idx_inv = []
for i in range(X.shape[0]):
    if i not in idx:
        idx_inv.append(i)
# Target
Target = X[idx]
Target_Label = y[idx]
_, counts = np.unique(Target_Label, return_counts=True)
pi = counts/Target_Label.shape[0]
# Source
Source = X[idx_inv]
Source_Label = y[idx_inv]
# Plot
Source_pca = pca.transform(Source)
Target_pca = pca.transform(Target)
colors = ["navy", "turquoise", "darkorange"]
plt.figure(figsize=(8, 8))
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(
        Source_pca[Source_Label == i, 0],
        Source_pca[Source_Label == i, 1],
        color=color,
        lw=2,
        label=target_name,
    )
plt.scatter(
    Target_pca[:, 0],
    Target_pca[:, 1],
    color="gray",
    lw=1,
    label="Target",
)
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.axis([-4, 4, -1.5, 1.5])
plt.show()
Preparing the data
We convert the Target into a torch.Tensor.
Target = torch.from_numpy(Target)
We have implemented a class kernelquantifier.LabelledCollection
to manage the Source data. This is essentially a list of tensors, where each index corresponds to one class, with a few additional methods (see the documentation for more information).
To create a kernelquantifier.LabelledCollection
we pass as arguments a function that transforms the data into a list, together with that function's arguments (see the documentation for more information).\
For instance, in our case the Source X
comes with labels y
. We create a function to_labelledCollection
that takes the data and the labels as arguments and returns a list where index 0 contains the points belonging to class 0, and so on.
Side note on the device:
The computation can be done either on the CPU or on the GPU. To run on the GPU you will need CUDA installed.
The computation is performed on the same device as the Source (if the Target and the Source are not on the same device, the Target will be moved).
We have implemented two functions: kernelquantifier.choose_device
, which selects the GPU if available and the CPU otherwise, and kernelquantifier.cuda_info
, which prints information about the device.
device = kq.choose_device(verbose=True)
# device = torch.device("cpu")
kq.cuda_info(device)
outputs:
Running on the GPU
Using device: cuda:0
NVIDIA RTX A2000 Laptop GPU
Memory Usage:
Allocated: 0.0 GB
Cached: 0.0 GB
11.3
def to_labelledCollection(X, y):
    return [torch.from_numpy(X[y == i, :]).to(device) for i in range(3)]
Source = kq.LabelledCollection(to_labelledCollection, Source, Source_Label)
KernelQuantifier
There are two versions of the Kernel Quantifier algorithm: one that uses Random Fourier Features and one that does not. See the notebook computation_times.ipynb
or the article for a comparison of the two.
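For intuition: random Fourier features replace exact kernel evaluations with an explicit finite-dimensional feature map, so the full Gram matrix never has to be formed. A minimal sketch for the Gaussian kernel (illustrative only; the package's internal implementation may differ):

import torch

def gaussian_rff(X, sigma, number_rff, seed=123):
    # Feature map phi with phi(x) @ phi(y) approximating the Gaussian
    # kernel exp(-||x - y||^2 / (2 * sigma^2)).
    gen = torch.Generator().manual_seed(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density N(0, I / sigma^2).
    omega = (torch.randn(d, number_rff, generator=gen) / sigma).to(X.dtype)
    b = (2 * torch.pi * torch.rand(number_rff, generator=gen)).to(X.dtype)
    return (2.0 / number_rff) ** 0.5 * torch.cos(X @ omega + b)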
With RandomFourierFeatures
To use the KernelQuantifier with RFF, we call the class kernelquantifier.KernelQuantifierRFF
(see the docs for more information).\
The init has two arguments: kernel_type (str) and seed (int). The seed ensures the reproducibility of the experiments, while kernel_type specifies the kernel we want to use. Use the function kernelquantifier.available_kernel_rff
to see the list of kernels currently available with RFF.\
Then we fit our kernel. The method KernelQuantifierRFF.fit
stores the RFF given a sigma and a number of RFF. If sigma is a range rather than a float, the method computes the optimal sigma according to our theoretical criterion (see the article).
quantifier = kq.KernelQuantifierRFF(kernel_type="gaussian", seed=123)
quantifier.fit(Source, sigma=[0.1, 2], verbose=True, number_rff=1000)
outputs:
Sigma = 0.9530612244897959
We can now quantify using the method KernelQuantifierRFF.quantify
. This method takes as arguments the Source (as a LabelledCollection) and the Target. It returns the estimated proportions as a numpy.array
. We print the estimated proportions pi_hat
and the true proportions computed earlier, pi
. Finally, we use kernelquantifier.KL_divergence
to compute the KL divergence, multiplied by 100, between the two vectors.
pi_hat = quantifier.quantify(Source, Target)
print(f"pi_hat = {pi_hat} \npi = {pi}")
print(f"Error = {kq.KL_divergence(pi_hat, pi)}")
outputs:
pi_hat = [0.36557279 0.31453992 0.31988729]
pi = [0.36 0.26666667 0.37333333]
Error = 0.8126347074855538
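The reported error is consistent with 100 times KL(pi_hat || pi); a minimal reimplementation under that assumption (the package's exact definition may differ):

import numpy as np

def kl_divergence_100(p, q):
    # 100 * sum_i p_i * log(p_i / q_i); assumes strictly positive entries.
    p, q = np.asarray(p), np.asarray(q)
    return 100 * np.sum(p * np.log(p / q))

# Reproduces the error above up to rounding of the printed proportions:
# kl_divergence_100(pi_hat, pi)  ->  ~0.8126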
Without RandomFourierFeatures
The version without RFF is similar to the one with RFF.\
We create an instance of the class kernelquantifier.KernelQuantifier
. Use the function kernelquantifier.available_kernel
to see the list of kernels currently available without RFF. We then use the method KernelQuantifier.fit
without specifying the number of RFF.
quantifier = kq.KernelQuantifier(kernel_type="gaussian", seed=123)
quantifier.fit(Source, sigma=[0.1, 2], verbose=True)
We then quantify using the method KernelQuantifier.quantify
with the same input arguments as before.
# Source.data_ is the underlying list of per-class tensors.
assert isinstance(Source.data_, list)
pi_hat = quantifier.quantify(Source.data_, Target)  # alternatively: quantifier.quantify(Source, Target)
print(f"pi_hat = {pi_hat} \npi = {pi}")
print(f"Error = {kq.KL_divergence(pi_hat, pi)}")
outputs:
pi_hat = [0.36641991 0.3124038 0.32117629]
pi = [0.36 0.26666667 0.37333333]
Error = 0.759835032840004
Generative KernelQuantifier
Generative KernelQuantifier was designed to handle the Generalised Label Shift hypothesis. That hypothesis does not hold here, so GKQuant is not needed; nevertheless, we will use it as an example.\
We create an instance of class kernelquantifier.GenerativeKernelQuantifier
. We specify the kernel_type (use kernelquantifier.available_kernel_rff
to see the list of available kernels) and the generator_type (use kernelquantifier.available_generator
to see the list of available generators).\
Three generators are currently available; a sketch follows the list:
- sharelinear: $g_i(x) = A x + b_i$
- independantlinear: $g_i(x) = A_i x + b_i$
- translation: $g_i(x) = x + b_i$

Here $A$ and the $A_i$ are diagonal matrices.
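As an illustration, a share-linear generator with a diagonal $A$ shared across classes and one translation $b_i$ per class could look like the following minimal sketch (a hypothetical module, not the package's actual implementation):

import torch

class ShareLinearGenerator(torch.nn.Module):
    """Sketch of g_i(x) = A x + b_i with a diagonal A shared across classes."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.a = torch.nn.Parameter(torch.ones(dim))              # diagonal of A
        self.b = torch.nn.Parameter(torch.zeros(n_classes, dim))  # one b_i per class

    def forward(self, x, i):
        # Multiplying elementwise by the vector a applies the diagonal matrix A.
        return self.a * x + self.b[i]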
We fit it the same way we fitted the kernelquantifier.KernelQuantifier
.
quantifier = kq.GenerativeKernelQuantifier(
    kernel_type="gaussian",
    generator_type="sharelinear",
    seed=123)
quantifier.fit(Source, sigma=1., verbose=True, number_rff=1000)
The method quantify has several parameters (see the pseudo-code in the article and the docs for more details).
parameter = {"n_epoch": 20,
"n_epochGM": 10,
"lr": 0.001,
"verbose": True,
"initial_prop" : None}
pi_hat = quantifier.quantify(Source, Target, **parameter)
print(f"pi_hat = {pi_hat} \npi = {pi}")
print(f"Error = {kq.KL_divergence(pi_hat, pi)}")
outputs:
pi_hat = [0.35858863 0.30174921 0.33966216]
pi = [0.36 0.26666667 0.37333333]
Error = 0.3781702910014444
Contributing
Pull requests are welcome.
License
KernelQuantifier is released under the MIT license. See the LICENCE file.