biquality-learn is a library à la scikit-learn for Biquality Learning.


biquality-learn

biquality-learn (or bqlearn in short) is a library à la scikit-learn for Biquality Learning.

Biquality Learning

Biquality Learning is a machine learning framework to train classifiers on Biquality Data, composed of an untrusted and a trusted dataset:

The trusted dataset contains trustworthy samples with clean labels and proper feature distribution.
The untrusted dataset contains potentially corrupted samples from label noise or covariate shift (distribution shift).
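
As a minimal sketch of this layout, the two datasets can be stacked into one array alongside a per-sample quality indicator. The toy arrays and variable names below are illustrative assumptions; the encoding (1 for trusted, 0 for untrusted) follows the Quick Start.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 5 trusted and 20 untrusted samples with 3 features.
X_trusted = rng.normal(size=(5, 3))
y_trusted = rng.integers(0, 2, size=5)
X_untrusted = rng.normal(size=(20, 3))
y_untrusted = rng.integers(0, 2, size=20)  # in practice these labels may be corrupted

# Stack both datasets and record which rows can be trusted.
X = np.vstack([X_trusted, X_untrusted])
y = np.concatenate([y_trusted, y_untrusted])
sample_quality = np.concatenate(
    [np.ones(len(y_trusted)), np.zeros(len(y_untrusted))]
)

print(X.shape, y.shape, int(sample_quality.sum()))
```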

biquality-learn aims to make well-known and proven biquality learning algorithms accessible to everyone, and to help researchers run reproducible experiments on biquality data.

Install

biquality-learn requires the following dependencies:

numpy>=1.17.3
scipy>=1.5.0
scikit-learn>=1.2.0
scs>=3.2.2
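
To check an existing environment against the minimum versions listed above, a small script along the following lines may help. It is a sketch using only the standard library; the helper names `parse` and `check_requirements` are hypothetical, and the version comparison is deliberately rough (the `packaging` library offers a more robust one).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency list.
REQUIREMENTS = {
    "numpy": "1.17.3",
    "scipy": "1.5.0",
    "scikit-learn": "1.2.0",
    "scs": "3.2.2",
}


def parse(v):
    """Turn '1.2.0' or '1.2.0rc1' into (1, 2, 0) for a rough comparison."""
    parts = []
    for piece in v.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad '1.2' to (1, 2, 0)
        parts.append(0)
    return tuple(parts)


def check_requirements(requirements):
    """Map each package name to 'missing', 'too old (x)' or 'ok (x)'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse(installed) >= parse(minimum)
        report[name] = f"{'ok' if ok else 'too old'} ({installed})"
    return report


for name, status in check_requirements(REQUIREMENTS).items():
    print(f"{name}: {status}")
```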

The package is available on PyPI. To install biquality-learn, run the following command:

pip install biquality-learn

A development version is available on TestPyPI:

pip install --index-url https://test.pypi.org/simple/ biquality-learn

Quick Start

For a quick example, we are going to train one of the available biquality classifiers, KKMM, on the digits dataset with synthetic asymmetric label noise.

Loading Data

First, we must load the dataset with scikit-learn and split it into a trusted and untrusted dataset.

from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_digits(return_X_y=True)

trusted, untrusted = next(StratifiedShuffleSplit(train_size=0.1).split(X, y))
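
To see what this split returns, the sketch below repeats it with a fixed seed and inspects the result. `n_splits=1` and `random_state=0` are additions for reproducibility and are not part of the original call.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_digits(return_X_y=True)

# random_state is added here so the split is the same on every run.
splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.1, random_state=0)
trusted, untrusted = next(splitter.split(X, y))

# trusted and untrusted are disjoint arrays of row indices into X.
print("trusted:", len(trusted), "untrusted:", len(untrusted))
print("overlap:", len(np.intersect1d(trusted, untrusted)))

# Stratification keeps the class proportions close in both parts.
print(np.round(np.bincount(y[trusted], minlength=10) / len(trusted), 2))
print(np.round(np.bincount(y[untrusted], minlength=10) / len(untrusted), 2))
```

Roughly 10% of the rows end up in `trusted`, and every digit class is represented in both parts.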

Simulating Label Noise

Then we generate label noise on the untrusted dataset.

from bqlearn.corruption import make_label_noise

y[untrusted] = make_label_noise(y[untrusted], "flip", noise_ratio=0.8)
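
For intuition about asymmetric label noise, here is a NumPy-only sketch of one common scheme, pair flipping, in which a corrupted label of class k becomes class k + 1 (modulo the number of classes). This is an illustration under stated assumptions, not bqlearn's implementation: `flip_labels` is a hypothetical helper, and the class mapping used by `make_label_noise` may differ.

```python
import numpy as np


def flip_labels(y, noise_ratio, n_classes, seed=0):
    """With probability noise_ratio, replace class k by (k + 1) % n_classes."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    flip = rng.random(y.shape[0]) < noise_ratio  # which samples get corrupted
    return np.where(flip, (y + 1) % n_classes, y)


y_clean = np.repeat(np.arange(10), 100)  # 1000 toy labels over 10 classes
y_noisy = flip_labels(y_clean, noise_ratio=0.8, n_classes=10)

# Keeping a clean copy makes it possible to measure the realised corruption rate.
print("fraction of corrupted labels:", np.mean(y_noisy != y_clean))
```

Note that the quick start overwrites `y[untrusted]` in place; keeping a copy of the original labels, as done here, is useful when the noise level needs to be checked afterwards.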

Training Biquality Classifier

Finally, we train KKMM on the biquality dataset by providing the sample_quality metadata, which indicates whether each sample is trusted (1) or untrusted (0).

import numpy as np
from sklearn.linear_model import LogisticRegression
from bqlearn.density_ratio import KKMM

bqclf = KKMM(LogisticRegression(), kernel="rbf")

# sample_quality is 1 for trusted samples and 0 for untrusted ones
sample_quality = np.ones(X.shape[0])
sample_quality[untrusted] = 0

bqclf.fit(X, y, sample_quality=sample_quality)
bqclf.predict(X)
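
The example above predicts on the samples used for training. To judge a biquality classifier, it can help to hold out clean test data and compare against simple reference points. The sketch below sets up two such baselines with scikit-learn only; it does not call bqlearn. The held-out split, the `random_state` values, the `max_iter` setting, and the hand-written pair-flip corruption (a stand-in for `make_label_noise`) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

X, y = load_digits(return_X_y=True)

# Hold out clean test data before any corruption is applied.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.1, random_state=0)
trusted, untrusted = next(splitter.split(X_train, y_train))

# Stand-in corruption (assumption): move 80% of untrusted labels to the next class.
rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.random(len(untrusted)) < 0.8
y_noisy[untrusted] = np.where(flip, (y_train[untrusted] + 1) % 10, y_train[untrusted])

# Baseline 1: use only the small trusted set.
trusted_only = LogisticRegression(max_iter=2000).fit(X_train[trusted], y_noisy[trusted])
# Baseline 2: use everything and ignore sample quality.
all_data = LogisticRegression(max_iter=2000).fit(X_train, y_noisy)

acc_trusted = trusted_only.score(X_test, y_test)
acc_all = all_data.score(X_test, y_test)
print("trusted only:", acc_trusted)
print("all data, quality ignored:", acc_all)
```

A biquality classifier fitted on `X_train`, `y_noisy` and the matching `sample_quality` vector could then be scored on the same `X_test`, `y_test` for comparison.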

Citation

If you use biquality-learn in your research, please consider citing us:

@misc{todo,
      title={biquality-learn: a Python library for Biquality Learning}, 
      author={Pierre Nodet and Vincent Lemaire and Alexis Bondu and Antoine Cornuéjols},
      year={2023},
      eprint={todo},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgment

This work has been funded by Orange Labs.


