


biquality-learn


biquality-learn (bqlearn for short) is a library à la scikit-learn for Biquality Learning.

Biquality Learning

Biquality Learning is a machine learning framework to train classifiers on Biquality Data, where the dataset is split into a trusted and an untrusted part:

  • The trusted dataset contains trustworthy samples with clean labels and proper feature distribution.
  • The untrusted dataset contains samples that may be corrupted by label noise or by covariate shift (distribution shift).

biquality-learn aims to make well-known and proven biquality learning algorithms accessible and easy to use for everyone, and to let researchers run reproducible experiments on biquality data.

Install

biquality-learn depends on the following packages:

  • numpy>=1.17.3
  • scipy>=1.5.0
  • scikit-learn>=1.3.0
  • scs>=3.2.2

The package is available on PyPI. To install biquality-learn, run the following command:

pip install biquality-learn

A dev version is available on TestPyPI:

pip install --index-url https://test.pypi.org/simple/ biquality-learn
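
After installation, the versions of the dependencies can be compared with the minimums listed above. The following is a minimal sketch that relies only on the standard library; the helper name installed_versions is hypothetical and is not part of biquality-learn.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency list above.
MINIMUM_VERSIONS = {
    "numpy": "1.17.3",
    "scipy": "1.5.0",
    "scikit-learn": "1.3.0",
    "scs": "3.2.2",
}


def installed_versions(names):
    """Hypothetical helper: map each distribution name to its installed
    version string, or to None when the distribution is not installed."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print the installed version next to the required minimum for each package.
for name, installed in installed_versions(MINIMUM_VERSIONS).items():
    print(f"{name}: installed={installed}, required>={MINIMUM_VERSIONS[name]}")
```

The sketch only reports versions and leaves the comparison to the reader, since robust version comparison is usually delegated to a dedicated package rather than to string handling.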

Quick Start

For a quick example, we are going to train one of the available biquality classifiers, KKMM, on the digits dataset with synthetic asymmetric label noise.

Loading Data

First, we load the dataset with scikit-learn and split it into a trusted part (10% of the samples) and an untrusted part.

from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_digits(return_X_y=True)

# Stratified split of the sample indices: 10% trusted, the rest untrusted.
trusted, untrusted = next(StratifiedShuffleSplit(train_size=0.1).split(X, y))

Simulating Label Noise

Then we corrupt the labels of the untrusted dataset with "flip" label noise at a noise ratio of 0.8.

from bqlearn.corruption import make_label_noise

y[untrusted] = make_label_noise(y[untrusted], "flip", noise_ratio=0.8)
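
To give an intuition of what asymmetric label noise looks like, here is a minimal NumPy sketch. It is not the implementation behind make_label_noise: the helper flip_labels, its seed parameter, and the choice of flipping each class to the next one are assumptions made for illustration only.

```python
import numpy as np


def flip_labels(y, noise_ratio, n_classes, seed=0):
    """Hypothetical helper: with probability `noise_ratio`, replace a label c
    by the next class (c + 1) % n_classes; otherwise keep it unchanged."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    # Draw one uniform number per sample to decide which labels get flipped.
    flipped = rng.random(y.shape[0]) < noise_ratio
    return np.where(flipped, (y + 1) % n_classes, y)


# Toy labels: 10 balanced classes, 200 samples each.
y_clean = np.repeat(np.arange(10), 200)
y_noisy = flip_labels(y_clean, noise_ratio=0.8, n_classes=10)

# Fraction of labels that differ from the clean ones, close to noise_ratio.
changed = y_noisy != y_clean
print(f"fraction of corrupted labels: {changed.mean():.2f}")
```

Because every corrupted label of a class lands in one specific other class, the noise is asymmetric: the errors are concentrated rather than spread uniformly over all classes.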

Training Biquality Classifier

Finally, we train KKMM on the biquality dataset by providing the sample_quality metadata, indicating if a sample is trusted or untrusted.

import numpy as np
from sklearn.linear_model import LogisticRegression
from bqlearn.density_ratio import KKMM

bqclf = KKMM(LogisticRegression(), kernel="rbf")

# sample_quality is 1 for trusted samples and 0 for untrusted samples.
sample_quality = np.ones(X.shape[0])
sample_quality[untrusted] = 0

bqclf.fit(X, y, sample_quality=sample_quality)
bqclf.predict(X)
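
The quick start stops at predict. One possible way to inspect the predictions is to compute the accuracy separately on trusted and untrusted samples. The sketch below uses toy arrays in place of y, bqclf.predict(X) and sample_quality; the helper accuracy_by_quality is hypothetical and is not part of biquality-learn.

```python
import numpy as np


def accuracy_by_quality(y_true, y_pred, sample_quality):
    """Hypothetical helper: accuracy computed separately on trusted
    (sample_quality == 1) and untrusted (sample_quality == 0) samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    trusted = np.asarray(sample_quality) == 1
    correct = y_true == y_pred
    return {
        "trusted": float(correct[trusted].mean()),
        "untrusted": float(correct[~trusted].mean()),
    }


# Toy arrays standing in for y, bqclf.predict(X) and sample_quality.
y_true = np.array([0, 1, 1, 0, 2, 2])
y_pred = np.array([0, 1, 0, 0, 2, 1])
quality = np.array([1, 1, 0, 0, 0, 0])

scores = accuracy_by_quality(y_true, y_pred, quality)
print(scores)
```

In the quick start, y[untrusted] holds the corrupted labels, so a meaningful score on the untrusted part would require keeping a copy of the clean labels before applying the noise. Note also that the trusted samples were used for training, so their score is not an estimate of generalization.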

Citation

If you use biquality-learn in your research, please consider citing us:

@misc{nodet2023biqualitylearn,
      title={biquality-learn: a Python library for Biquality Learning}, 
      author={Pierre Nodet and Vincent Lemaire and Alexis Bondu and Antoine Cornuéjols},
      year={2023},
      eprint={2308.09643},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgment

This work has been funded by Orange Labs.

