
A typed Active Learning Library


Active Learning and Technology-Assisted Review library for Python



python-allib is a library that enables efficient data annotation with Active Learning on various types of datasets. Through the library instancelib, it supports various machine learning algorithms and instance types. Besides canonical Active Learning, the library offers Technology-Assisted Review (TAR) methods, which make High-Recall Information Retrieval tasks more efficient.

© Michiel Bron, 2024

Quick tour of Technology-Assisted Review simulation

Load dataset

Load the dataset into an instancelib environment.

# Some imports
from pathlib import Path
from allib.benchmarking.datasets import TarDataset, DatasetType

POS = "Relevant"
NEG = "Irrelevant"
# Load a dataset in SYNERGY/ASREVIEW format
dataset_description = TarDataset(
  DatasetType.REVIEW, 
  Path("./allib/tests/testdataset.csv"))

# Get an instancelib Environment object
ds = dataset_description.env

ds
Environment(dataset=InstanceProvider(length=2019), 
   labels=LabelProvider(labelset=frozenset({'Relevant', 'Irrelevant'}), 
   length=0, 
   statistics={'Relevant': 0, 'Irrelevant': 0}), 
   named_providers={}, 
   length=2019, 
   typeinfo=TypeInfo(identifier=int, data=str, vector=NoneType, representation=str)) 

The ds object is loaded in TAR simulation mode. This means that, as at the start of a real review process, there is no labeled data yet, which is visible in the statistics of the ds object. However, because this is simulation mode, a ground truth is available. It can be accessed as follows:

ds.truth
LabelProvider(labelset=frozenset({'Relevant', 'Irrelevant'}), 
   length=2019, 
   statistics={'Relevant': 101, 'Irrelevant': 1918})

In Active Learning, we deal with a partially labeled dataset. Two InstanceProvider objects inside the ds object track the label status:

print(f"Unlabeled: {ds.unlabeled}, Labeled: {ds.labeled}")
Unlabeled: InstanceProvider(length=2019), Labeled: InstanceProvider(length=0)

Basic operations

The ds object supports all instancelib operations, for example, splitting the dataset into a train and a test set.

train, test = ds.train_test_split(ds.dataset, train_size=0.70)
print(f"Train: {train}, Test: {test}")
Train: InstanceProvider(length=1413), Test: InstanceProvider(length=606)
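
Individual instances can also be fetched by key. A minimal sketch, assuming integer keys and string data as indicated by the TypeInfo shown above:

# Fetch a single instance by its key and inspect its raw text
ins = ds.dataset[28]
print(ins.data[:60])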

Train an ML model

We can also train Machine Learning methods on the ground truth data in ds.truth.

from sklearn.pipeline import Pipeline 
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from instancelib.analysis.base import prediction_viewer
import instancelib as il
pipeline = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', LogisticRegression()),
     ])

model = il.SkLearnDataClassifier.build(pipeline, ds)
model.fit_provider(train, ds.truth)
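
Before inspecting individual predictions, we can compute aggregate performance on the test set. This is a sketch assuming instancelib exposes a classifier_performance helper in its analysis tooling; the per-label attribute access below is likewise an assumption:

# Assumed helper: instancelib's classifier_performance
performance = il.classifier_performance(model, test, ds.truth)
# Per-label metrics (attribute name assumed): F1 for the Relevant class
print(performance["Relevant"].f1)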

With the prediction_viewer function, we can view the predictions as a Pandas DataFrame.

# Show the three instances with the highest probability of being Relevant
df = prediction_viewer(model, test, ds.truth).sort_values(
    by="p_Relevant", ascending=False
)
df.head(3)
      data                                                label     prediction  p_Irrelevant  p_Relevant
175   A randomized trial of a computer-based interve...  Relevant  Irrelevant      0.639552    0.360448
1797  Computerized decision support to reduce potent...  Relevant  Irrelevant      0.737130    0.262870
956   Improvement of intraoperative antibiotic proph...  Relevant  Irrelevant      0.762803    0.237197

Although the predicted probabilities are below 0.50, some of the top-ranked documents have the ground truth label Relevant.
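
In a ranking-based workflow, the ordering matters more than the absolute probabilities. As a quick sanity check on the df built above (a sketch; the top-10 cutoff is an arbitrary choice), we can count how many of the highest-ranked test documents are truly relevant:

# Count ground-truth Relevant documents among the top-10 ranked instances
top10 = df.head(10)
n_relevant = (top10["label"] == "Relevant").sum()
print(f"Relevant documents in top 10: {n_relevant} / 10")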

Active Learning

We can integrate the model into an Active Learning method. A simple TAR method is AutoTAR.

from allib.activelearning.autotar import AutoTarLearner

al = AutoTarLearner(ds, model, POS, NEG, k_sample=100, batch_size=20)

To kick off the process, we need some labeled data, so let's supply one relevant and one irrelevant example.

# Select a known relevant and a known irrelevant instance by key
pos_instance = al.env.dataset[28]
neg_instance = al.env.dataset[30]
# Record the labels and mark both instances as labeled
al.env.labels.set_labels(pos_instance, POS)
al.env.labels.set_labels(neg_instance, NEG)
al.set_as_labeled(pos_instance)
al.set_as_labeled(neg_instance)

We can then retrieve the instance that should be labeled next with the following command.

next_instance = next(al)
# next_instance is an Instance object.
# Representation contains a human-readable string version of the instance
print(
    f"{next_instance.representation[:60]}...\n"
    f"Ground Truth Labels: {al.env.truth[next_instance]}"
)
Oral quinolones in hospitalized patients: an evaluation of a...
Ground Truth Labels: frozenset({'Relevant'})
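
In an interactive review, this retrieval step sits inside a loop: fetch the next instance, record a judgment, and mark the instance as labeled so the learner can update. A minimal sketch, substituting the simulation ground truth for a human reviewer (the budget of 20 judgments is an arbitrary choice):

for _ in range(20):
    instance = next(al)
    # A human assessor would supply this judgment; here we
    # copy the ground-truth label from the simulation instead.
    judgment = POS if POS in al.env.truth[instance] else NEG
    al.env.labels.set_labels(instance, judgment)
    al.set_as_labeled(instance)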

Simulation

Using the ground truth data, we can further simulate the TAR process in an automated fashion:

from allib.stopcriterion.heuristic import AprioriRecallTarget
from allib.analysis.tarplotter import TarExperimentPlotter
from allib.analysis.experiments import ExperimentIterator
from allib.analysis.simulation import TarSimulator

recall95 = AprioriRecallTarget(POS, 0.95)
recall100 = AprioriRecallTarget(POS, 1.0)
criteria = {
    "Perfect95": recall95,
    "Perfect100": recall100,
}

# Then we can specify our experiment
exp = ExperimentIterator(al, POS, NEG, criteria, {})
plotter = TarExperimentPlotter(POS, NEG)
simulator = TarSimulator(exp, plotter)
simulator.simulate()
plotter.show()

Command Line Interface

Besides being imported as a library, the code can be used to run some predefined experiments from the command line.

For a CSV in SYNERGY format:

python -m allib benchmark -m Review -d ./path/to/dataset -t ./path/to/results/ -e AUTOTAR -r 42

For a dataset in TREC style:

python -m allib benchmark -m Trec -d ./path/to/dataset/ -t ./path/to/results/ -e AUTOTAR -r 42

Experiment options are:

  • AUTOTAR
  • AUTOSTOP
  • CHAO
  • TARGET
  • CMH

The -r option supplies a seed for the random number generator, so that runs are reproducible.
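
For example, running the CHAO method from the list above uses the same flags; only the -e value changes (the paths are placeholders as before, and CHAO's extra dependencies must be installed, see the Installation section):

python -m allib benchmark -m Review -d ./path/to/dataset -t ./path/to/results/ -e CHAO -r 42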

Installation

See installation.md for an extended installation guide, especially for enabling the CHAO method. Short instructions are below.

Method  Instructions
pip     Install from PyPI via pip install python-allib.
Local   Clone this repository and install via pip install -e ., or run python setup.py install.

Releases

python-allib is officially released through PyPI.

See CHANGELOG for a full overview of the changes for each version.

Citation

Use this BibTeX entry to cite this package, or go to Zenodo to cite a specific version.

@software{bron_2024_10869682,
  author       = {Bron, Michiel},
  title        = {Python Package python-allib},
  month        = mar,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {0.5.1},
  doi          = {10.5281/zenodo.10869682},
  url          = {https://doi.org/10.5281/zenodo.10869682}
}
