# A typed Active Learning Library

## Project description

Active Learning and Technology-Assisted Review library for Python.

`python-allib` is a library that enables efficient data annotation with Active Learning on various types of datasets. Through the library `instancelib`, we support various machine learning algorithms and instance types. Besides canonical Active Learning, this library offers Technology-Assisted Review (TAR) methods, which make High-Recall Information Retrieval tasks more efficient.
© Michiel Bron, 2024
## Quick tour of Technology-Assisted Review simulation

### Load dataset

Load the dataset in an `instancelib` environment.
```python
# Some imports
from pathlib import Path
from allib.benchmarking.datasets import TarDataset, DatasetType

POS = "Relevant"
NEG = "Irrelevant"

# Load a dataset in SYNERGY/ASREVIEW format
dataset_description = TarDataset(
    DatasetType.REVIEW,
    Path("./allib/tests/testdataset.csv"))

# Get an instancelib Environment object
ds = dataset_description.env
ds
```
```
Environment(dataset=InstanceProvider(length=2019),
            labels=LabelProvider(labelset=frozenset({'Relevant', 'Irrelevant'}),
                                 length=0,
                                 statistics={'Relevant': 0, 'Irrelevant': 0}),
            named_providers={},
            length=2019,
            typeinfo=TypeInfo(identifier=int, data=str, vector=NoneType, representation=str))
```
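Individual documents can be retrieved from the dataset provider by their identifier. A minimal sketch, using the accessors that appear later in this tour (`ds.dataset[...]` and the `representation` attribute); identifier 28 is just an example:

```python
# Retrieve a single instance by its integer identifier
doc = ds.dataset[28]

# representation holds a human-readable string version of the instance
print(doc.representation[:60])
```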
The `ds` object is currently loaded in TAR simulation mode. This means that, as at the start of a real review process, there is no labeled data yet; this is visible in the statistics of the `ds` object. However, as this is simulation mode, a ground truth is available. It can be accessed as follows:
```python
ds.truth
```

```
LabelProvider(labelset=frozenset({'Relevant', 'Irrelevant'}),
              length=2019,
              statistics={'Relevant': 101, 'Irrelevant': 1918})
```
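The statistics above show a heavily imbalanced dataset, which is typical for TAR tasks. A quick back-of-the-envelope check with plain Python, using the numbers printed above:

```python
# Prevalence of relevant documents in the ground truth above
n_relevant, n_total = 101, 2019
print(f"Prevalence: {n_relevant / n_total:.1%}")  # -> Prevalence: 5.0%
```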
In Active Learning, we are dealing with a partially labeled dataset. There are two `InstanceProvider` objects inside the `ds` object that maintain the label status:

```python
print(f"Unlabeled: {ds.unlabeled}, Labeled: {ds.labeled}")
```

```
Unlabeled: InstanceProvider(length=2019), Labeled: InstanceProvider(length=0)
```
### Basic operations

The `ds` object supports all `instancelib` operations, for example, dividing the dataset into a train and a test set.

```python
train, test = ds.train_test_split(ds.dataset, train_size=0.70)
print(f"Train: {train}, Test: {test}")
```

```
Train: InstanceProvider(length=1413), Test: InstanceProvider(length=606)
```
### Train a ML model

We can also train Machine Learning methods on the ground truth data in `ds.truth`.

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from instancelib.analysis.base import prediction_viewer
import instancelib as il

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

model = il.SkLearnDataClassifier.build(pipeline, ds)
model.fit_provider(train, ds.truth)
```
With the method `prediction_viewer` we can view the predictions as a Pandas DataFrame.

```python
# Show the three instances with the highest probability to be Relevant
df = prediction_viewer(model, test, ds.truth).sort_values(
    by="p_Relevant", ascending=False
)
df.head(3)
```
| | data | label | prediction | p_Irrelevant | p_Relevant |
|---|---|---|---|---|---|
| 175 | A randomized trial of a computer-based interve... | Relevant | Irrelevant | 0.639552 | 0.360448 |
| 1797 | Computerized decision support to reduce potent... | Relevant | Irrelevant | 0.737130 | 0.262870 |
| 956 | Improvement of intraoperative antibiotic proph... | Relevant | Irrelevant | 0.762803 | 0.237197 |
Although the prediction probabilities are below 0.50, some of the top-ranked documents have the ground-truth label Relevant.
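In TAR, the ranking induced by `p_Relevant` matters more than the 0.5 classification threshold. As a quick sanity check on the DataFrame above (a sketch; it assumes the column names shown in the output):

```python
# df is sorted by p_Relevant in descending order; check what fraction of the
# top-ranked documents carry the ground-truth label Relevant
top = df.head(25)
print((top["label"] == "Relevant").mean())
```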
### Active Learning

We can integrate the model in an Active Learning method. A simple TAR method is AutoTAR.

```python
from allib.activelearning.autotar import AutoTarLearner

al = AutoTarLearner(ds, model, POS, NEG, k_sample=100, batch_size=20)
```
To kick off the process, we need some labeled data. Let’s give it some training data.

```python
pos_instance = al.env.dataset[28]
neg_instance = al.env.dataset[30]

al.env.labels.set_labels(pos_instance, POS)
al.env.labels.set_labels(neg_instance, NEG)

al.set_as_labeled(pos_instance)
al.set_as_labeled(neg_instance)
```
We can then retrieve the instance that should be labeled next with the following command.

```python
next_instance = next(al)

# next_instance is an Instance object.
# Representation contains a human-readable string version of the instance
print(
    f"{next_instance.representation[:60]}...\n"
    f"Ground Truth Labels: {al.env.truth[next_instance]}"
)
```
```
Oral quinolones in hospitalized patients: an evaluation of a...
Ground Truth Labels: frozenset({'Relevant'})
```
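These primitives (`next(al)`, `set_labels`, and `set_as_labeled`) are all a review loop needs. Below is a minimal sketch of such a loop, using the ground truth from `al.env.truth` as a stand-in for a human annotator; in a real review, the labels would come from reading each document:

```python
# Label ten more instances, asking the learner what to read next
for _ in range(10):
    instance = next(al)
    # Oracle lookup: a human would read instance.representation instead
    for label in al.env.truth[instance]:
        al.env.labels.set_labels(instance, label)
    al.set_as_labeled(instance)
```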
### Simulation

Using the ground truth data, we can further simulate the TAR process in an automated fashion:

```python
from allib.stopcriterion.heuristic import AprioriRecallTarget
from allib.analysis.tarplotter import TarExperimentPlotter
from allib.analysis.experiments import ExperimentIterator
from allib.analysis.simulation import TarSimulator

recall95 = AprioriRecallTarget(POS, 0.95)
recall100 = AprioriRecallTarget(POS, 1.0)
criteria = {
    "Perfect95": recall95,
    "Perfect100": recall100,
}

# Then we can specify our experiment
exp = ExperimentIterator(al, POS, NEG, criteria, {})
plotter = TarExperimentPlotter(POS, NEG)
simulator = TarSimulator(exp, plotter)

simulator.simulate()
plotter.show()
```
## Command Line Interface

Besides importing the library, the code can be used to run predefined experiments.

For a CSV in SYNERGY format:

```
python -m allib benchmark -m Review -d ./path/to/dataset -t ./path/to/results/ -e AUTOTAR -r 42
```

For a dataset in TREC style:

```
python -m allib benchmark -m Trec -d ./path/to/dataset/ -t ./path/to/results/ -e AUTOTAR -r 42
```
Experiment options are:

- AUTOTAR
- AUTOSTOP
- CHAO
- TARGET
- CMH

The `-r` option supplies a seed value that is given to the random generator.
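Because the seed fixes the random generator's state, repeating a benchmark with several seeds gives a sense of run-to-run variability. A sketch using Python's standard `subprocess` module (the dataset and results paths are placeholders, as above):

```python
import subprocess

# Run the AUTOTAR benchmark with three different seeds
for seed in (1, 2, 3):
    subprocess.run(
        ["python", "-m", "allib", "benchmark",
         "-m", "Review",
         "-d", "./path/to/dataset",
         "-t", "./path/to/results/",
         "-e", "AUTOTAR",
         "-r", str(seed)],
        check=True,
    )
```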
## Installation

See installation.md for an extended installation guide, especially for enabling the CHAO method. Short instructions are below.
| Method | Instructions |
|---|---|
| pip | Install from PyPI via `pip install python-allib`. |
| Local | Clone this repository and install via `pip install -e .`, or run `python setup.py install` locally. |
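After installation, a quick sanity check using only the standard library and the import name used throughout this tour:

```python
from importlib.metadata import version

import allib  # import name of the python-allib package

print(version("python-allib"))  # prints the installed version, e.g. 0.5.2
```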
## Releases

`python-allib` is officially released through PyPI.

See CHANGELOG for a full overview of the changes for each version.
## Citation

Use the BibTeX entry below to cite this package, or go to Zenodo to cite a specific version.

```bibtex
@software{bron_2024_10869682,
  author    = {Bron, Michiel},
  title     = {Python Package python-allib},
  month     = mar,
  year      = 2024,
  publisher = {Zenodo},
  version   = {0.5.1},
  doi       = {10.5281/zenodo.10869682},
  url       = {https://doi.org/10.5281/zenodo.10869682}
}
```
## Maintenance

### Contributors

- Michiel Bron (@mpbron)
## Download files
### Source Distribution

Details for the file `python-allib-0.5.2.tar.gz`.

File metadata:

- Download URL: python-allib-0.5.2.tar.gz
- Upload date:
- Size: 1.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.4

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0d041bbedf4737085c59bfbc8b2525f1c6485052ff20d940a25002c718f4b23f |
| MD5 | 7ea3641429f336f13ba494b3888bb6f9 |
| BLAKE2b-256 | c0a242f66e1b7cebf9f234440b54e634a05db387f110a862bcb2abebea793d1c |
### Built Distribution

Details for the file `python_allib-0.5.2-py3-none-any.whl`.

File metadata:

- Download URL: python_allib-0.5.2-py3-none-any.whl
- Upload date:
- Size: 1.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.4

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0aba98855c848136eaf79df58f230849ed578dd3ffd7544f4b2fa6ae665b98ce |
| MD5 | fb3f9ae85c27c90de0cb9d403f76efe7 |
| BLAKE2b-256 | 750a2da29102a094a3baa50b85f2ab71a2abb112626b1fe5a881bebef2f9256c |