Distributed Python Active Learning library
DPyACL
Distributed Python Framework for Active Learning
May 2020
Alfredo Lorie Bernardo
version 0.3
Introduction
DPyACL is a flexible distributed active learning library written in Python, aimed at making active learning experiments simpler and faster. It leverages Dask's distributed features to execute the computations of active learning experiments across a cluster of computers, speeding up computation and tackling scenarios where the data does not fit on a single machine. It has also been developed with a modular, object-oriented design to provide an intuitive, easy-to-use interface and to allow reuse, modification, and extensibility. It offers full compatibility with libraries such as NumPy, SciPy, Pandas, scikit-learn and Keras. The library is available on PyPI and distributed under the GNU license.
To date, DPyACL heavily uses the Dask library to implement, in a distributed and parallel fashion, the most significant strategies that have appeared for the single-label learning paradigm. For future versions, we hope to include strategies related to multi-label learning paradigms.
Download
GitHub: https://github.com/a24lorie/DPyACL
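Since the library is published on PyPI, it can also be installed with pip (assuming the package name matches the project name, dpyacl):

pip install dpyacl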
Using DPyACL
The fastest way to use DPyACL is from a Jupyter Notebook.
Preparing an experiment
When defining an active learning experiment, DPyACL offers a set of pre-defined components that can be configured and combined by the user to best fit their needs. The required components to set up an experiment are listed below:
- The Dataset (a preparation sketch for the dataset and index sets is shown after this list)
- Labelled and unlabelled sets: Optional - the experiment can be configured to randomly choose the initial labelled and unlabelled sets
- An Experiment: HoldOut and KFold experiments are provided
- The AL scenario: The current release provides a Pool Based Scenario
- The Machine Learning Technique: any machine learning technique, from any library, that provides an API compatible with the fit, predict and predict_proba definitions; scikit-learn, Dask-ML and Keras are compatible
- The Evaluation Method(s)
- The Query Strategy
- The Stopping Criteria
- The Oracle: The current release provides a Simulated Oracle
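Before configuring the experiment, the dataset and the index sets used below must be prepared. The following minimal sketch (an illustration, not part of DPyACL, assuming scikit-learn and NumPy are available and that the indices are plain NumPy arrays) builds the _X, _y, train_idx, test_idx, label_idx and unlabel_idx variables referenced in the configuration example:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load a small binary classification dataset
_X, _y = load_breast_cancer(return_X_y=True)

# Split the instance indices into train and test partitions
indices = np.arange(_X.shape[0])
train_idx, test_idx = train_test_split(indices, test_size=0.3, random_state=42)

# Label 10% of the training instances; the rest form the unlabelled pool
label_idx, unlabel_idx = train_test_split(train_idx, train_size=0.1, random_state=42)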
Configuring the experiment
# LogisticRegression comes from scikit-learn; the remaining components
# (MaxIteration, QueryMarginSampling, Accuracy, F1, Precision, Recall,
# HoldOutExperiment, PoolBasedSamplingScenario, SimulatedOracleQueryIndex)
# are DPyACL classes imported from the dpyacl package.
from sklearn.linear_model import LogisticRegression

ml_technique = LogisticRegression(solver='liblinear')
stopping_criteria = MaxIteration(50)
query_strategy = QueryMarginSampling()
performance_metrics = [
    Accuracy(),
    F1(average='macro'),
    Precision(average='macro'),
    Recall(average='macro')]

experiment = HoldOutExperiment(
    _X,
    _y,
    scenario_type=PoolBasedSamplingScenario,
    train_idx=train_idx,
    test_idx=test_idx,
    label_idx=label_idx,
    unlabel_idx=unlabel_idx,
    ml_technique=ml_technique,
    performance_metrics=performance_metrics,
    query_strategy=query_strategy,
    oracle=SimulatedOracleQueryIndex(labels=_y),
    stopping_criteria=stopping_criteria,
    self_partition=False
)
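Note that with self_partition=False the train/test and labelled/unlabelled partitions are supplied explicitly through the index arguments; as mentioned above, the experiment can instead be configured to pick the initial labelled and unlabelled sets randomly.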
Executing the experiment
result = experiment.evaluate(verbose=True)
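The evaluate call runs the active learning loop until the stopping criterion is met and returns the collected results, which are consumed by the analyser below.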
Analyzing the experiment results
query_analyser = ExperimentAnalyserFactory.experiment_analyser(
    performance_metrics=[metric.metric_name for metric in performance_metrics],
    method_name=query_strategy.query_function_name,
    method_results=result,
    type="queries"
)

# plot the learning curves of the experiment
query_analyser.plot_learning_curves(title='Active Learning experiment results')
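The resulting plot should show, for each configured performance metric, how the model's performance evolves as labelled instances are queried from the pool.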
Contribution
If you find a bug, send a pull request and we'll discuss things. If you are not familiar with the term "pull request", I recommend reading the following article for a better understanding.