
Distributed Python Active Learning library



                             DPyACL
            Distributed Python Framework for Active Learning

                           May 2020
                     Alfredo Lorie Bernardo

                         version 0.3

Introduction

DPyACL is a flexible distributed Active Learning library written in Python, aimed at making active learning experiments simpler and faster. It leverages Dask's distributed features to run the computations of an active learning experiment across a cluster of computers, which speeds up computation and makes it possible to tackle scenarios where the data does not fit on a single computer. It has been developed with a modular, object-oriented design to provide an intuitive, easy-to-use interface and to allow reuse, modification, and extension. It also offers full compatibility with libraries like NumPy, SciPy, Pandas, Scikit-learn and Keras. This library is available on PyPI and distributed under the GNU license.

To date, DPyACL relies heavily on the Dask library to implement, in a distributed and parallel fashion, the most significant strategies that have appeared for the single-label setting. In future versions, we hope to include strategies for multi-label learning paradigms.

Download

GitHub: https://github.com/a24lorie/DPyACL

Using DPyACL

The fastest way to use DPyACL is from a Jupyter Notebook.

Preparing an experiment

When defining an Active Learning experiment, DPyACL offers a set of predefined components that the user can configure and combine to better fit their needs. The components required to set up an experiment are listed below:

  1. The Dataset
  2. Labeled and unlabeled sets: Optional. The experiment can be configured to randomly choose the initial labeled and unlabeled sets
  3. An Experiment: HoldOut and KFold experiments are provided
  4. The AL scenario: The current release provides a Pool Based Scenario
  5. The Machine Learning Technique: Any model from a library whose API is compatible with the fit, predict and predict_proba definitions can be used. Scikit-learn, Dask-ML and Keras are compatible
  6. The Evaluation Method(s)
  7. The Query Strategy
  8. The Stopping Criteria
  9. The Oracle: The current release provides a Simulated Oracle
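
As an illustration of item 2, the index sets that a hold-out experiment works with can also be built by hand. The sketch below uses only the Python standard library. The helper name split_indices and its parameters (test_ratio, initial_label_rate, seed) are hypothetical and are not part of DPyACL; it only shows one plausible way to obtain the four index sets.

```python
import random


def split_indices(n_samples, test_ratio=0.3, initial_label_rate=0.05, seed=0):
    """Return (train_idx, test_idx, label_idx, unlabel_idx) as sorted lists.

    Hypothetical helper for illustration; not a DPyACL function.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Hold out a fraction of the samples for testing.
    n_test = round(n_samples * test_ratio)
    test_idx = sorted(indices[:n_test])
    train_idx = indices[n_test:]

    # Inside the training part, a small random subset starts out labeled;
    # the remainder is the unlabeled pool the query strategy selects from.
    n_labeled = max(1, round(len(train_idx) * initial_label_rate))
    label_idx = sorted(train_idx[:n_labeled])
    unlabel_idx = sorted(train_idx[n_labeled:])
    return sorted(train_idx), test_idx, label_idx, unlabel_idx


train_idx, test_idx, label_idx, unlabel_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(label_idx), len(unlabel_idx))
```

The labeled and unlabeled sets always partition the training set, so every training index belongs to exactly one of them.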

Configuring the experiment

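# LogisticRegression comes from scikit-learn. The remaining classes are
# provided by DPyACL (their import paths are not shown here).
from sklearn.linear_model import LogisticRegression

# _X, _y and the index sets (train_idx, test_idx, label_idx, unlabel_idx)
# are assumed to have been prepared beforehand.
ml_technique = LogisticRegression(solver='liblinear')
stopping_criteria = MaxIteration(50)
query_strategy = QueryMarginSampling()
performance_metrics = [
    Accuracy(),
    F1(average='macro'),
    Precision(average='macro'),
    Recall(average='macro'),
]

experiment = HoldOutExperiment(
    _X,
    _y,
    scenario_type=PoolBasedSamplingScenario,
    train_idx=train_idx,
    test_idx=test_idx,
    label_idx=label_idx,
    unlabel_idx=unlabel_idx,
    ml_technique=ml_technique,
    performance_metrics=performance_metrics,
    query_strategy=query_strategy,
    oracle=SimulatedOracleQueryIndex(labels=_y),
    stopping_criteria=stopping_criteria,
    self_partition=False
)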
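
The configuration above selects QueryMarginSampling as the query strategy. Assuming it follows the usual definition of margin sampling, the strategy prefers the samples whose two most probable classes are closest to each other. The sketch below shows that general technique in plain Python; it is not DPyACL's implementation, and the names margin and select_by_margin are hypothetical.

```python
def margin(probabilities):
    """Difference between the two largest class probabilities of one sample."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_by_margin(proba_rows, unlabel_idx, batch_size=1):
    """Return the `batch_size` unlabeled indices with the smallest margins.

    proba_rows[i] holds the predicted class probabilities for unlabel_idx[i],
    for example one row of a model's predict_proba output.
    """
    ranked = sorted(zip(unlabel_idx, proba_rows), key=lambda pair: margin(pair[1]))
    return [index for index, _ in ranked[:batch_size]]


# Toy predict_proba output for three unlabeled samples with indices 10, 11, 12.
proba_rows = [
    [0.90, 0.05, 0.05],  # confident prediction: margin of about 0.85
    [0.40, 0.35, 0.25],  # ambiguous prediction: margin of about 0.05
    [0.60, 0.30, 0.10],  # margin of about 0.30
]
selected = select_by_margin(proba_rows, [10, 11, 12], batch_size=2)
print(selected)  # [11, 12]
```

A small margin means the model can barely separate its top two candidate classes, so a label for that sample is expected to be the most informative.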

Execute the experiment

result = experiment.evaluate(verbose=True)
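
Conceptually, a pool-based run repeats the same cycle until the stopping criterion is met: train on the labeled set, measure performance on the test set, ask the query strategy for an unlabeled sample, and obtain its label from the oracle. The sketch below illustrates that general cycle with a toy one-dimensional model in plain Python. It is not how DPyACL implements evaluate, and CentroidClassifier, simulated_oracle and run_pool_based are hypothetical names introduced only for this example.

```python
import math


class CentroidClassifier:
    """Toy 1-D model exposing fit, predict and predict_proba."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = [
            sum(x for x, label in zip(X, y) if label == c) / y.count(c)
            for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        rows = []
        for x in X:
            # Normalized exp(-distance) to each class centroid.
            scores = [math.exp(-abs(x - m)) for m in self.centroids_]
            total = sum(scores)
            rows.append([s / total for s in scores])
        return rows

    def predict(self, X):
        return [self.classes_[row.index(max(row))] for row in self.predict_proba(X)]


def simulated_oracle(index, y_true):
    """Stand-in for a human annotator: returns the stored ground-truth label."""
    return y_true[index]


def run_pool_based(X, y_true, label_idx, unlabel_idx, test_idx, max_iterations):
    # The initial labeled set must contain at least two classes.
    known = {i: y_true[i] for i in label_idx}
    pool = list(unlabel_idx)
    model = CentroidClassifier()
    accuracy_history = []
    for _ in range(max_iterations):  # stopping criterion: iteration cap
        if not pool:
            break
        model.fit([X[i] for i in known], list(known.values()))
        # Evaluation on the held-out test set.
        predictions = model.predict([X[i] for i in test_idx])
        hits = sum(p == y_true[i] for p, i in zip(predictions, test_idx))
        accuracy_history.append(hits / len(test_idx))
        # Query strategy: pick the pool sample with the smallest margin.
        margins = []
        for row in model.predict_proba([X[i] for i in pool]):
            best, second = sorted(row, reverse=True)[:2]
            margins.append(best - second)
        chosen = pool.pop(margins.index(min(margins)))
        # Oracle: reveal the label and move the sample to the labeled set.
        known[chosen] = simulated_oracle(chosen, y_true)
    return sorted(known), pool, accuracy_history


X = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
final_labeled, final_unlabeled, accuracy_history = run_pool_based(
    X, y, label_idx=[0, 9], unlabel_idx=[1, 3, 4, 5, 6, 8],
    test_idx=[2, 7], max_iterations=3,
)
print(final_labeled, final_unlabeled, accuracy_history)
```

Each iteration moves exactly one index from the unlabeled pool to the labeled set, so after three iterations the labeled set has grown from two to five samples.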

Analyze the experiment results

query_analyser = ExperimentAnalyserFactory.experiment_analyser(
    performance_metrics=[metric.metric_name for metric in performance_metrics],
    method_name=query_strategy.query_function_name,
    method_results=result,
    type="queries"
)

# plot the learning curves of the experiment
query_analyser.plot_learning_curves(title='Active Learning experiment results')

Contribution

If you find a bug, send a pull request and we'll discuss things. If you are not familiar with the term "pull request", I recommend reading the following article for a better understanding.
