Skip to main content

A Library for active learning. Supports text classification and sequence tagging tasks.

Project description

🛠 ALToolbox

ALToolbox is a framework for practical active learning in NLP.


Installation | Quick Start | Overview | Docs | Citation

ALToolbox is a framework for active learning annotation in natural language processing. Currently, the framework supports text classification and sequence tagging tasks. ALToolbox provides state-of-the-art query strategies, serverless annotation tool for Jupyter IDE, and a set of tools that help to reduce computational overhead / duration of AL iterations and increase annotated data reusability.

⚙️ Installation

pip install -e .

To annotate instances for active learning in Jupyter Notebook or Jupyter Lab one have to install additional widget after framework installation. In case of Jupyter Notebook usage run:

jupyter nbextension install --py --symlink --sys-prefix text_selector
jupyter nbextension enable --py --sys-prefix text_selector

In case of Jupyter Lab usage run:

jupyter labextension install js
jupyter labextension install text_selector

💫 Quick Start

For quick start, please see the examples of launching an active learning annotation or benchmarking a novel query stategy / unlabeled pool subsampling strategy for sequence tagging and text classification tasks:

# Notebook
1 Launching Active Learning for Token Classification
2 Launching Active Learning for Text Classification
3 Benchmarking a novel AL query strategy / unlabeled pool subsampling strategy

🔭 Overview

1. Query Strategies

# Strategy Citation
1 ALPS Citation
2 BADGE Citation
3 BAIT Citation
4 BALD Citation
5 BatchBALD Citation
6 Breaking Ties (BT) (also Maximum Margin) Citation
7 Contrastive Active Learning (CAL) Citation
8 Cluster Margin Citation
9 Coreset Citation
10 Expected Gradient Length (EGL) Citation
11 Embeddings KM Citation
12 Entropy Citation
13 Least Confidence (LC) Citation
14 Mahalanobis Distance Citation
15 Maximum Normalized Log-Probability (MNLP) Citation
16 Random (No AL) -

3. Unlabeled Pool Subsampling Strategies

# Strategy Citation
1 UPS Citation
2 Naïve Citation
3 Random -

4. Pipelines for postprocessing of annotated data and preparation of acquisition models

  • PLASM postprocessing pipeline for annotated data reusability.
  • Acquisition model distillation.
  • Domain adaptation of acquisition models.

5. GUI Annotator tool in Jupyter IDE

Our framework provides a serverless GUI annotation tool integrated into the Jupyter IDE: GUI

6. Extensible benchmark for query strategies

TODO:

📕 Documentation

Usage

The configs folder contains config files with general settings. The experiments folder contains config files with experimental design. To run an experiment with a chosen configuration, specify config file name in HYDRA_CONFIG_NAME variable and run train.sh script (see ./examples/al for details).

For example to launch PLASM on AG-News with ELECTRA as a successor model:

cd PATH_TO_THIS_REPO
HYDRA_CONFIG_PATH=../experiments/ag_news HYDRA_EXP_CONFIG_NAME=ag_plasm python active_learning/run_tasks_on_multiple_gpus.py

Config structure explanation

  • cuda_devices: list of CUDA devices to use: one experiment on one CUDA device. cuda_devices=[0,1] means using zero-th and first devices.
  • config_name: name of config from configs folder with general settings: dataset, experiment setting (e.g. LC/ASM/PLASM), model checkpoints, hyperparameters etc.
  • config_path: path to config with general settings.
  • command: .py file to run. For AL experiments, use run_active_learning.py.
  • args: arguments to modify from a general config in the current experiment. acquisition_model.name=xlnet-base-cased means that xlnet-base-cased will be used as an acquisition model.
  • seeds: random seeds to use. seeds=[4837, 23419] means that two separate experiments with the same settings (except for seed) will be run: one with seed == 4837, one with seed == 23419.

Output Explanation

By default, the results will be present in the folder RUN_DIRECTORY/workdir_run_active_learning/DATE_OF_RUN/${TIME_OF_RUN}_${SEED}_${MODEL_CHECKPOINT}. For instance, when launching from the repository folder: al_nlp_feasible/workdir/run_active_learning/2022-06-11/15-59-31_23419_distilbert_base_uncased_bert_base_uncased.

  • When running a classic AL experiment (acquisition and successor models coincide, regardless of using UPS), the file with the model metrics is acquisition_metrics.json.
  • When running an acquisition-successor mismatch experiment, the file with the model metrics is successor_metrics.json.
  • When running a PLASM experiment, the file with the model metrics is target_tracin_quantile_-1.0_metrics.json (-1.0 stands for the filtering value, meaning adaptive filtering rate; when using a deterministic filtering rate (e.g. 0.1), the file will be named target_tracin_quantile_0.1_metrics.json). The file with the metrics of the model without filtering is target_metrics.json.

Post-processing

Our framework provides tools for effective data post-processing for its re-usability and a possibility to build powerful models on it. PLASM, which aims to alleviate the acquisition-successor mismatch problem and allow to build a model of an arbitrary type using the labeled data without performance degradation, is implemented in post_processing/pipeline_plasm. It uses the config cls_plasm / ner_plasm (from `jupyterlab_demo/configs). A brief explanation of the config structure:

  • pseudo-labeling model parameters are contained in the key labeling_model;
  • successor model parameters are contained in the key successor_model;
  • post-processing options are contained in the key post_processing:
    • label_smoothing: str / float / None, a parameter for label smoothing (LS) for pseudo-labeled instances. Accepts several options:
      • "adaptive": LS value equals the quality of the labeling model on the validation data.
      • float, 0 < value < 1: absolute value of label smoothing
      • None (default): no label smoothing is used
    • labeled_weight: int / float, weight for the labeled-by-human data. 1 < value < +inf
    • use_subsample_for_pl: int / float / None, the size of the subsample used for pseudo-labeling (float means taking the share of the unlabeled data). None means that no subsampling is used.
    • uncertainty_threshold: float / None, the value of the threshold for filtering by uncertainty. If None, no filtering by uncertainty is used.
    • filter_by_quantile: bool, only used for classification, ignored if uncertainty_threshold is None. If True, uncertainty_threshold most uncertain instances are filtered. Otherwise, all instances whose (1 - max_prob) < uncertainty_threshold are filtered.
    • tracin:
      • use: bool, whether to use TracIn for filtering
      • max_num_processes: int, value > 0, maximum number of processes per one GPU
      • quantile: str / float (0 < value < 1), share of unlabeled data instances to filter using the TracIn score.
      • num_model_checkpoints: int, value > 0, how many model checkpoints to save and use for TracIn.
      • nu: float / int, value for TracIn algorithm.

🆕️ New strategies addition

An AL query strategy should be designed as a function that:

  1. Receives 3 positional arguments and additional strategy kwargs: - model of inherited class TransformersBaseWrapper or PytorchEncoderWrapper or FlairModelWrapper: model wrapper; - X_pool of class Dataset or TransformersDataset: dataset with the unlabeled instances; - n_instances of class int: number of instances to query; - kwargs: additional strategy-specific arguments.
  2. Outputs 3 objects in the following order:
    • query_idx of class array-like: array with the indices of the queried instances;
    • query of class Dataset or TransformersDataset: dataset with the queried instances;
    • uncertainty_estimates of class np.ndarray: uncertainty estimates of the instances from X_pool. The higher the value - the more uncertain the model is in the instance.

The function with the strategy should be named the same as the file where it is placed (e.g. function def my_strategy inside a file path_to_strategy/my_strategy.py). Use your strategy, setting al.strategy=PATH_TO_FILE_YOUR_STRATEGY in the experiment config.

The example is presented in examples/benchmark_custom_strategy.ipynb

🆕️ New pool subsampling strategies addition

The addition of a new pool subsampling query strategy is similar to the addition of an AL query strategy. A subsampling strategy should be designed as a function that:

  1. It must receive 2 positional arguments and additional subsampling strategy kwargs: - uncertainty_estimates of class np.ndarray: uncertainty estimates of the instances in the order they are stored in the unlabeled data; - gamma_or_k_confident_to_save of class float or int: either a share / number of instances to save (as in random / naive subsampling) or an internal parameter (as in UPS); - kwargs: additional subsampling strategy specific arguments.
  2. It must output the indices of the instances to use (sampled indices) of class np.ndarray.

The function with the strategy should be named the same as the file where it is placed (e.g. function def my_subsampling_strategy inside a file path_to_strategy/my_subsampling_strategy.py). Use your subsampling strategy, setting al.sampling_type=PATH_TO_FILE_YOUR_SUBSAMPLING_STRATEGY in the experiment config.

The example is presented in examples/benchmark_custom_strategy.ipynb

Datasets

The research has employed 2 Token Classification datasets (CoNLL-2003, OntoNotes-2012) and 2 Text Classification datasets (AG-News, IMDB). If one wants to launch an experiment on a custom dataset, they need to use one of the following ways to add it:

  1. Upload to Hugging Face datasets and set: config.data.path=datasets, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME
  2. Upload to data/DATASET_NAME folder, create train.csv / train.json file with the dataset, and set: config.data.path=PATH_TO_THIS_REPO/data, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME
  3. * Upload to data/DATASET_NAME train.txt, dev.txt, and test.txt files and set the arguments as in the previous point.
  4. ** Upload to data/DATASET_NAME with each folder for each class, where each file in the folder contains a text with the label of the folder. For details, please see the bbc_news dataset in ./data. The arguments must be set as in the previous two points.

* - only for Token Classification datasets

** - only for Text Classification datasets

Models

The current version of the repository supports all models from HuggingFace Transformers, which can be used with AutoModelForSequenceClassification / AutoModelForTokenClassification classes (for Text / Token classification). For CNN-based / BiLSTM-CRF models, please see the al_cls_cnn.yaml / al_ner_bilstm_crf_flair.yaml configs from ./configs folder for details.

Testing

By default, the tests will be run on the cuda:0 device if CUDA is available or on CPU, otherwise. If one wants to manually specify the device for running the tests:

  • On CPU: CUDA_VISIBLE_DEVICES="" python -m pytest PATH_TO_REPO/tests;
  • On CUDA: CUDA_VISIBLE_DEVICES="DEVICE_OR_DEVICES_NUMBER" python -m pytest PATH_TO_REPO/tests.

We recommend to use CPU for the robustness of the results. The tests for CUDA are written under Tesla V100-SXM3 32GB, CUDA V.10.1.243.

👯 Alternatives

FAMIE, Small-Text, modAL, ALiPy, libact

💬 Citation


📄 License

© 2022 Autonomous Non-Profit Organization "Artificial Intelligence Research Institute" (AIRI). All rights reserved.

Licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tmpproject-0.0.2.tar.gz (171.5 kB view details)

Uploaded Source

File details

Details for the file tmpproject-0.0.2.tar.gz.

File metadata

  • Download URL: tmpproject-0.0.2.tar.gz
  • Upload date:
  • Size: 171.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.0

File hashes

Hashes for tmpproject-0.0.2.tar.gz
Algorithm Hash digest
SHA256 8a0ef2460a20b5d0391fc84e23c854645da6f0079a3abafe85382018671d93dd
MD5 b92c1d46a8a16461cafcb4374c8aecf8
BLAKE2b-256 932499d75e5e5347553d8102c31a2bc453d1c078a996317d8d47759fe0102386

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page