Sequence labeling active learning framework for Python

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Software Development :: Libraries

Project description

SeqAL

SeqAL is a sequence labeling active learning framework based on Flair.

Installation

Install this via pip (or your favourite package manager):

pip install seqal
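
To confirm that the install succeeded, one option is to ask Python for the installed version. This is a minimal sketch that uses only the standard library; the helper name `installed_version` is ours and is not part of SeqAL.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string once `pip install seqal` has run, otherwise None.
    print(installed_version("seqal"))
```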

Usage

To understand what SeqAL can do, we first introduce the pool-based active learning cycle.

[Figure: the pool-based active learning cycle]

- Step 0: Prepare seed data (a small amount of labeled data used for the initial training)
- Step 1: Train the model on the seed data
- Step 2: Predict on the unlabeled data with the trained model
- Step 3: Query informative samples based on the predictions
- Step 4: An annotator (oracle) annotates the selected samples
- Step 5: Add the newly labeled samples to the labeled dataset
- Step 6: Retrain the model
Repeat steps 2 to 6 until the model's F1 score exceeds the threshold or the annotation budget is exhausted.
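
The cycle above can be sketched without any library. The following is a hedged illustration, not SeqAL code: the function `active_learning_loop`, its parameters, and the toy threshold-learning task are all assumptions made up for this example. The oracle is a function standing in for the human annotator.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[Any, Any]  # (input, label)


def active_learning_loop(
    seed: List[Example],
    pool: List[Any],
    train: Callable[[List[Example]], Any],
    uncertainty: Callable[[Any, Any], float],
    oracle: Callable[[Any], Any],
    evaluate: Callable[[Any], float],
    query_size: int,
    budget: int,
    target_score: float,
) -> Tuple[Any, List[Example]]:
    """Run the pool-based cycle until the score or the budget stops it."""
    labeled = list(seed)                     # Step 0: seed data
    remaining = list(pool)
    model = train(labeled)                   # Step 1: initial training
    while remaining and budget > 0 and evaluate(model) < target_score:
        k = min(query_size, budget, len(remaining))
        # Steps 2-3: score the pool with the model, query the most uncertain items
        ranked = sorted(remaining, key=lambda x: uncertainty(model, x), reverse=True)
        queried, remaining = ranked[:k], ranked[k:]
        # Steps 4-5: the oracle labels the queried items, which join the labeled set
        labeled.extend((x, oracle(x)) for x in queried)
        budget -= k
        model = train(labeled)               # Step 6: retrain
    return model, labeled


# Toy task (hypothetical): learn a threshold on integers; the true boundary is 70.
def train_threshold(labeled: List[Example]) -> float:
    negatives = [x for x, y in labeled if not y]
    positives = [x for x, y in labeled if y]
    return (max(negatives) + min(positives)) / 2


def accuracy(threshold: float) -> float:
    points = range(101)
    return sum((x >= threshold) == (x >= 70) for x in points) / len(points)


model, labeled = active_learning_loop(
    seed=[(0, False), (100, True)],
    pool=list(range(1, 100)),
    train=train_threshold,
    uncertainty=lambda thr, x: -abs(x - thr),  # closest to the boundary = most uncertain
    oracle=lambda x: x >= 70,                  # stands in for the human annotator
    evaluate=accuracy,
    query_size=1,
    budget=10,
    target_score=1.0,
)
print(f"threshold={model}, labeled={len(labeled)}, accuracy={accuracy(model)}")
```

In this toy run the loop stops as soon as the accuracy reaches the target, so only part of the annotation budget is spent. Both stopping conditions from the list (score threshold and budget) appear in the `while` condition.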

SeqAL covers every step except step 0 and step 4. The script below demonstrates how to use SeqAL to implement this workflow.

from seqal.active_learner import ActiveLearner
from seqal.samplers import LeastConfidenceSampler
from seqal.alinger import Alinger
from seqal.datasets import ColumnCorpus
from seqal.utils import load_plain_text
from xxxx import annotate_by_human  # Placeholder module: users must provide this function themselves


# Step 0: Preparation
## Prepare seed data, validation data, and test data
columns = {0: "text", 1: "pos", 2: "syntactic_chunk", 3: "ner"}
data_folder = "./datasets/conll"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train_seed.txt",
    dev_file="valid.txt",
    test_file="test.txt",
)

## Unlabeled data pool
file_path = "./datasets/conll/train_datapool.txt"
unlabeled_sentences = load_plain_text(file_path)

## Initialize ActiveLearner
## Note: tagger_params and trainer_params must be defined by the user before this call
learner = ActiveLearner(
  tagger_params=tagger_params,   # Model parameters (hidden size, embedding, etc.)
  query_strategy=LeastConfidenceSampler(),  # Query algorithm
  corpus=corpus,                 # Corpus contains training, validation, test data
  trainer_params=trainer_params  # Trainer parameters (epoch, batch size, etc.)
)

# Step 1: Initial training of the model
learner.initialize()

# Steps 2 & 3: Predict on the unlabeled data and query informative samples
_, queried_samples = learner.query(unlabeled_sentences)  # The data pool loaded above
queried_samples = [{"text": sent.to_plain_string()} for sent in queried_samples]  # Convert Sentence objects to plain text
# queried_samples:
# [
#   {
#     "text": "Tokyo is a city"
#   }
# ]

# Step 4: The annotator annotates the selected samples
new_labels = annotate_by_human(queried_samples)
# new_labels:
# [
#   {
#     "text": ['Tokyo', 'is', 'a', 'city'],
#     "labels": ['B-LOC', 'O', 'O', 'O']
#   }
# ]

## Convert data to Sentence class
alinger = Alinger()
new_labeled_samples = alinger.add_tags_on_token(new_labels, 'ner')
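
The sampler name in the script points to the classic least-confidence query strategy. As a rough illustration of that idea only, and not of SeqAL's actual implementation, the sketch below scores each sentence as one minus the probability of its most likely tag sequence. It assumes that this probability can be approximated by the product of the per-token maximum probabilities. The function names, the `query_number` parameter, and the probabilities are made up for the example.

```python
import math
from typing import Dict, List

TokenProbs = List[Dict[str, float]]  # one {tag: probability} dict per token


def least_confidence(token_probs: TokenProbs) -> float:
    """Return 1 - P(best sequence), with P approximated as a product of per-token maxima."""
    best_sequence_prob = math.prod(max(probs.values()) for probs in token_probs)
    return 1.0 - best_sequence_prob


def query(pool: Dict[str, TokenProbs], query_number: int) -> List[str]:
    """Return the `query_number` sentences the model is least confident about."""
    ranked = sorted(pool, key=lambda text: least_confidence(pool[text]), reverse=True)
    return ranked[:query_number]


# Hypothetical model output for two sentences (made-up probabilities).
pool = {
    "Tokyo is a city": [
        {"B-LOC": 0.9, "O": 0.1}, {"O": 0.99, "B-LOC": 0.01},
        {"O": 0.99, "B-LOC": 0.01}, {"O": 0.95, "B-LOC": 0.05},
    ],
    "Jordan visited Washington": [
        {"B-PER": 0.5, "B-LOC": 0.4, "O": 0.1}, {"O": 0.9, "B-LOC": 0.1},
        {"B-LOC": 0.55, "B-PER": 0.45},
    ],
}
print(query(pool, query_number=1))  # → ['Jordan visited Washington']
```

The second sentence is selected because the model hesitates between person and location tags, which is the kind of sample an annotator's label helps most with.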

Tutorials

We provide a set of quick tutorials to get you started with the library.

Performance

On the CoNLL 2003 English dataset, active learning algorithms reach 97% of the performance of the best deep model trained on the full data while using only 30% of the training data. The CPU model greatly reduces the time cost while sacrificing only a little performance.

See performance.md for more detail about performance and time cost.

Set up the environment locally

If you want to make a PR or implement something locally, follow the instructions below to set up the development environment.

We use conda as the environment management tool, so install it first.

Then create an environment named "seqal" from the environment.yml file.

conda env create -f environment.yml

Then we activate the environment.

conda activate seqal

Install poetry for dependency management.

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -

Add the Poetry path to your shell configuration file (bashrc, zshrc, etc.).

export PATH="$HOME/.poetry/bin:$PATH"

Install the dependencies from pyproject.toml.

poetry install

You can now develop locally.

To delete the local environment, run the command below.

conda remove --name seqal --all

Credits

Project details

Release history

- 0.3.5 (Oct 19, 2022)
- 0.3.4 (Oct 18, 2022)
- 0.3.3 (Oct 13, 2022)
- 0.3.2 (Oct 13, 2022)
- 0.3.1 (Aug 22, 2022)
- 0.3.0 (Aug 17, 2022), this version
- 0.2.2 (Aug 20, 2021)
- 0.2.0 (Aug 4, 2021)
- 0.1.3 (Jun 7, 2021)
- 0.1.0 (Jun 3, 2021)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

seqal-0.3.0.tar.gz (24.9 kB)

Uploaded Aug 17, 2022 Source

Built Distribution

seqal-0.3.0-py3-none-any.whl (25.3 kB)

Uploaded Aug 17, 2022 Python 3

Hashes for seqal-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`551878017f39c0549819dc8060c456feef0f09af9137d2f8f99f04de1859ec2f`
MD5	`ca1d4d4723777fdd31e7926cf36b61bb`
BLAKE2b-256	`5317c889f16b26e16d6f63f2770772ceeae7a8ca7dc44a0dff911e852a7e4022`

Hashes for seqal-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`988824fd3f0adf721f1d47af64c5b9136871eb8e223252c25501aedea1af1c7c`
MD5	`29ec53a7aad341911a19b838134ebc86`
BLAKE2b-256	`db582ab9ed6f1d16c7fd6171ff3963e784600ad2a3b6f3fa17e65d275f54863a`
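
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches` and the local file path are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed location of the downloaded wheel; adjust to wherever you saved it.
    wheel = Path("seqal-0.3.0-py3-none-any.whl")
    expected = "988824fd3f0adf721f1d47af64c5b9136871eb8e223252c25501aedea1af1c7c"
    if wheel.exists():
        print("OK" if matches(wheel, expected) else "MISMATCH")
    else:
        print(f"{wheel} not found")
```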