Sequence labeling active learning framework for Python
SeqAL
SeqAL is a sequence labeling active learning framework based on Flair.
Installation
Install this via pip (or your favourite package manager):
pip install seqal
Usage
To understand what SeqAL can do, we first introduce the pool-based active learning cycle.
- Step 0: Prepare seed data (a small amount of labeled data used for training)
- Step 1: Train the model with the seed data
- Step 2: Predict on the unlabeled data with the trained model
- Step 3: Query informative samples based on the predictions
- Step 4: The annotator (oracle) annotates the selected samples
- Step 5: Add the newly labeled samples to the labeled dataset
- Step 6: Retrain the model
- Repeat steps 2 to 6 until the model's F1 score exceeds the threshold or the annotation budget runs out
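The cycle above can be sketched without any library. The following is a minimal, self-contained illustration on a toy 1-D problem, not SeqAL's implementation: the names `train`, `predict_proba`, `least_confidence`, and `run_cycle` are hypothetical, the "model" is just a threshold between two class means, and the loop stops only when the annotation budget is used up (the F1-based stopping condition is left out for brevity).

```python
import math
from typing import Callable, List, Tuple

Labeled = Tuple[float, int]  # (sample, label) pair for the toy problem


def train(labeled: List[Labeled]) -> float:
    """Toy 'model': the midpoint between the two class means (steps 1 and 6)."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def predict_proba(threshold: float, x: float) -> float:
    """Probability of class 1 under the toy model (step 2)."""
    return 1.0 / (1.0 + math.exp(-(x - threshold)))


def least_confidence(p_one: float) -> float:
    """1 minus the probability of the most likely class; higher = less certain."""
    return 1.0 - max(p_one, 1.0 - p_one)


def run_cycle(
    seed: List[Labeled],
    pool: List[float],
    oracle: Callable[[float], int],
    query_size: int,
    budget: int,
) -> Tuple[float, List[Labeled]]:
    labeled, pool = list(seed), list(pool)
    model = train(labeled)  # Step 1: train on the seed data
    while budget > 0 and pool:
        # Steps 2-3: score the pool and pick the least confident samples
        pool.sort(
            key=lambda x: least_confidence(predict_proba(model, x)),
            reverse=True,
        )
        k = min(query_size, budget, len(pool))
        queried, pool = pool[:k], pool[k:]
        # Step 4: the oracle (a human in practice) provides labels
        new_labeled = [(x, oracle(x)) for x in queried]
        # Step 5: add the new labels to the labeled dataset
        labeled.extend(new_labeled)
        budget -= k
        # Step 6: retrain
        model = train(labeled)
    return model, labeled


# Usage: the oracle here is a stand-in rule, not a real annotator.
model, labeled = run_cycle(
    seed=[(0.0, 0), (10.0, 1)],
    pool=[1.0, 4.0, 4.9, 5.2, 9.0],
    oracle=lambda x: int(x > 4.5),
    query_size=2,
    budget=4,
)
```

In this toy run, the samples closest to the current decision threshold are queried first, which is the intuition behind uncertainty-based query strategies.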
SeqAL covers all steps except step 0 and step 4. Below is a simple script that demonstrates how to implement this workflow with SeqAL.
from seqal.active_learner import ActiveLearner
from seqal.samplers import LeastConfidenceSampler
from seqal.alinger import Alinger
from seqal.datasets import ColumnCorpus
from seqal.utils import load_plain_text
from xxxx import annotate_by_human  # Users need to provide this function
# Step 0: Preparation
## Prepare Seed data, valid data, and test data
columns = {0: "text", 1: "pos", 2: "syntactic_chunk", 3: "ner"}
data_folder = "./datasets/conll"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train_seed.txt",
    dev_file="valid.txt",
    test_file="test.txt",
)
## Unlabeled data pool
file_path = "./datasets/conll/train_datapool.txt"
unlabeled_sentences = load_plain_text(file_path)
## Initialize ActiveLearner
## tagger_params and trainer_params must be defined beforehand (see Tutorial 3: Active Learner Setup)
learner = ActiveLearner(
    tagger_params=tagger_params,  # Model parameters (hidden size, embedding, etc.)
    query_strategy=LeastConfidenceSampler(),  # Query algorithm
    corpus=corpus,  # Corpus containing training, validation, and test data
    trainer_params=trainer_params,  # Trainer parameters (epoch, batch size, etc.)
)
# Step 1: Initial training of the model
learner.initialize()
# Step 2&3: Predict on unlabeled data and query informative data
_, queried_samples = learner.query(unlabeled_sentences)
# Convert each sentence object to plain text
queried_samples = [{"text": sent.to_plain_string()} for sent in queried_samples]
# queried_samples:
# [
# {
# "text": "Tokyo is a city"
# }
# ]
# Step 4: The annotator annotates the selected samples
new_labels = annotate_by_human(queried_samples)
# new_labels:
# [
# {
# "text": ['Tokyo', 'is', 'a', 'city'],
# "labels": ['B-LOC', 'O', 'O', 'O']
# }
# ]
## Convert the data to a suitable format
alinger = Alinger()
new_labeled_samples = alinger.add_tags_on_token(new_labels, 'ner')
# Step 5&6: Add the newly labeled samples to the training data and retrain the model
learner.teach(new_labeled_samples)
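The script leaves `annotate_by_human` to the user. As a rough sketch of what such a function could look like, the version below follows only the input and output formats shown in the comments of the script above. It is hypothetical and not part of SeqAL: the `ask` parameter, the whitespace tokenization, and the "empty answer means O" convention are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def annotate_by_human(
    queried_samples: List[Dict[str, str]],
    ask: Callable[[str], str] = input,
) -> List[Dict[str, List[str]]]:
    """Ask an annotator for one tag per token (hypothetical helper).

    Input items look like {"text": "Tokyo is a city"}.
    Output items look like {"text": [tokens], "labels": [tags]}.
    """
    new_labels = []
    for sample in queried_samples:
        # Assumption: tokens are separated by whitespace.
        tokens = sample["text"].split()
        labels = []
        for token in tokens:
            answer = ask(f"Label for '{token}' (empty = O): ").strip()
            # Assumption: an empty answer means the token is outside any entity.
            labels.append(answer or "O")
        new_labels.append({"text": tokens, "labels": labels})
    return new_labels


# Usage with scripted answers instead of a real annotator at the keyboard.
answers = iter(["B-LOC", "", "", ""])
result = annotate_by_human(
    [{"text": "Tokyo is a city"}],
    ask=lambda prompt: next(answers),
)
```

Passing the prompt function as a parameter keeps the sketch testable; in practice this step would more likely be backed by an annotation tool than by console input.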
Tutorials
We provide a set of quick tutorials to get you started with the library.
- Tutorial 1: Introduction
- Tutorial 2: Prepare Corpus
- Tutorial 3: Active Learner Setup
- Tutorial 4: Prepare Data Pool
- Tutorial 5: Research and Annotation Mode
- Tutorial 6: Query Setup
- Tutorial 7: Annotated Data
- Tutorial 8: Stopper
- Tutorial 9: Output Labeled Data
- Tutorial 10: Performance Recorder
- Tutorial 11: Multiple Language Support
Performance
On the CoNLL 2003 English dataset, active learning algorithms achieve 97% of the performance of the best deep model trained on the full data while using only 30% of the training data. The CPU model greatly reduces the time cost at the expense of a small loss in performance.
See performance for more details on performance and time cost.
Construct environment locally
If you want to make a PR or implement something locally, follow the instructions below to set up the development environment.
We use conda as the environment management tool, so install it first.
Then create an environment named "seqal" from the environment.yml file.
conda env create -f environment.yml
Then we activate the environment.
conda activate seqal
Install poetry for dependency management.
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
Add the poetry path to your shell configuration file (bashrc, zshrc, etc.).
export PATH="$HOME/.poetry/bin:$PATH"
Install the dependencies from pyproject.toml.
poetry install
You can now develop locally.
If you want to delete the local environment, run the command below.
conda remove --name seqal --all
Credits