Sequence labeling active learning framework for Python
SeqAL
SeqAL is a sequence labeling active learning framework based on Flair.
Installation
SeqAL is available on PyPI:
pip install seqal
SeqAL officially supports Python 3.8+.
Usage
To understand what SeqAL can do, we first introduce the pool-based active learning cycle.
- Step 0: Prepare seed data (a small amount of labeled data used for training)
- Step 1: Train the model with seed data
- Step 2: Predict unlabeled data with the trained model
- Step 3: Query informative samples based on predictions
- Step 4: Annotator (oracle) annotates the selected samples
- Step 5: Add the newly labeled samples to the labeled dataset
- Step 6: Retrain model
- Repeat steps 2 to 6 until the model's F1 score exceeds the threshold or the annotation budget is exhausted
SeqAL covers all steps except step 0 and step 4. Since no third-party annotation tool is connected, we can run the script below to simulate the active learning cycle.
from flair.embeddings import WordEmbeddings
from seqal.active_learner import ActiveLearner
from seqal.datasets import ColumnCorpus, ColumnDataset
from seqal.samplers import LeastConfidenceSampler
# 1. get the corpus
columns = {0: "text", 1: "ner"}
data_folder = "./data/sample_bio"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train_seed.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)
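# Each file above is expected in CoNLL-style column format matching `columns`:
# one token and its tag per line, sentences separated by blank lines.
# An illustrative example:
#
#   EU         B-ORG
#   rejects    O
#   German     B-MISC
#   call       O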
# 2. tagger params
tagger_params = {}
tagger_params["tag_type"] = "ner"
tagger_params["hidden_size"] = 256
embeddings = WordEmbeddings("glove")
tagger_params["embeddings"] = embeddings
tagger_params["use_rnn"] = False
# 3. trainer params
trainer_params = {}
trainer_params["max_epochs"] = 1
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.1
trainer_params["patience"] = 5
# 4. setup active learner
sampler = LeastConfidenceSampler()
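# LeastConfidenceSampler queries the sentences the model is least confident
# about; seqal.samplers also provides other query strategies (e.g.
# RandomSampler, MaxNormLogProbSampler); check the package for the full list.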
learner = ActiveLearner(corpus, sampler, tagger_params, trainer_params)
# 5. initialize active learner
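# This trains an initial tagger on the seed data (corpus.train) and
# saves the training output under dir_path.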
learner.initialize(dir_path="output/init_train")
# 6. prepare data pool
pool_file = data_folder + "/labeled_data_pool.txt"
data_pool = ColumnDataset(pool_file, columns)
unlabeled_sentences = data_pool.sentences
# 7. query setup
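# With token_based=False, query_number is the number of sentences queried
# per iteration; with token_based=True it would be counted in tokens.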
query_number = 2
token_based = False
iterations = 5
# 8. iteration
for i in range(iterations):
    # 9. query unlabeled sentences
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=token_based, research_mode=True
    )

    # 10. retrain model; the queried samples are added to corpus.train
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")
When calling learner.query(), we set research_mode=True, which means we simulate the active learning cycle. You can also find this script in examples/active_learning_cycle_research_mode.py. If you want to connect SeqAL with an annotation tool, see the script in examples/active_learning_cycle_annotation_mode.py.
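In annotation mode the loop keeps the same shape, but the queried samples carry no gold labels, so they must be annotated externally before retraining. A minimal sketch of that loop, where send_to_annotation_tool is a hypothetical placeholder for your own tool integration:

for i in range(iterations):
    # research_mode=False: treat the data pool as truly unlabeled
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=token_based, research_mode=False
    )
    # Hypothetical helper: export the queried samples to your annotation
    # tool and read the annotated sentences back
    annotated_samples = send_to_annotation_tool(queried_samples)
    learner.teach(annotated_samples, dir_path=f"output/retrain_{i}")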
Tutorials
We provide a set of quick tutorials to get you started with the library.
- Tutorials on Github Page
- Tutorials on Markdown
- Tutorial 1: Introduction
- Tutorial 2: Prepare Corpus
- Tutorial 3: Active Learner Setup
- Tutorial 4: Prepare Data Pool
- Tutorial 5: Research and Annotation Mode
- Tutorial 6: Query Setup
- Tutorial 7: Annotated Data
- Tutorial 8: Stopper
- Tutorial 9: Output Labeled Data
- Tutorial 10: Performance Recorder
- Tutorial 11: Multiple Language Support
Performance
On the CoNLL 2003 English dataset, the active learning algorithms reach 97% of the performance of the best deep model trained on the full data while using only 30% of the training data. The CPU model greatly reduces time cost while sacrificing only a little performance.
See performance for more details on performance and time cost.
Contributing
If you have suggestions for how SeqAL could be improved, or want to report a bug, open an issue! We'd love any and all contributions.
For more, check out the Contributing Guide.
Credits