Sequence labeling active learning framework for Python
SeqAL
SeqAL is a sequence labeling active learning framework based on Flair.
Installation
Install this via pip (or your favourite package manager):
pip install seqal
Usage
Prepare data
The tagging scheme is the IOB scheme.
U.N. NNP I-ORG
official NN O
Ekeus NNP I-PER
heads VBZ O
for IN O
Baghdad NNP I-LOC
. . O
Each line contains three fields: the word, its part-of-speech tag, and its named entity tag. Words tagged with O are outside of named entities.
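As an illustration, the column format above can be parsed with a few lines of plain Python (a hypothetical helper for demonstration, not part of SeqAL):

```python
# Hypothetical helper (not part of SeqAL): parse IOB-formatted lines
# into (word, pos, ner) tuples, skipping blank sentence separators.
def parse_conll(text):
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate sentences
        word, pos, ner = line.split()
        rows.append((word, pos, ner))
    return rows

sample = """U.N. NNP I-ORG
official NN O
Ekeus NNP I-PER"""
print(parse_conll(sample)[0])  # ('U.N.', 'NNP', 'I-ORG')
```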
Examples
Because SeqAL is based on Flair, we strongly recommend reading the Flair tutorials first.
import json
from flair.embeddings import StackedEmbeddings, WordEmbeddings
from seqal.active_learner import ActiveLearner
from seqal.datasets import ColumnCorpus, ColumnDataset
from seqal.query_strategies import mnlp_sampling
# 1. get the corpus
columns = {0: "text", 1: "pos", 2: "ner"}
data_folder = "../conll"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="seed.data",
    dev_file="dev.data",
    test_file="test.data",
)
First we need to create the corpus. data_folder is the directory path where the datasets are stored. seed.data contains NER labels and is usually only a small portion of the data (around 2% of the total training data). dev.data and test.data should also contain NER labels for evaluation. All three datasets should follow the IOB scheme. If your data has four columns, simply change columns to specify which column holds each tag.
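For example, if the data follows the four-column CoNLL-2003 layout (word, POS tag, syntactic chunk tag, NER tag), the mapping could look like this (the "chunk" name here is illustrative):

```python
# Four-column CoNLL-2003-style layout: word, POS tag, chunk tag, NER tag.
# Keys are column indices in the data file; values are the tag names
# that Flair will associate with each column.
columns = {0: "text", 1: "pos", 2: "chunk", 3: "ner"}
```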
# 2. tagger params
tagger_params = {}
tagger_params["tag_type"] = "ner" # what tag do we want to predict?
tagger_params["hidden_size"] = 256
embedding_types = [WordEmbeddings("glove")]
embeddings = StackedEmbeddings(embeddings=embedding_types)
tagger_params["embeddings"] = embeddings
# 3. Trainer params
trainer_params = {}
trainer_params["max_epochs"] = 10
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.01
trainer_params["train_with_dev"] = True
# 4. initialize learner
learner = ActiveLearner(tagger_params, mnlp_sampling, corpus, trainer_params)
This part is where we set the parameters for the sequence tagger and the trainer. The setup above covers most situations. If you want to add more parameters, we recommend reading the SequenceTagger and ModelTrainer documentation in Flair.
# 5. initial training
learner.fit(save_path="output/init_train")
The initial model is trained on the seed data.
# 6. prepare data pool
pool_columns = {0: "text", 1: "pos"}
pool_file = data_folder + "/pool.data"
data_pool = ColumnDataset(pool_file, pool_columns)
sents = data_pool.sentences
Here we prepare the unlabeled data pool.
# 7. query data
query_number = 1
sents, query_samples = learner.query(sents, query_number, token_based=True)
We can query samples from the data pool with the learner.query() method. query_number specifies how many sentences to query. If we set token_based=True, query_number instead specifies how many tokens to query. For sequence labeling tasks, we usually set token_based=True.

query_samples is a list of the queried sentences (Flair Sentence objects). sents contains the remaining unqueried sentences.
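To illustrate what a token-based budget means (a hypothetical sketch, not SeqAL's internal implementation): sentences are selected, in ranked order, until the cumulative number of tokens reaches query_number.

```python
# Hypothetical sketch of token-based budgeting (not SeqAL's actual code).
# Given sentences already ranked by informativeness, keep taking sentences
# until the cumulative token count reaches the budget.
def select_by_token_budget(ranked_sentences, token_budget):
    selected, total = [], 0
    for tokens in ranked_sentences:
        if total >= token_budget:
            break
        selected.append(tokens)
        total += len(tokens)
    return selected

ranked = [["I", "love", "Berlin", "."], ["This", "book", "is", "great", "."]]
picked = select_by_token_budget(ranked, token_budget=4)
print(len(picked))  # 1: the first sentence already meets the 4-token budget
```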
In [1]: query_samples[0].to_plain_string()
Out[1]: 'I love Berlin .'
We can get the text by calling the to_plain_string() method and pass it to the interface for human annotation.
# 8. obtaining labels for "query_samples" by the human
query_labels = [
{
"text": "I love Berlin .",
"labels": [{"start_pos": 7, "text": "Berlin", "label": "S-LOC"}]
},
{
"text": "This book is great.",
"labels": []
}
]
annotated_sents = assign_labels(query_labels)
query_labels holds the label information for each sentence after human annotation. We use this information to create Flair Sentence objects by calling the assign_labels() method.

For more detail, see Adding labels to sentences.
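Before handing the labels over, it can help to sanity-check that each start_pos/text pair actually matches the sentence it annotates (a hypothetical validation helper, not part of SeqAL):

```python
# Hypothetical validation helper (not part of SeqAL): verify that each
# labeled span's start_pos and text agree with the sentence it annotates.
def spans_are_consistent(query_labels):
    for item in query_labels:
        text = item["text"]
        for span in item["labels"]:
            start = span["start_pos"]
            if text[start:start + len(span["text"])] != span["text"]:
                return False
    return True

query_labels = [
    {"text": "I love Berlin .",
     "labels": [{"start_pos": 7, "text": "Berlin", "label": "S-LOC"}]},
    {"text": "This book is great.", "labels": []},
]
print(spans_are_consistent(query_labels))  # True
```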
# 9. retrain model with new labeled data
learner.teach(annotated_sents, save_path="output/retrain")
Finally, we call learner.teach() to retrain the model. The annotated_sents will be added to corpus.train automatically.

If you want to run the workflow in a loop, take a look at the examples folder.
Construct environment locally
If you want to make a PR or implement something locally, you can follow the instructions below to set up the development environment.
We use conda as the environment management tool, so install it first. Then create the "seqal" environment from the environment.yml file.
conda env create -f environment.yml
Then we activate the environment.
conda activate seqal
Install poetry for dependency management.
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
Add poetry path in your shell configure file (bashrc
, zshrc
, etc.)
export PATH="$HOME/.poetry/bin:$PATH"
Install the dependencies from pyproject.toml:
poetry install
You can now start developing locally.
If you want to delete the local environment, run the command below:
conda remove --name seqal --all
Performance
See performance.md for detail.
Contributors ✨
Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!
Credits