
SeqAL


SeqAL is a sequence labeling active learning framework based on Flair.

Installation

Install this via pip (or your favourite package manager):

pip install seqal

Usage

Prepare data

The tagging scheme is the IOB scheme.

    U.N. NNP I-ORG
official NN  O
   Ekeus NNP I-PER
   heads VBZ O
     for IN  O
 Baghdad NNP I-LOC
       . .   O

Each line contains three fields: the word, its part-of-speech tag, and its named entity tag. Words tagged with O are outside of named entities.
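The column format above can be read with a few lines of plain Python. This is just an illustrative sketch (SeqAL's own readers handle this for you): each whitespace-separated line yields a `(word, pos_tag, ner_tag)` triple.

```python
# A minimal reader for the column format above (whitespace-separated,
# one token per line); each line yields (word, pos_tag, ner_tag).
sample = """U.N. NNP I-ORG
official NN O
Ekeus NNP I-PER
heads VBZ O
for IN O
Baghdad NNP I-LOC
. . O"""

rows = [tuple(line.split()) for line in sample.splitlines() if line.strip()]
entities = [(word, ner) for word, pos, ner in rows if ner != "O"]
print(entities)  # [('U.N.', 'I-ORG'), ('Ekeus', 'I-PER'), ('Baghdad', 'I-LOC')]
```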

Examples

Because SeqAL is based on Flair, we highly recommend reading the Flair tutorial first.

import json

from flair.embeddings import StackedEmbeddings, WordEmbeddings

from seqal.active_learner import ActiveLearner
from seqal.datasets import ColumnCorpus, ColumnDataset
from seqal.query_strategies import mnlp_sampling

# 1. get the corpus
columns = {0: "text", 1: "pos", 2: "ner"}
data_folder = "../conll"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="seed.data",
    dev_file="dev.data",
    test_file="test.data",
)

First we need to create the corpus. data_folder is the directory path where we store the datasets. seed.data contains NER labels and is usually just a small portion of the training data (around 2% of the total). dev.data and test.data should also contain NER labels, for evaluation. All three files should follow the IOB scheme. If your data has four columns, just change columns to specify which column holds each tag.
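Carving the seed set out of the training data is left to the user. A hypothetical way to do it (this helper is not part of SeqAL; the sizes and the 2% ratio are illustrative):

```python
import random

# Illustrative only (not a SeqAL API): carve a ~2% seed set out of the
# unlabeled training sentences before annotation.
sentences = [f"sentence {i}" for i in range(1000)]
random.seed(42)
random.shuffle(sentences)

seed_size = max(1, int(len(sentences) * 0.02))  # around 2% of the data
seed, pool = sentences[:seed_size], sentences[seed_size:]
# annotate `seed` and save it as seed.data;
# save `pool` unlabeled as pool.data for querying later
```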

# 2. tagger params
tagger_params = {}
tagger_params["tag_type"] = "ner"  # what tag do we want to predict?
tagger_params["hidden_size"] = 256
embedding_types = [WordEmbeddings("glove")]
embeddings = StackedEmbeddings(embeddings=embedding_types)
tagger_params["embeddings"] = embeddings

# 3. Trainer params
trainer_params = {}
trainer_params["max_epochs"] = 10
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.01
trainer_params["train_with_dev"] = True

# 4. initialize learner
learner = ActiveLearner(tagger_params, mnlp_sampling, corpus, trainer_params)

This part is where we set the parameters for the sequence tagger and trainer. The setup above covers most situations. If you want to add more parameters, we recommend reading the SequenceTagger and ModelTrainer documentation in Flair.

# 5. initial training
learner.fit(save_path="output/init_train")

The initial model is trained on the seed data.

# 6. prepare data pool
pool_columns = {0: "text", 1: "pos"}
pool_file = data_folder + "/pool.data"
data_pool = ColumnDataset(pool_file, pool_columns)
sents = data_pool.sentences

Here we prepare the unlabeled data pool.

# 7. query data
query_number = 1
sents, query_samples = learner.query(sents, query_number, token_based=True)

We can query samples from the data pool with the learner.query() method. query_number is the number of sentences to query. If we set token_based=True, query_number instead means the number of tokens to query. For sequence labeling tasks, we usually set token_based=True.

query_samples is a list containing the queried sentences (Flair Sentence objects). sents contains the remaining unqueried sentences.
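The mnlp_sampling strategy used above ranks sentences by Maximum Normalized Log-Probability. A hedged sketch of the idea, with made-up log-probabilities (this is not SeqAL's actual implementation):

```python
# Sketch of the MNLP idea behind mnlp_sampling: sum the token
# log-probabilities and divide by sentence length, so long sentences are
# not unfairly penalized; the lowest (least confident) scores are queried
# first. The numbers below are invented for illustration.
def mnlp(token_log_probs):
    return sum(token_log_probs) / len(token_log_probs)

candidates = {
    "I love Berlin .": [-0.1, -0.2, -2.3, -0.1],          # unsure about "Berlin"
    "This book is great .": [-0.1, -0.1, -0.2, -0.1, -0.1],
}
ranked = sorted(candidates, key=lambda s: mnlp(candidates[s]))
print(ranked[0])  # the least confident sentence is queried first
```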

In [1]: query_samples[0].to_plain_string()
Out[1]: 'I love Berlin .'

We can get the text by calling the to_plain_string() method and pass it to the interface for human annotation.

# 8. obtaining labels for "query_samples" by the human
query_labels = [
      {
        "text": "I love Berlin .",
        "labels": [{"start_pos": 7, "text": "Berlin", "label": "S-LOC"}]
      },
      {
        "text": "This book is great.",
        "labels": []
      }
]


annotated_sents = assign_labels(query_labels)

query_labels is the label information for each sentence after human annotation. We use this information to create Flair Sentence objects by calling the assign_labels() method.

For more detail, see Adding labels to sentences
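The start_pos fields above are character offsets into the sentence text. As a hedged sketch of what label assignment must do under the hood (this helper is hypothetical, not seqal's assign_labels), a character-level span can be mapped back to token positions in the whitespace-tokenized sentence like this:

```python
# Hypothetical helper (not seqal's assign_labels): map a character-level
# span such as start_pos=7, text="Berlin" back to token indices in the
# whitespace-tokenized sentence.
def char_span_to_tokens(text, start_pos, span_text):
    offsets, cursor = [], 0
    for tok in text.split():
        start = text.index(tok, cursor)        # character offset of this token
        offsets.append((start, start + len(tok)))
        cursor = start + len(tok)
    end_pos = start_pos + len(span_text)
    # keep tokens fully inside the [start_pos, end_pos) span
    return [i for i, (s, e) in enumerate(offsets) if s >= start_pos and e <= end_pos]

print(char_span_to_tokens("I love Berlin .", 7, "Berlin"))  # [2]
```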

# 9. retrain model with new labeled data
learner.teach(annotated_sents, save_path="output/retrain")

Finally, we call learner.teach() to retrain the model. The annotated_sents will be added to corpus.train automatically.

If you want to run the workflow in a loop, take a look at the examples folder.
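To show the shape of that loop without requiring seqal, here is a toy, self-contained simulation: the model's confidence is faked with a token count and the human annotator is a lookup table, but the control flow mirrors steps 5-9 above.

```python
# Toy simulation of the active learning loop (seqal not required): the
# "confidence" score and the "annotator" are faked, but the query ->
# annotate -> retrain structure matches the steps above.
pool = ["I love Berlin .", "This book is great .", "Short ."]
gold = {s: f"labels for {s!r}" for s in pool}  # stands in for the annotator
labeled = []

def fake_confidence(sent):
    return len(sent.split())  # pretend longer sentences are more certain

for it in range(2):                               # two query iterations
    pool.sort(key=fake_confidence)                # least confident first
    query_samples, pool = pool[:1], pool[1:]      # query_number = 1
    annotated = [(s, gold[s]) for s in query_samples]  # human annotation
    labeled.extend(annotated)                     # learner.teach(...) would go here
```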

Construct environment locally

If you want to make a PR or implement something locally, you can follow the instructions below to set up the development environment.

We use conda as the environment management tool, so install it first.

Then create an environment named "seqal" from the environment.yml file.

conda env create -f environment.yml

Then we activate the environment.

conda activate seqal

Install poetry for dependency management.

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -

Add the poetry path to your shell configuration file (.bashrc, .zshrc, etc.).

export PATH="$HOME/.poetry/bin:$PATH"

Install the dependencies from pyproject.toml.

poetry install

You can now start development locally.

If you want to delete the local environment, run the command below.

conda remove --name seqal --all

Performance

See performance.md for details.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

This project follows the all-contributors specification. Contributions of any kind welcome!

Credits
