
Sequence labeling active learning framework for Python

Project description

SeqAL


SeqAL is a sequence labeling active learning framework based on Flair.

Installation

SeqAL is available on PyPI:

pip install seqal

SeqAL officially supports Python 3.8+.

Usage

To understand what SeqAL can do, we first introduce the pool-based active learning cycle.

(Figure: the pool-based active learning cycle)

  • Step 0: Prepare seed data (a small amount of labeled data used for initial training)
  • Step 1: Train the model on the seed data
  • Step 2: Predict on the unlabeled data with the trained model
  • Step 3: Query informative samples based on the predictions
  • Step 4: An annotator (oracle) labels the selected samples
  • Step 5: Add the newly labeled samples to the labeled dataset
  • Step 6: Retrain the model
  • Repeat steps 2~6 until the model's F1 score exceeds the threshold or the annotation budget is exhausted
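What makes a sample "informative" in step 3 depends on the sampler. As a rough illustration of the idea behind least-confidence sampling (a simplified sketch, not SeqAL's internal implementation; the `log_prob` and `length` fields are assumed inputs), each sentence is scored by how unsure the model is about its most likely label sequence, and the highest-scoring sentences are queried first:

```python
import math

def least_confidence(sentence_log_prob: float, length: int) -> float:
    """Least-confidence score: 1 - p(y_hat | x), length-normalized.

    sentence_log_prob: log-probability the model assigns to its best
    label sequence for the sentence. Length normalization avoids a
    bias toward long sentences, which naturally have lower probability.
    """
    normalized = sentence_log_prob / max(length, 1)
    return 1.0 - math.exp(normalized)

def query(pool, k):
    """Pick the k sentences the model is least confident about."""
    ranked = sorted(
        pool,
        key=lambda s: least_confidence(s["log_prob"], s["length"]),
        reverse=True,
    )
    return ranked[:k]

pool = [
    {"id": 0, "log_prob": -0.1, "length": 5},  # model is confident
    {"id": 1, "log_prob": -3.0, "length": 5},  # model is unsure
    {"id": 2, "log_prob": -1.0, "length": 5},
]
queried = query(pool, 2)  # the two least-confident sentences: ids 1 and 2
```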

SeqAL covers every step except step 0 and step 4. Since no third-party annotation tool is attached, we can run the script below to simulate the active learning cycle.

from flair.embeddings import WordEmbeddings

from seqal.active_learner import ActiveLearner
from seqal.datasets import ColumnCorpus, ColumnDataset
from seqal.samplers import LeastConfidenceSampler

# 1. get the corpus
columns = {0: "text", 1: "ner"}
data_folder = "./data/sample_bio"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train_seed.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)

# 2. tagger params
tagger_params = {}
tagger_params["tag_type"] = "ner"
tagger_params["hidden_size"] = 256
embeddings = WordEmbeddings("glove")
tagger_params["embeddings"] = embeddings
tagger_params["use_rnn"] = False

# 3. trainer params
trainer_params = {}
trainer_params["max_epochs"] = 1
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.1
trainer_params["patience"] = 5

# 4. setup active learner
sampler = LeastConfidenceSampler()
learner = ActiveLearner(corpus, sampler, tagger_params, trainer_params)

# 5. initialize active learner
learner.initialize(dir_path="output/init_train")

# 6. prepare data pool
pool_file = data_folder + "/labeled_data_pool.txt"
data_pool = ColumnDataset(pool_file, columns)
unlabeled_sentences = data_pool.sentences

# 7. query setup
query_number = 2
token_based = False
iterations = 5

# 8. iteration
for i in range(iterations):
    # 9. query unlabeled sentences
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=token_based, research_mode=True
    )

    # 10. retrain model, the queried_samples will be added to corpus.train
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")

When calling learner.query(), we set research_mode=True, which means we simulate the active learning cycle. You can also find this script in examples/active_learning_cycle_research_mode.py. If you want to connect SeqAL with an annotation tool, see the script in examples/active_learning_cycle_annotation_mode.py.
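The corpus files referenced above (train_seed.txt, dev.txt, test.txt, labeled_data_pool.txt) follow the two-column layout implied by columns = {0: "text", 1: "ner"}: one token and its NER tag per line, with a blank line separating sentences. A minimal sketch of parsing that format (the tokens and tags here are illustrative, not taken from the sample data):

```python
sample = """\
George B-PER
Washington I-PER
went O
to O
Washington B-LOC

He O
returned O
"""

def parse_bio(text):
    """Split two-column tagged data into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

sentences = parse_bio(sample)  # two sentences: 5 tokens, then 2 tokens
```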

Tutorials

We provide a set of quick tutorials to get you started with the library.

Performance

On the CoNLL 2003 English dataset, active learning reaches 97% of the performance of the best deep model trained on the full data while using only 30% of the training data. Running the model on CPU greatly reduces the time cost while sacrificing only a little performance.

See performance for more detail about performance and time cost.

Contributing

If you have suggestions for how SeqAL could be improved, or want to report a bug, open an issue! We'd love any and all contributions.

For more, check out the Contributing Guide.

Credits
