Sequence labeling active learning framework for Python
SeqAL
SeqAL is a sequence labeling active learning framework based on Flair.
Installation
SeqAL is available on PyPI:
```
pip install seqal
```
SeqAL officially supports Python 3.8+.
Usage
To understand what SeqAL can do, we first introduce the pool-based active learning cycle.
- Step 0: Prepare seed data (a small number of labeled data used for training)
- Step 1: Train the model with seed data
- Step 2: Predict unlabeled data with the trained model
- Step 3: Query informative samples based on predictions
- Step 4: The annotator (oracle) labels the queried samples
- Step 5: Add the newly labeled samples to the labeled dataset
- Step 6: Retrain the model
- Repeat steps 2~6 until the model's F1 score exceeds the threshold or the annotation budget is exhausted
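The cycle above can be sketched as a minimal simulation. All names here (`train`, `predict_confidence`, and so on) are illustrative stand-ins, not SeqAL's API:

```python
import random

# Step 0: seed data and an unlabeled pool (plain ints stand in for samples)
labeled = list(range(10))
pool = list(range(10, 30))
budget = 3          # number of annotation rounds we can afford
query_number = 5    # samples queried per round

def train(labeled):
    # stand-in for model training: returns a dummy "model"
    return len(labeled)

def predict_confidence(model, sample):
    # stand-in for model predictions: pseudo-random confidence per sample
    return random.Random(sample).random()

for _ in range(budget):
    model = train(labeled)                              # steps 1/6: (re)train
    scored = sorted(pool, key=lambda s: predict_confidence(model, s))
    queried, pool = scored[:query_number], scored[query_number:]  # steps 2-3
    labeled.extend(queried)                             # steps 4-5: oracle labels
```

Each round moves the `query_number` least-confident samples from the pool into the labeled set, which is exactly the bookkeeping SeqAL automates below.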
SeqAL covers every step except step 0 and step 4. Since no third-party annotation tool is attached here, we can run the script below to simulate the active learning cycle.
```python
from flair.embeddings import WordEmbeddings

from seqal.active_learner import ActiveLearner
from seqal.datasets import ColumnCorpus, ColumnDataset
from seqal.samplers import LeastConfidenceSampler

# 1. get the corpus
columns = {0: "text", 1: "ner"}
data_folder = "./data/sample_bio"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="train_seed.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)

# 2. tagger params
tagger_params = {}
tagger_params["tag_type"] = "ner"
tagger_params["hidden_size"] = 256
embeddings = WordEmbeddings("glove")
tagger_params["embeddings"] = embeddings
tagger_params["use_rnn"] = False

# 3. trainer params
trainer_params = {}
trainer_params["max_epochs"] = 1
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.1
trainer_params["patience"] = 5

# 4. setup active learner
sampler = LeastConfidenceSampler()
learner = ActiveLearner(corpus, sampler, tagger_params, trainer_params)

# 5. initialize active learner
learner.initialize(dir_path="output/init_train")

# 6. prepare data pool
pool_file = data_folder + "/labeled_data_pool.txt"
data_pool = ColumnDataset(pool_file, columns)
unlabeled_sentences = data_pool.sentences

# 7. query setup
query_number = 2
token_based = False
iterations = 5

# 8. iteration
for i in range(iterations):
    # 9. query unlabeled sentences
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=token_based, research_mode=True
    )

    # 10. retrain model; queried_samples are added to corpus.train
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")
```
When calling learner.query(), we set research_mode=True, which means the active learning cycle is simulated: the data pool already contains gold labels, so no human annotation is needed. You can also find this script in examples/active_learning_cycle_research_mode.py. If you want to connect SeqAL with an annotation tool, see the script in examples/active_learning_cycle_annotation_mode.py.
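In annotation mode, the queried samples must instead go out to a human annotator. A minimal, hypothetical export helper is sketched below; it is not part of SeqAL, and it only assumes that each queried sample is iterable over tokens with a `.text` attribute (as Flair Sentence objects are):

```python
# Hypothetical helper: write queried sentences one token per line, with a
# blank line between sentences, so a CoNLL-style annotation tool can add
# the tag column. `queried_samples` is assumed to be an iterable of
# sentences, each iterable over tokens that have a `.text` attribute.
def export_for_annotation(queried_samples, path):
    with open(path, "w", encoding="utf-8") as f:
        for sentence in queried_samples:
            for token in sentence:
                f.write(token.text + "\n")  # annotator fills in the tag column
            f.write("\n")                   # blank line separates sentences
```

Once the annotator returns the labeled file, it can be read back with ColumnDataset and passed to learner.teach() as in the research-mode script.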
Tutorials
We provide a set of quick tutorials to get you started with the library.
- Tutorials on Github Page
- Tutorials on Markdown
- Tutorial 1: Introduction
- Tutorial 2: Prepare Corpus
- Tutorial 3: Active Learner Setup
- Tutorial 4: Prepare Data Pool
- Tutorial 5: Research and Annotation Mode
- Tutorial 6: Query Setup
- Tutorial 7: Annotated Data
- Tutorial 8: Stopper
- Tutorial 9: Output Labeled Data
- Tutorial 10: Performance Recorder
- Tutorial 11: Multiple Language Support
Performance
On the CoNLL 2003 English dataset, the active learning setup reaches 97% of the performance of the best deep model trained on the full data while using only 30% of the training data. The CPU model greatly reduces the time cost while sacrificing only a little performance.
See the performance page for more details on performance and time cost.
Contributing
If you have suggestions for how SeqAL could be improved, or want to report a bug, open an issue! We'd love any and all contributions.
For more, check out the Contributing Guide.
Credits
File details
Details for the file seqal-0.3.4.tar.gz.
File metadata
- Download URL: seqal-0.3.4.tar.gz
- Upload date:
- Size: 24.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.13 CPython/3.8.13 Darwin/19.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 30d6bb4f410bb0199efa38ec6f238f9c1936516c8756ee84b35d6140c22b88b3 |
| MD5 | f908148c6112720093b13b00cc3c1e96 |
| BLAKE2b-256 | 6e4420a07a48a2ada22c58d895384104ece551f3043c7da2b8d1242e0f48f260 |
File details
Details for the file seqal-0.3.4-py3-none-any.whl.
File metadata
- Download URL: seqal-0.3.4-py3-none-any.whl
- Upload date:
- Size: 24.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.13 CPython/3.8.13 Darwin/19.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7bed2337c1c93e3fa54f0d496a5b203967a44dc2b8e21613ebaf4c16c317ca83 |
| MD5 | 0b2f5150097273c4b209d73898d76c08 |
| BLAKE2b-256 | 016fcc011561423103e69dcc991f89154589f2399732bc0b3d595d71a4d1145f |