Sequence Tagger for Partially Annotated Dataset in PyTorch
pytorch-partial-tagger
pytorch-partial-tagger is a Python library for building a sequence tagger, specifically for the common NLP task of Named Entity Recognition, with a partially annotated dataset in PyTorch.
You can build your own tagger using a distantly supervised dataset obtained from unlabeled text data and a dictionary that maps surface names to their entity types.
The algorithm of this library is based on Effland and Collins (2021).
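For illustration, here is a minimal sketch of how partial annotations might be derived from such a dictionary by exact string matching. The dictionary contents and the find_tags helper are assumptions made for this example, not part of the library's API.

import re

# A toy dictionary mapping surface names to entity types (an assumption for this sketch).
dictionary = {"Tokyo": "LOC", "Japan": "LOC"}

def find_tags(text: str, dictionary: dict[str, str]) -> list[dict]:
    # Collect a character-based tag for every exact dictionary match in the text.
    tags = []
    for surface, entity_type in dictionary.items():
        for match in re.finditer(re.escape(surface), text):
            tags.append({"start": match.start(), "end": match.end(), "label": entity_type})
    return tags

tags = find_tags("Tokyo is the capital of Japan.", dictionary)
# [{'start': 0, 'end': 5, 'label': 'LOC'}, {'start': 24, 'end': 29, 'label': 'LOC'}]

Tag dictionaries produced this way can then be passed to SequenceLabel.from_dict, as shown in the Usage section below.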
Usage
Import all dependencies first:
import torch
from sequence_label import SequenceLabel
from partial_tagger.metric import Metric
from partial_tagger.utils import create_trainer
Prepare your own datasets. Each item of a dataset must be a pair of a string and a sequence label.
A string represents the text to which you want to assign labels, which is defined as text below.
A sequence label represents a set of character-based tags, each of which has a start, a length, and a label; it is defined as label below.
A start represents the position in the text where a tag starts.
A length represents the distance in the text between the beginning of a tag and the end of a tag.
A label represents what you want to assign to the span of the text defined by a start and a length.
text = "Tokyo is the capital of Japan."
label = SequenceLabel.from_dict(
tags=[
{"start": 0, "end": 5, "label": "LOC"}, # Tag for Tokyo
{"start": 24, "end": 29, "label": "LOC"}, # Tag for Japan
],
size=len(text),
)
train_dataset = [(text, label), ...]
validation_dataset = [...]
test_dataset = [...]
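As a quick sanity check on the character offsets (this snippet is illustrative, not part of the library), each tag's start and end slice out the surface string it marks:

text = "Tokyo is the capital of Japan."
assert text[0:5] == "Tokyo"    # the first tag covers "Tokyo"
assert text[24:29] == "Japan"  # the second tag covers "Japan"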
Here, you will train your tagger and evaluate its performance.
You will train it through an instance of Trainer, which you get by calling create_trainer.
After training, you will get an instance of Recognizer, which predicts character-based tags from given texts.
You will evaluate the performance of your tagger using an instance of Metric as follows.
device = torch.device("cuda")
trainer = create_trainer()
recognizer = trainer(train_dataset, validation_dataset, device)
texts, ground_truths = zip(*test_dataset)
batch_size = 15
predictions = recognizer(texts, batch_size, device)
metric = Metric()
metric(predictions, ground_truths)
print(metric.get_scores()) # Display F1-score, Precision, Recall
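If CUDA is not available on your machine, you can select the device conditionally before training. This is a standard PyTorch idiom, not specific to this library:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present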
Installation
pip install pytorch-partial-tagger
Documentation
For details about the pytorch-partial-tagger API, see the documentation.
References
- Yuta Tsuboi, Hisashi Kashima, Shinsuke Mori, Hiroki Oda, and Yuji Matsumoto. 2008. Training Conditional Random Fields Using Incomplete Annotations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 897–904, Manchester, UK. Coling 2008 Organizing Committee.
- Alexander Rush. 2020. Torch-Struct: Deep Structured Prediction Library. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 335–342, Online. Association for Computational Linguistics.
- Thomas Effland and Michael Collins. 2021. Partially Supervised Named Entity Recognition via the Expected Entity Ratio Loss. Transactions of the Association for Computational Linguistics, 9:1320–1335.