Sequence Tagger for Partially Annotated Dataset in PyTorch
pytorch-partial-tagger
pytorch-partial-tagger is a Python library for building a sequence tagger, specifically for the common NLP task of named entity recognition (NER), from a partially annotated dataset in PyTorch.
You can build your own tagger using a distantly supervised dataset obtained from unlabeled text data and a dictionary that maps surface names to their entity types.
The algorithm of this library is based on Effland and Collins (2021).
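As an illustration of the distant supervision described above, here is a minimal sketch of turning a dictionary of surface names into character-based tags. It is not part of pytorch-partial-tagger: the helper name tag_with_dictionary, the whole-word matching, and the longest-match-first rule are all assumptions made for this example.

```python
import re


def tag_with_dictionary(text, dictionary):
    """Return character-based tags for dictionary surface names found in text.

    Hypothetical helper for illustration only; not part of the library.
    """
    candidates = []
    for surface, label in dictionary.items():
        # Match whole words only, so "Japan" does not fire inside "Japanese".
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text):
            candidates.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )

    # Prefer longer matches, then drop any match overlapping one already kept.
    candidates.sort(key=lambda tag: (-(tag["end"] - tag["start"]), tag["start"]))
    kept = []
    for tag in candidates:
        if all(tag["end"] <= k["start"] or k["end"] <= tag["start"] for k in kept):
            kept.append(tag)
    return sorted(kept, key=lambda tag: tag["start"])


dictionary = {"Tokyo": "LOC", "Japan": "LOC"}
text = "Tokyo is the capital of Japan."
tags = tag_with_dictionary(text, dictionary)
print(tags)
```

Any entity missing from the dictionary stays untagged, which is why the resulting dataset is only partially annotated.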
Usage
Import all dependencies first:
import torch
from sequence_label import SequenceLabel
from partial_tagger.metric import Metric
from partial_tagger.utils import create_trainer
Prepare your own datasets. Each item of a dataset must be a pair of a string and a sequence label.
A string represents the text that you want to label, defined as text below.
A sequence label represents a set of character-based tags, defined as label below. Each tag has a start, a length, and a label.
A start represents the position in the text where a tag starts.
A length represents the distance in the text between the beginning and the end of a tag. The example below passes an end offset instead; the end is exclusive, so the length is end - start.
A label represents what you want to assign to the span of text defined by a start and a length.
text = "Tokyo is the capital of Japan."
label = SequenceLabel.from_dict(
    tags=[
        {"start": 0, "end": 5, "label": "LOC"},  # Tag for Tokyo
        {"start": 24, "end": 29, "label": "LOC"},  # Tag for Japan
    ],
    size=len(text),
)
train_dataset = [(text, label), ...]
validation_dataset = [...]
test_dataset = [...]
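Off-by-one offsets are easy to introduce when writing tags by hand, so it can help to check them before building a dataset. The following is a minimal sketch in plain Python over the same tag dictionaries used above. The helper check_tags is hypothetical, and its rejection of overlapping tags is an assumption of this sketch rather than a documented requirement of the library.

```python
def check_tags(text, tags):
    """Return (surface, length, label) per tag, raising ValueError on bad offsets.

    Hypothetical helper for illustration only; not part of the library.
    """
    surfaces = []
    previous_end = 0
    for tag in sorted(tags, key=lambda t: t["start"]):
        start, end = tag["start"], tag["end"]
        # The end offset is exclusive, so it may equal len(text).
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"tag {tag} is out of bounds for text of length {len(text)}")
        # Assumption of this sketch: tags must not overlap each other.
        if start < previous_end:
            raise ValueError(f"tag {tag} overlaps the previous tag")
        surfaces.append((text[start:end], end - start, tag["label"]))
        previous_end = end
    return surfaces


text = "Tokyo is the capital of Japan."
tags = [
    {"start": 0, "end": 5, "label": "LOC"},
    {"start": 24, "end": 29, "label": "LOC"},
]
surfaces = check_tags(text, tags)
print(surfaces)
```

Slicing with text[start:end] recovers the surface string of each tag, and end - start gives its length.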
Next, train your tagger and evaluate its performance.
Training runs through an instance of Trainer, which you get by calling create_trainer.
After training, you get an instance of Recognizer, which predicts character-based tags from given texts.
Evaluate the performance of your tagger using an instance of Metric as follows.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Fall back to CPU when no GPU is available
trainer = create_trainer()
recognizer = trainer(train_dataset, validation_dataset, device)
texts, ground_truths = zip(*test_dataset)
batch_size = 15
predictions = recognizer(texts, batch_size, device)
metric = Metric()
metric(predictions, ground_truths)
print(metric.get_scores()) # Display F1-score, Precision, Recall
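To make the reported numbers concrete, here is a minimal sketch of how span-level precision, recall, and F1-score are commonly computed for NER. It is not the implementation of Metric: the function span_scores, the exact-match criterion on (start, end, label), the micro-averaging, and the returned key names are all assumptions made for this example.

```python
def span_scores(predictions, ground_truths):
    """Micro-averaged precision, recall and F1 over exact-match spans.

    Hypothetical helper for illustration only; not the library's Metric.
    """
    true_positives = predicted = gold = 0
    for pred_tags, gold_tags in zip(predictions, ground_truths):
        pred_set = {(t["start"], t["end"], t["label"]) for t in pred_tags}
        gold_set = {(t["start"], t["end"], t["label"]) for t in gold_tags}
        # A prediction counts only if start, end and label all match.
        true_positives += len(pred_set & gold_set)
        predicted += len(pred_set)
        gold += len(gold_set)

    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# One text with two gold tags; one prediction is right, one is spurious.
gold_tags = [[
    {"start": 0, "end": 5, "label": "LOC"},
    {"start": 24, "end": 29, "label": "LOC"},
]]
predicted_tags = [[
    {"start": 0, "end": 5, "label": "LOC"},
    {"start": 13, "end": 20, "label": "LOC"},
]]
scores = span_scores(predicted_tags, gold_tags)
print(scores)
```

Under exact matching, a prediction with the right label but a shifted boundary counts as both a false positive and a false negative.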
Installation
pip install pytorch-partial-tagger
Documentation
For details about the pytorch-partial-tagger API, see the documentation.
References
- Thomas Effland and Michael Collins. 2021. Partially Supervised Named Entity Recognition via the Expected Entity Ratio Loss. Transactions of the Association for Computational Linguistics, 9:1320–1335.