Sequence Tagger for Partially Annotated Dataset in PyTorch

pytorch-partial-tagger

pytorch-partial-tagger is a Python library for building a sequence tagger, specifically for the common NLP task of named entity recognition (NER), from a partially annotated dataset in PyTorch. You can build your own tagger from a distantly supervised dataset obtained from unlabeled text and a dictionary that maps surface names to their entity types. The algorithm in this library is based on Effland and Collins (2021).

Usage

Import all dependencies first:

import torch
from sequence_label import SequenceLabel

from partial_tagger.metric import Metric
from partial_tagger.utils import create_trainer

Prepare your own datasets. Each item in a dataset must be a pair of a string and a sequence label. The string is the text you want to label (text below). The sequence label (label below) is a set of character-based tags, each defined by a start, an end, and a label. The start is the character offset in the text where the tag begins, the end is the offset just past its last character (so text[0:5] is "Tokyo" in the example), and the label is the entity type you want to assign to that span.

text = "Tokyo is the capital of Japan."
label = SequenceLabel.from_dict(
    tags=[
        {"start": 0, "end": 5, "label": "LOC"},  # Tag for Tokyo
        {"start": 24, "end": 29, "label": "LOC"},  # Tag for Japan
    ],
    size=len(text),
)

train_dataset = [(text, label), ...]
validation_dataset = [...]
test_dataset = [...]
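
The distantly supervised tags mentioned in the introduction can be produced by matching dictionary entries against raw text. The following is only a minimal sketch of that idea, not part of pytorch-partial-tagger: the helper find_dictionary_tags and the toy dictionary are hypothetical names introduced for illustration, and exact string matching is an assumption (a real pipeline may need tokenization or normalization).

```python
import re


def find_dictionary_tags(text, dictionary):
    """Return character-based tags for dictionary surface names found in text.

    Hypothetical helper for illustration; it is not part of the library.
    """
    tags = []
    # Try longer surface names first so that a longer match wins over a
    # shorter one that overlaps it.
    for surface in sorted(dictionary, key=len, reverse=True):
        for match in re.finditer(re.escape(surface), text):
            start, end = match.span()  # end is exclusive
            # Skip a match that overlaps a span that is already tagged.
            if any(start < t["end"] and t["start"] < end for t in tags):
                continue
            tags.append({"start": start, "end": end, "label": dictionary[surface]})
    return sorted(tags, key=lambda t: t["start"])


dictionary = {"Tokyo": "LOC", "Japan": "LOC"}  # toy dictionary (assumption)
text = "Tokyo is the capital of Japan."
tags = find_dictionary_tags(text, dictionary)

# The resulting list has the same shape as the tags argument shown above,
# so it could be passed on as SequenceLabel.from_dict(tags=tags, size=len(text)).
for tag in tags:
    print(tag, repr(text[tag["start"]:tag["end"]]))
```

Entities that are missing from the dictionary simply receive no tag, which is what makes the resulting dataset partially annotated.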

Next, train your tagger and evaluate its performance. Training goes through an instance of Trainer, which you obtain by calling create_trainer. Training returns an instance of Recognizer, which predicts character-based tags for the given texts. Evaluate the tagger with an instance of Metric as follows.

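# Use a GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")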

trainer = create_trainer()
recognizer = trainer(train_dataset, validation_dataset, device)

texts, ground_truths = zip(*test_dataset)

batch_size = 15
predictions = recognizer(texts, batch_size, device)

metric = Metric()
metric(predictions, ground_truths)

print(metric.get_scores())  # Display F1-score, Precision, Recall
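
To make the reported scores concrete, here is a rough sketch of how span-level precision, recall, and F1 are commonly computed for NER, counting a predicted tag as correct only when its start, end, and label all match a ground-truth tag. This is an assumption about the general technique, not the implementation of Metric; the function span_scores, the tuple representation, and the result keys are hypothetical.

```python
def span_scores(predictions, ground_truths):
    """Micro-averaged exact-match scores over (start, end, label) tuples.

    Hypothetical illustration; Metric in the library may compute or name
    its scores differently.
    """
    true_positives = num_predicted = num_true = 0
    for predicted, true in zip(predictions, ground_truths):
        predicted_set, true_set = set(predicted), set(true)
        true_positives += len(predicted_set & true_set)
        num_predicted += len(predicted_set)
        num_true += len(true_set)

    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_true if num_true else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"f1_score": f1, "precision": precision, "recall": recall}


# One text: the tagger found "Tokyo" but missed "Japan".
example_predictions = [[(0, 5, "LOC")]]
example_ground_truths = [[(0, 5, "LOC"), (24, 29, "LOC")]]
scores = span_scores(example_predictions, example_ground_truths)
print(scores)
```

In this toy case every predicted tag is correct (precision 1.0) but only one of the two gold tags is recovered (recall 0.5).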

Installation

pip install pytorch-partial-tagger

Documentation

For details about the pytorch-partial-tagger API, see the documentation.

References
