Sequence Tagger for Partially Annotated Dataset in PyTorch

Project description

pytorch-partial-tagger

pytorch-partial-tagger is a Python library for building a sequence tagger in PyTorch, specifically for named entity recognition (NER), from a partially annotated dataset. You can build your own tagger using a distantly supervised dataset obtained from unlabeled text and a dictionary that maps surface names to their entity types. The algorithm implemented in this library is based on Effland and Collins (2021).

Usage

Import all dependencies first:

import torch
from sequence_label import SequenceLabel

from partial_tagger.metric import Metric
from partial_tagger.utils import create_trainer

Prepare your own datasets. Each item of a dataset must be a pair of a string and a sequence label. The string represents the text you want to label, defined as text below. The sequence label represents a set of character-based tags, each of which has a start, an end, and a label, defined as label below. The start is the position in the text where a tag begins, the end is the position where it stops (exclusive), and the label is what you want to assign to the span of text delimited by the start and the end.

text = "Tokyo is the capital of Japan."
label = SequenceLabel.from_dict(
    tags=[
        {"start": 0, "end": 5, "label": "LOC"},  # Tag for "Tokyo"
        {"start": 24, "end": 29, "label": "LOC"},  # Tag for "Japan"
    ],
    size=len(text),
)

train_dataset = [(text, label), ...]
validation_dataset = [...]
test_dataset = [...]
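To illustrate the distant-supervision setup mentioned above, here is a minimal sketch of generating tag dicts from a dictionary of surface names. It assumes exact string matching only (real pipelines also handle tokenization, overlapping matches, and ambiguous surface names); `dictionary_tags` and `gazetteer` are hypothetical names, not part of the library. The resulting list can be passed as the `tags` argument of `SequenceLabel.from_dict`.

```python
def dictionary_tags(text, gazetteer):
    """Emit a tag dict for every exact occurrence of each surface name.

    Hypothetical helper for illustration: naive exact matching only.
    """
    tags = []
    for surface, entity_type in gazetteer.items():
        start = text.find(surface)
        while start != -1:
            tags.append(
                {"start": start, "end": start + len(surface), "label": entity_type}
            )
            start = text.find(surface, start + 1)
    # Sort by position so tags appear in reading order.
    return sorted(tags, key=lambda tag: tag["start"])


text = "Tokyo is the capital of Japan."
gazetteer = {"Tokyo": "LOC", "Japan": "LOC"}
tags = dictionary_tags(text, gazetteer)
# tags == [{"start": 0, "end": 5, "label": "LOC"},
#          {"start": 24, "end": 29, "label": "LOC"}]
```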

Next, train your tagger and evaluate its performance. You train it through an instance of Trainer, which you obtain by calling create_trainer. After training, you get an instance of Recognizer, which predicts character-based tags for given texts. You then evaluate the performance of your tagger using an instance of Metric, as follows.

device = torch.device("cuda")

trainer = create_trainer()
recognizer = trainer(train_dataset, validation_dataset, device)

texts, ground_truths = zip(*test_dataset)

batch_size = 15
predictions = recognizer(texts, batch_size, device)

metric = Metric()
metric(predictions, ground_truths)

print(metric.get_scores())  # Display F1-score, Precision, Recall
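The internals of Metric are not shown here, but span-level scores of this kind are conventionally computed by comparing predicted and gold tag sets. Below is a minimal sketch under that assumption, with tags given as plain dicts rather than SequenceLabel objects; `span_f1` is a hypothetical helper for illustration, not part of the library's API.

```python
def span_f1(predictions, ground_truths):
    """Micro-averaged span-level precision, recall, and F1.

    Hypothetical helper: a tag counts as correct only if its start,
    end, and label all match a gold tag exactly.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truths):
        pred_set = {(t["start"], t["end"], t["label"]) for t in pred}
        gold_set = {(t["start"], t["end"], t["label"]) for t in gold}
        tp += len(pred_set & gold_set)  # exact matches
        fp += len(pred_set - gold_set)  # spurious predictions
        fn += len(gold_set - pred_set)  # missed gold tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# One predicted tag, two gold tags: perfect precision, half recall.
predictions = [[{"start": 0, "end": 5, "label": "LOC"}]]
ground_truths = [
    [{"start": 0, "end": 5, "label": "LOC"}, {"start": 24, "end": 29, "label": "LOC"}]
]
scores = span_f1(predictions, ground_truths)
```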

Installation

pip install pytorch-partial-tagger

Documentation

For details about the pytorch-partial-tagger API, see the documentation.

References

Thomas Effland and Michael Collins. 2021. Partially Supervised Named Entity Recognition via the Expected Entity Ratio Loss. Transactions of the Association for Computational Linguistics, 9.

Project details


Download files

Download the file for your platform.

Source Distribution

pytorch_partial_tagger-0.1.16.tar.gz (24.3 kB)

Uploaded Source

Built Distribution


pytorch_partial_tagger-0.1.16-py3-none-any.whl (19.6 kB)

Uploaded Python 3

File details

Details for the file pytorch_partial_tagger-0.1.16.tar.gz.

File metadata

  • Download URL: pytorch_partial_tagger-0.1.16.tar.gz
  • Upload date:
  • Size: 24.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for pytorch_partial_tagger-0.1.16.tar.gz
  • SHA256: 601a6a71b976040bed84cd77763da6ad203b3615a0e065c35f385e9d53ee3cc0
  • MD5: df1d258777877105bfdfa86989d79022
  • BLAKE2b-256: 27a1363116563a51e96ab2243046f031e39de89dac706e6a13048992bf1e6e85


File details

Details for the file pytorch_partial_tagger-0.1.16-py3-none-any.whl.

File metadata

File hashes

Hashes for pytorch_partial_tagger-0.1.16-py3-none-any.whl
  • SHA256: 095e602c51f0fd82c0b4123ba14af0894ac018c498764d161ceb81d41cae3709
  • MD5: 46dc984fb7e3c14295173ca82fa4f597
  • BLAKE2b-256: c8de6812b5546f8e7aa8c83774fcb9051daa6afcc015aead6ba33db0d74be06f

