Sequence Tagger for Partially Annotated Dataset in PyTorch
pytorch-partial-tagger
This is a library for building a CRF tagger from a partially annotated dataset in PyTorch. You can build your own NER tagger from only a dictionary. The algorithm of this tagger is based on Effland and Collins (2021).
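As a rough illustration of what annotating "from only a dictionary" can look like, the sketch below finds every occurrence of each dictionary entry in a text and records it as a (start, length, label) triple. It uses only the Python standard library; the helper name match_dictionary and the toy dictionary are hypothetical and are not part of pytorch-partial-tagger.

```python
def match_dictionary(text, dictionary):
    """Return sorted (start, length, label) triples for every dictionary hit in text."""
    spans = []
    for surface, label in dictionary.items():
        if not surface:
            # Skip empty entries; they would match at every position.
            continue
        start = text.find(surface)
        while start != -1:
            spans.append((start, len(surface), label))
            # Continue searching after the end of the current match.
            start = text.find(surface, start + len(surface))
    return sorted(spans)


dictionary = {"Tokyo": "LOC", "Japan": "LOC"}  # hypothetical toy dictionary
text = "Tokyo is the capital of Japan."
print(match_dictionary(text, dictionary))
```

Words that are missing from the dictionary simply receive no triple, which is what makes the resulting annotation partial. Each triple carries the same three fields that create_tag takes in the Usage section.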
Usage
Import all dependencies first:
import torch
from partial_tagger.data import CharBasedTags
from partial_tagger.training import Trainer
from partial_tagger.utils import Metric, create_tag
Prepare your own datasets.
Each item of a dataset must have a string and tags. The string is the input text, shown as text below.
The tags are a collection of tags, where each tag has a start, a length, and a label; they are shown as tags below.
A start is the character position in text where a tag starts (0 for the first character, as in the example below).
A length is the number of characters in text between the beginning and the end of a tag.
A label is what you want to assign to the span of text defined by a start and a length.
from partial_tagger.data import CharBasedTags
from partial_tagger.utils import create_tag
text = "Tokyo is the capital of Japan."
tags = CharBasedTags(
    (
        create_tag(start=0, length=5, label="LOC"),  # Tag for Tokyo
        create_tag(start=24, length=5, label="LOC")  # Tag for Japan
    ),
    text
)
train_dataset = [(text, tags), ...]
validation_dataset = [...]
test_dataset = [...]
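To double-check a start and a length before building a dataset, plain string slicing is enough. The sketch below mirrors the two tags above as bare (start, length, label) triples; the helper covered_text is a hypothetical name used only for illustration and does not come from the library.

```python
text = "Tokyo is the capital of Japan."
spans = [(0, 5, "LOC"), (24, 5, "LOC")]  # (start, length, label), mirroring the tags above


def covered_text(text, start, length):
    """Return the substring a tag covers: characters start through start + length - 1."""
    return text[start:start + length]


for start, length, label in spans:
    print(label, repr(covered_text(text, start, length)))
```

If a slice does not print the entity you expect, the start or the length is off by some number of characters.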
Next, train your tagger and evaluate its performance.
You can train your own tagger by initializing Trainer and passing your datasets to it.
After training, trainer gives you a Recognizer object, which predicts character-based tags from given texts.
You can evaluate the performance of your tagger using Metric, as below.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Fall back to CPU when no GPU is available
trainer = Trainer()
recognizer = trainer(train_dataset, validation_dataset, device)
texts, ground_truths = zip(*test_dataset, strict=True)
batch_size = 15
predictions = recognizer(texts, batch_size, device)
metric = Metric()
metric(predictions, ground_truths)
print(metric.get_scores()) # Display F1-score, Precision, Recall
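For intuition about the three reported numbers, here is a minimal sketch of exact-match precision, recall, and F1-score over spans. It assumes a prediction counts as correct only when its start, length, and label all match a ground-truth span; whether Metric computes its scores in exactly this way is an assumption, and span_scores is a hypothetical helper, not a library function.

```python
def span_scores(predicted, gold):
    """Compute exact-match precision, recall, and F1 over two collections of spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1_score = 0.0
    else:
        f1_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1_score": f1_score}


gold = {(0, 5, "LOC"), (24, 5, "LOC")}      # Tokyo and Japan
predicted = {(0, 5, "LOC"), (9, 3, "LOC")}  # Tokyo is right, the second span is spurious
print(span_scores(predicted, gold))
```

In this toy case one of the two predictions is correct and one of the two ground-truth spans is found, so precision, recall, and F1-score are all 0.5.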
Installation
pip install pytorch-partial-tagger
References
- Thomas Effland and Michael Collins. 2021. Partially Supervised Named Entity Recognition via the Expected Entity Ratio Loss. Transactions of the Association for Computational Linguistics, 9:1320–1335.