Simple PyTorch-based tagger

pttagger is a simple PyTorch-based tagger which has the following features:

  • bi-directional LSTM

  • variable-sized mini-batches

  • multiple inputs

Quick Start

Installation

Run this command in your terminal:

$ pip install pttagger

Pre-processing

Suppose that you have the following examples of named entity recognition:

  • Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O

  • Jane/B-PER Doe/I-PER belongs/O to/O Kyoto/B-ORG University/I-ORG ./O

First, pass the examples to Dataset to construct a dataset object:

from pttagger.dataset import Dataset

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)
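
If your data is stored as slash-tagged text like the sentences above, a small helper can build the example dictionaries. The following is a minimal sketch; parse_tagged_sentence is a hypothetical helper written for this illustration and is not part of pttagger:

```python
def parse_tagged_sentence(line):
    """Turn 'Joe/B-PER Doe/I-PER ...' into {'Xs': [[words]], 'Y': [tags]}."""
    words, tags = [], []
    for token in line.split():
        # Split on the LAST slash so that words containing '/' stay intact.
        word, sep, tag = token.rpartition('/')
        if not sep:
            raise ValueError('token without a tag: {!r}'.format(token))
        words.append(word)
        tags.append(tag)
    # A single input (words only), so Xs holds exactly one list.
    return {'Xs': [words], 'Y': tags}


example = parse_tagged_sentence('Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O')
```

The resulting dictionary has the same shape as the entries of examples, so a list of such dictionaries can be handed to Dataset.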

Xs can also hold multiple inputs. In the following example, Xs contains part-of-speech (POS) tags in addition to the words:

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)
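
With multiple inputs, every list in Xs has to line up token by token with Y. The check below is a sketch of a sanity test you might run before constructing the Dataset; check_alignment is a hypothetical helper, and whether pttagger performs a similar validation itself is not stated here:

```python
def check_alignment(example):
    """Return True if every input sequence in Xs is as long as Y, else raise."""
    expected = len(example['Y'])
    for i, xs in enumerate(example['Xs']):
        if len(xs) != expected:
            raise ValueError(
                'input {} has {} items, expected {}'.format(i, len(xs), expected))
    return True


example = {
    'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.'],
           ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', '.']],
    'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O'],
}
aligned = check_alignment(example)
```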

Now, dataset has the following two indices:

  • x_to_index: e.g., [{'<PAD>': 0, '<UNK>': 1, 'Joe': 2, 'Doe': 3, ...}]

  • y_to_index: e.g., {'<PAD>': 0, '<UNK>': 1, 'B-PER': 2, 'I-PER': 3, ...}

x_to_index is a list with one index per input, so if you use multiple inputs it contains one dictionary for each of them.
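
To make the shape of these indices concrete, here is a minimal sketch of how such a mapping could be built and used. It assumes that items are numbered in order of first appearance after the two reserved entries and that unseen items fall back to '<UNK>'; build_index and encode are hypothetical names, and pttagger's own implementation may differ:

```python
PAD, UNK = '<PAD>', '<UNK>'


def build_index(sequences):
    """Map each distinct item to an integer, reserving 0 for PAD and 1 for UNK."""
    index = {PAD: 0, UNK: 1}
    for seq in sequences:
        for item in seq:
            # Assign the next free integer the first time an item is seen.
            index.setdefault(item, len(index))
    return index


def encode(seq, index):
    """Replace each item by its integer, using UNK for items not in the index."""
    return [index.get(item, index[UNK]) for item in seq]


examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
]
n_inputs = len(examples[0]['Xs'])
# One index per input, mirroring the list structure of x_to_index.
x_to_index = [build_index(ex['Xs'][i] for ex in examples) for i in range(n_inputs)]
y_to_index = build_index(ex['Y'] for ex in examples)
encoded = encode(['Richard', 'Doe', 'goes'], x_to_index[0])
```

In this sketch 'Richard' does not occur in the training examples, so it is encoded with the '<UNK>' index.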

Training

Construct a Model object and train it as follows:

from pttagger.model import Model

EMBEDDING_DIMS = [100]  # if you use multiple inputs, set a dimension for each input
HIDDEN_DIMS = [10]  # the same as above
x_set_sizes = [len(x_to_index) for x_to_index in dataset.x_to_index]
y_set_size = len(dataset.y_to_index)
model = Model(EMBEDDING_DIMS, HIDDEN_DIMS, x_set_sizes, y_set_size)
model.train(dataset)
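
Sentences in a mini-batch usually differ in length, which is why variable-sized mini-batches are listed as a feature. The sketch below shows the common preparation step for an LSTM, sorting by length and padding with the '<PAD>' index 0. It illustrates the general technique only; pad_batch is a hypothetical helper, and how pttagger handles this internally is an assumption:

```python
def pad_batch(batch, pad_index=0):
    """Sort encoded sequences longest first and pad them to a common length."""
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0] if lengths else 0
    # Append pad_index until every sequence reaches max_len.
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths


padded, lengths = pad_batch([[2, 3, 4, 5, 6, 7], [8, 3, 9, 5, 10, 11, 7]])
```

The original lengths are kept next to the padded sequences so that the padding positions can be ignored later, for example when packing the batch with torch.nn.utils.rnn.pack_padded_sequence.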

Test

Predict tags for test examples like this:

test_examples = [
    {'Xs': [['Richard', 'Roe', 'comes', 'to', 'America', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']}
]
test_dataset = Dataset(test_examples)
results = model.test(test_dataset)
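
The format of results is not described here. If you can obtain one predicted tag sequence per test example, a token-level accuracy gives a quick first impression of quality. This is a sketch under that assumption; token_accuracy is a hypothetical helper, and the predicted tags below are made up for illustration, not actual model output:

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        for g, p in zip(gold_tags, pred_tags):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


gold = [['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']]
# Illustrative prediction with one mistake on the second token.
predicted = [['B-PER', 'O', 'O', 'O', 'B-LOC', 'O']]
accuracy = token_accuracy(gold, predicted)
```

For named entity recognition, entity-level precision and recall are usually more informative than token accuracy, because most tokens carry the O tag.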
