Simple PyTorch-based tagger

pttagger is a simple PyTorch-based tagger with the following features:

  • stacked bi-directional RNN (GRU or LSTM)

  • variable-sized mini-batches

  • multiple inputs

Quick Start

Installation

Run this command in your terminal:

$ pip install pttagger

Pre-processing

Suppose that you have the following examples of named entity recognition:

  • Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O

  • Jane/B-PER Doe/I-PER belongs/O to/O Kyoto/B-ORG University/I-ORG ./O
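
If your data is stored in the token/TAG notation shown above, it can be converted to the Xs/Y dictionaries used in the rest of this guide. The following is a minimal sketch in plain Python; parse_tagged_sentence is a hypothetical helper written for illustration and is not part of pttagger:

```python
def parse_tagged_sentence(line):
    """Split whitespace-separated 'token/TAG' pairs into an example dict."""
    tokens, tags = [], []
    for pair in line.split():
        # rpartition splits on the last '/', so a token that itself
        # contains '/' keeps everything before the final slash.
        token, _, tag = pair.rpartition('/')
        tokens.append(token)
        tags.append(tag)
    # A single input (the words), so Xs holds exactly one list.
    return {'Xs': [tokens], 'Y': tags}


lines = [
    'Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O',
    'Jane/B-PER Doe/I-PER belongs/O to/O Kyoto/B-ORG University/I-ORG ./O',
]
examples = [parse_tagged_sentence(line) for line in lines]
```

Each resulting dictionary has one word list under 'Xs' and a tag list of the same length under 'Y'.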

First, pass the examples to Dataset to construct a Dataset object:

from pttagger.dataset import Dataset

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)

Xs can also hold multiple inputs. In the following case, Xs contains part-of-speech (POS) tags in addition to the words:

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)

Now, dataset has the following two indices:

  • x_to_index: e.g., [{'<PAD>': 0, '<UNK>': 1, 'Joe': 2, 'Doe': 3, ...}]

  • y_to_index: e.g., {'<PAD>': 0, '<UNK>': 1, 'B-PER': 2, 'I-PER': 3, ...}

If you use multiple inputs, x_to_index is a list with one dictionary per input.
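
To make the role of the <PAD> and <UNK> entries concrete, here is a minimal sketch of how an index of this shape could be built and used to pad a variable-sized mini-batch. It is written in plain Python under the assumption that items are numbered in order of first appearance; build_index and encode_batch are hypothetical helpers, not pttagger's implementation:

```python
PAD, UNK = '<PAD>', '<UNK>'


def build_index(sequences):
    """Map each distinct item to an integer, reserving 0 and 1."""
    index = {PAD: 0, UNK: 1}
    for seq in sequences:
        for item in seq:
            if item not in index:
                index[item] = len(index)
    return index


def encode_batch(sequences, index):
    """Convert items to ids and right-pad every row to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    batch = []
    for seq in sequences:
        # Items missing from the index fall back to the <UNK> id.
        ids = [index.get(item, index[UNK]) for item in seq]
        ids += [index[PAD]] * (max_len - len(ids))
        batch.append(ids)
    return batch


words = [
    ['Joe', 'Doe', 'goes', 'to', 'Japan', '.'],
    ['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.'],
]
word_to_index = build_index(words)
batch = encode_batch(words, word_to_index)
```

With multiple inputs, the same steps would be repeated once per input, giving one dictionary per entry of Xs.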

Training

Construct a Model object and train it as follows:

from pttagger.model import Model

EMBEDDING_DIMS = [100]  # if you use multiple inputs, set a dimension for each input
HIDDEN_DIMS = [10]  # likewise, one hidden dimension per input
x_set_sizes = [len(x_to_index) for x_to_index in dataset.x_to_index]
y_set_size = len(dataset.y_to_index)
model = Model(EMBEDDING_DIMS, HIDDEN_DIMS, x_set_sizes, y_set_size)
model.train(dataset)

You can also use the following parameters:

  • use_lstm - If True, uses LSTM. Default: False (uses GRU)

  • num_layers - Number of recurrent layers. Default: 1
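
For readers unfamiliar with these options, the sketch below shows how flags like use_lstm and num_layers typically map onto PyTorch's own recurrent modules. It uses only torch.nn and makes no claim about pttagger's internals; make_rnn is a hypothetical helper and the tensor sizes are chosen for illustration:

```python
import torch
from torch import nn


def make_rnn(input_size, hidden_size, use_lstm=False, num_layers=1):
    """Build a stacked bi-directional GRU (default) or LSTM."""
    rnn_class = nn.LSTM if use_lstm else nn.GRU
    return rnn_class(input_size, hidden_size, num_layers=num_layers,
                     bidirectional=True, batch_first=True)


rnn = make_rnn(input_size=100, hidden_size=10, use_lstm=True, num_layers=2)

# A batch of 2 sequences, 7 time steps each, with 100-dimensional embeddings.
embedded = torch.randn(2, 7, 100)
output, _ = rnn(embedded)

# Forward and backward states are concatenated, so the last dimension
# is 2 * hidden_size.
print(output.shape)  # torch.Size([2, 7, 20])
```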

Test

Predict tags for test examples like this:

test_examples = [
    {'Xs': [['Richard', 'Roe', 'comes', 'to', 'America', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']}
]
test_dataset = Dataset(test_examples)
results = model.test(test_dataset)
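
The structure of results is not described here. Assuming the predictions have been converted to one list of tag strings per sentence, token-level accuracy against the gold Y tags could be computed as in this sketch; token_accuracy is a hypothetical helper and the predicted tags below are made up for illustration, not real model output:

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold_tags, predicted_tags in zip(gold, predicted):
        for gold_tag, predicted_tag in zip(gold_tags, predicted_tags):
            correct += (gold_tag == predicted_tag)
            total += 1
    return correct / total if total else 0.0


gold = [['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']]
# Hypothetical prediction that misses the location entity.
predicted = [['B-PER', 'I-PER', 'O', 'O', 'O', 'O']]
accuracy = token_accuracy(gold, predicted)
```

Token accuracy is only a rough signal for named entity recognition, because the frequent O tag dominates the score; entity-level precision and recall are usually more informative.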
