
Simple PyTorch-based tagger


pttagger is a simple PyTorch-based sequence tagger with the following features:

  • stacked bi-directional RNN (GRU or LSTM)

  • variable-sized mini-batches

  • multiple inputs

Quick Start

Installation

Run this command in your terminal:

$ pip install pttagger

Pre-processing

Suppose that you have the following examples of named entity recognition:

  • Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O

  • Jane/B-PER Doe/I-PER belongs/O to/O Kyoto/B-ORG University/I-ORG ./O

First, pass the examples to the Dataset constructor. Each example is a dict in which Xs is a list of input sequences and Y is the list of tags:

from pttagger.dataset import Dataset

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)
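If your data is stored as token/TAG strings like the ones listed above, a small helper can convert it into this format. The following is a minimal sketch; parse_tagged_sentence is a hypothetical helper written for illustration and is not part of pttagger.

```python
def parse_tagged_sentence(line):
    """Split 'token/TAG token/TAG ...' into an example dict with 'Xs' and 'Y'."""
    tokens, tags = [], []
    for pair in line.split():
        # rpartition splits on the last '/', so a token containing '/' stays intact
        token, _, tag = pair.rpartition('/')
        tokens.append(token)
        tags.append(tag)
    # A single input sequence, so Xs is a list holding one list of tokens
    return {'Xs': [tokens], 'Y': tags}


lines = [
    'Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O',
    'Jane/B-PER Doe/I-PER belongs/O to/O Kyoto/B-ORG University/I-ORG ./O',
]
examples = [parse_tagged_sentence(line) for line in lines]
```

The resulting examples list has the same shape as the hand-written one and could be passed to Dataset in the same way.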

Xs can also hold multiple input sequences. In the following example, Xs contains part-of-speech (POS) tags in addition to the words:

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)

Now, dataset has the following two indices:

  • x_to_index: e.g., [{'<PAD>': 0, '<UNK>': 1, 'Joe': 2, 'Doe': 3, ...}]

  • y_to_index: e.g., {'<PAD>': 0, '<UNK>': 1, 'B-PER': 2, 'I-PER': 3, ...}

If you use multiple inputs, x_to_index holds one index per input, in the same order as the sequences in Xs.
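To make the structure of these indices concrete, here is a minimal sketch of how such mappings could be built. It assumes that ids are assigned in order of first appearance after the two reserved entries, which matches the example values above but is not confirmed to be how pttagger does it internally; build_index is a hypothetical helper.

```python
def build_index(sequences):
    """Map each distinct item to an integer, reserving 0 and 1 for special tokens."""
    index = {'<PAD>': 0, '<UNK>': 1}
    for sequence in sequences:
        for item in sequence:
            if item not in index:
                index[item] = len(index)  # next unused integer
    return index


examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.'],
            ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.'],
            ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
]

# One index per input (words, then POS tags) and a single index for the tags
num_inputs = len(examples[0]['Xs'])
x_to_index = [build_index(ex['Xs'][i] for ex in examples) for i in range(num_inputs)]
y_to_index = build_index(ex['Y'] for ex in examples)
```

Here x_to_index is a list of two dicts, one for words and one for POS tags, while y_to_index is a single dict.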

Training

Construct a Model object and train it as follows:

from pttagger.model import Model

EMBEDDING_DIMS = [100]  # if you use multiple inputs, set a dimension for each input
HIDDEN_DIMS = [10]  # the same as above
x_set_sizes = [len(x_to_index) for x_to_index in dataset.x_to_index]
y_set_size = len(dataset.y_to_index)
model = Model(EMBEDDING_DIMS, HIDDEN_DIMS, x_set_sizes, y_set_size)
model.train(dataset)
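Training on variable-sized mini-batches generally means that sentences of different lengths are encoded as indices and padded to a common length. The sketch below shows one common way to do this in plain Python, using the <PAD> and <UNK> entries of an index; to_padded_batch is a hypothetical helper and is not taken from pttagger's code.

```python
def to_padded_batch(sequences, index, pad='<PAD>', unk='<UNK>'):
    """Encode sequences as index lists and pad them to the longest one."""
    # Items missing from the index fall back to the <UNK> id
    encoded = [[index.get(item, index[unk]) for item in seq] for seq in sequences]
    lengths = [len(seq) for seq in encoded]
    max_len = max(lengths)
    # Right-pad every sequence with the <PAD> id up to max_len
    padded = [seq + [index[pad]] * (max_len - len(seq)) for seq in encoded]
    return padded, lengths


index = {'<PAD>': 0, '<UNK>': 1, 'Joe': 2, 'Doe': 3,
         'goes': 4, 'to': 5, 'Japan': 6, '.': 7}
batch = [['Joe', 'Doe', 'goes', 'to', 'Japan', '.'],
         ['Jane', 'goes', '.']]
padded, lengths = to_padded_batch(batch, index)
# padded  -> [[2, 3, 4, 5, 6, 7], [1, 4, 7, 0, 0, 0]]
# lengths -> [6, 3]
```

The original lengths are returned alongside the padded batch because an RNN implementation usually needs them to ignore the padded positions.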

You can also use the following parameters:

  • use_lstm - If True, uses LSTM. Default: False (uses GRU)

  • num_layers - Number of recurrent layers. Default: 1
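As a rough illustration of what these two parameters correspond to in PyTorch, the following sketch builds a stacked bi-directional tagger for a single input. TaggerSketch and its attribute names are hypothetical and simplified; this is not pttagger's actual Model class, and how Model accepts these parameters is not shown here.

```python
import torch
import torch.nn as nn


class TaggerSketch(nn.Module):
    """Single-input, stacked bi-directional RNN tagger (illustrative only)."""

    def __init__(self, embedding_dim, hidden_dim, x_set_size, y_set_size,
                 use_lstm=False, num_layers=1):
        super().__init__()
        # Index 0 is assumed to be <PAD>, so its embedding stays at zero
        self.embedding = nn.Embedding(x_set_size, embedding_dim, padding_idx=0)
        rnn_class = nn.LSTM if use_lstm else nn.GRU
        self.rnn = rnn_class(embedding_dim, hidden_dim, num_layers=num_layers,
                             bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated, hence 2 * hidden_dim
        self.hidden_to_y = nn.Linear(2 * hidden_dim, y_set_size)

    def forward(self, x):
        embedded = self.embedding(x)      # (batch, seq_len, embedding_dim)
        output, _ = self.rnn(embedded)    # (batch, seq_len, 2 * hidden_dim)
        return self.hidden_to_y(output)   # (batch, seq_len, y_set_size)


model = TaggerSketch(100, 10, x_set_size=12, y_set_size=8,
                     use_lstm=True, num_layers=2)
x = torch.tensor([[2, 3, 4, 5, 6, 7], [8, 4, 7, 0, 0, 0]])
scores = model(x)  # one score per tag for every token position
```

With use_lstm=False the same code uses nn.GRU instead, and num_layers controls how many recurrent layers are stacked.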

Test

Predict tags for test examples like this:

test_examples = [
    {'Xs': [['Richard', 'Roe', 'comes', 'to', 'America', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']}
]
test_dataset = Dataset(test_examples)
results = model.test(test_dataset)

