pttagger

Simple PyTorch-based tagger

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English

Project description

pttagger is a simple PyTorch-based tagger with the following features:

- stacked bi-directional RNN (GRU or LSTM)
- variable-sized mini-batches
- multiple inputs
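
Variable-sized mini-batches are commonly handled by padding every sequence in a batch to the length of the longest one. The sketch below illustrates that general idea in plain Python. It is not taken from pttagger's source: `pad_batch` is a hypothetical helper, and using 0 as the padding index is an assumption.

```python
PAD_INDEX = 0  # assumption: index reserved for padding


def pad_batch(batch, pad_index=PAD_INDEX):
    """Pad each index sequence to the length of the longest one in the batch.

    Returns the padded batch and the original lengths, which a recurrent
    model typically needs in order to ignore the padded positions.
    """
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in batch]
    return padded, lengths


# Two already-indexed sentences of different lengths (made-up indices).
batch = [[2, 3, 4, 5, 6, 7], [8, 3, 9, 5, 10, 11, 7]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```

Keeping the original lengths alongside the padded batch is a common design choice, since it lets downstream code mask out the padding.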

Quick Start

Installation

Run this command in your terminal:

$ pip install pttagger

Pre-processing

Suppose that you have the following examples of named entity recognition:

Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O
Jane/B-PER Doe/I-PER belongs/O to/O Kyoto/B-ORG University/I-ORG ./O
…

First, pass the examples to Dataset to construct a Dataset object:

from pttagger.dataset import Dataset

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)
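
If your data is stored as slash-annotated lines, a small helper can convert each line into the dictionary format used for examples. The following is a minimal sketch: `parse_tagged_line` is a hypothetical function and not part of pttagger, and it assumes every token has the form word/TAG.

```python
def parse_tagged_line(line):
    """Turn 'Joe/B-PER Doe/I-PER ...' into {'Xs': [[words]], 'Y': [tags]}."""
    words, tags = [], []
    for token in line.split():
        # Split on the last slash so a word that itself contains '/' stays whole.
        word, tag = token.rsplit('/', 1)
        words.append(word)
        tags.append(tag)
    # 'Xs' is a list of input sequences; here there is a single input (words).
    return {'Xs': [words], 'Y': tags}


lines = [
    'Joe/B-PER Doe/I-PER goes/O to/O Japan/B-LOC ./O',
    'Jane/B-PER Doe/I-PER belongs/O to/O Kyoto/B-ORG University/I-ORG ./O',
]
examples = [parse_tagged_line(line) for line in lines]
print(examples[0])
```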

Xs can also hold multiple inputs. In the following case, Xs contains not only the words but also their POS tags:

examples = [
    {'Xs': [['Joe', 'Doe', 'goes', 'to', 'Japan', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']},
    {'Xs': [['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.'], ['NNP', 'NNP', 'VBZ', 'TO', 'NNP', 'NNP', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']},
    ...
]
dataset = Dataset(examples)

Now, dataset has the following two indices:

x_to_index: e.g., [{'<PAD>': 0, '<UNK>': 1, 'Joe': 2, 'Doe': 3, ...}]
y_to_index: e.g., {'<PAD>': 0, '<UNK>': 1, 'B-PER': 2, 'I-PER': 3, ...}

x_to_index is a list with one index per input, so if you use multiple inputs it contains one dictionary for each of them.
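
To make the structure of these indices concrete, here is a minimal sketch of how such a mapping could be built and used. It only reproduces the layout shown above (`<PAD>` at 0, `<UNK>` at 1, then items in first-seen order). `build_index` and `encode` are hypothetical helpers, and pttagger's own implementation may differ.

```python
PAD, UNK = '<PAD>', '<UNK>'


def build_index(sequences):
    """Map each distinct item to an integer, reserving 0 and 1 for PAD and UNK."""
    index = {PAD: 0, UNK: 1}
    for seq in sequences:
        for item in seq:
            if item not in index:
                index[item] = len(index)
    return index


def encode(seq, index):
    """Convert items to integers, falling back to UNK for unseen items."""
    return [index.get(item, index[UNK]) for item in seq]


words = [
    ['Joe', 'Doe', 'goes', 'to', 'Japan', '.'],
    ['Jane', 'Doe', 'belongs', 'to', 'Kyoto', 'University', '.'],
]
word_to_index = build_index(words)
encoded = encode(['Richard', 'Roe', 'comes', 'to', 'America', '.'], word_to_index)
print(word_to_index)
print(encoded)
```

Words that never appeared when the index was built, such as 'Richard' here, are mapped to the `<UNK>` entry.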

Training

Construct a Model object and train it as follows:

from pttagger.model import Model

EMBEDDING_DIMS = [100]  # if you use multiple inputs, set a dimension for each input
HIDDEN_DIMS = [10]  # the same as above
x_set_sizes = [len(x_to_index) for x_to_index in dataset.x_to_index]
y_set_size = len(dataset.y_to_index)
model = Model(EMBEDDING_DIMS, HIDDEN_DIMS, x_set_sizes, y_set_size)
model.train(dataset)
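
With multiple inputs, EMBEDDING_DIMS and HIDDEN_DIMS each need one entry per input. The sketch below shows the bookkeeping for a two-input setup (words and POS tags) without calling pttagger. The dictionaries stand in for dataset.x_to_index, the dimension values are arbitrary, and `check_dims` is a hypothetical sanity check rather than a library function.

```python
# Stand-ins for dataset.x_to_index with two inputs: words and POS tags.
x_to_index = [
    {'<PAD>': 0, '<UNK>': 1, 'Joe': 2, 'Doe': 3},
    {'<PAD>': 0, '<UNK>': 1, 'NNP': 2, 'VBZ': 3},
]

EMBEDDING_DIMS = [100, 10]  # arbitrary example values, one per input
HIDDEN_DIMS = [10, 10]      # likewise, one per input

x_set_sizes = [len(index) for index in x_to_index]


def check_dims(embedding_dims, hidden_dims, set_sizes):
    """Raise ValueError unless all three lists have one entry per input."""
    if not (len(embedding_dims) == len(hidden_dims) == len(set_sizes)):
        raise ValueError(
            'expected one embedding dim and one hidden dim per input, got '
            f'{len(embedding_dims)}, {len(hidden_dims)} and {len(set_sizes)} inputs'
        )


check_dims(EMBEDDING_DIMS, HIDDEN_DIMS, x_set_sizes)
print(x_set_sizes)
```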

You can also use the following parameters:

use_lstm - If True, uses LSTM. Default: False (uses GRU)
num_layers - Number of recurrent layers. Default: 1

Test

Predict tags for test examples like this:

test_examples = [
    {'Xs': [['Richard', 'Roe', 'comes', 'to', 'America', '.']],
     'Y': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']}
]
test_dataset = Dataset(test_examples)
results = model.test(test_dataset)  # pass the test set, not the training set
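
The format of the returned results is not described here. Assuming you can obtain one predicted tag sequence per test example from them, token-level accuracy against the gold tags can be computed as in the sketch below. `token_accuracy` is a hypothetical helper, and the predictions are made up for illustration.

```python
def token_accuracy(gold_sequences, predicted_sequences):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sequences, predicted_sequences):
        if len(gold) != len(predicted):
            raise ValueError('gold and predicted sequences differ in length')
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    return correct / total if total else 0.0


gold = [['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']]
predicted = [['B-PER', 'I-PER', 'O', 'O', 'O', 'O']]  # made-up predictions
accuracy = token_accuracy(gold, predicted)
print(accuracy)
```

For named entity recognition, entity-level precision and recall are usually more informative than token accuracy, because most tokens carry the O tag.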

Release history

0.1.1 (Jan 13, 2019, this version)
0.1.0 (Jan 3, 2019)