Text utilities and datasets for generic deep learning. A fork of torchtext that uses numpy to store datasets for more generic use.

Project description

This repository consists of:

  • pytext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)

  • pytext.datasets: Pre-built loaders for common NLP datasets

It is a fork of torchtext, but it uses numpy ndarrays for datasets instead of torch.Tensor or Variable, making it a more generic toolbox for NLP users.

Installation

Make sure you have Python 2.7 or 3.5+ and PyTorch 0.4.0 or newer. You can then install pytexttool using pip:

pip install pytexttool

Optional requirements

If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you might want to use the Moses tokenizer from NLTK. You then have to install NLTK and download the data it needs:

pip install nltk
python -m nltk.downloader perluniprops nonbreaking_prefixes
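
If the fork keeps torchtext's tokenizer shortcuts, a Field can then select these tokenizers by name. A minimal sketch (the 'spacy' and 'moses' shortcut strings follow torchtext conventions and are assumed to carry over; the pytext.data import path follows the module listing above):

    >>> from pytext import data
    >>> # 'spacy' uses SpaCy's English tokenizer, 'moses' uses NLTK's Moses
    >>> # tokenizer; both shortcuts are assumed to behave as in torchtext
    >>> TEXT_SPACY = data.Field(tokenize='spacy')
    >>> TEXT_MOSES = data.Field(tokenize='moses')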

Data

The data module provides the following:

  • Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:

    >>> pos = data.TabularDataset(
    ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
    ...    fields=[('text', data.Field()),
    ...            ('labels', data.Field())])
    ...
    >>> sentiment = data.TabularDataset(
    ...    path='data/sentiment/train.json', format='json',
    ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
    ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
  • Ability to define a preprocessing pipeline:

    >>> src = data.Field(tokenize=my_custom_tokenizer)
    >>> trg = data.Field(tokenize=my_custom_tokenizer)
    >>> mt_train = datasets.TranslationDataset(
    ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    ...     fields=(src, trg))
  • Batching, padding, and numericalizing (including building a vocabulary object):

    >>> # continuing from above
    >>> mt_dev = datasets.TranslationDataset(
    ...     path='data/mt/newstest2014', exts=('.en', '.de'),
    ...     fields=(src, trg))
    >>> src.build_vocab(mt_train, max_size=80000)
    >>> trg.build_vocab(mt_train, max_size=40000)
    >>> # mt_dev shares the fields, so it shares their vocab objects
    >>>
    >>> train_iter = data.BucketIterator(
    ...     dataset=mt_train, batch_size=32,
    ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
    >>> # usage
    >>> next(iter(train_iter))
    <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
  • Wrapper for dataset splits (train, validation, test):

    >>> TEXT = data.Field()
    >>> LABELS = data.Field()
    >>>
    >>> train, val, test = data.TabularDataset.splits(
    ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
    ...     fields=[('text', TEXT), ('labels', LABELS)])
    >>>
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_sizes=(16, 256, 256),
    ...     sort_key=lambda x: len(x.text), device=0)
    >>>
    >>> TEXT.build_vocab(train)
    >>> LABELS.build_vocab(train)
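
Because this fork stores data as numpy ndarrays rather than torch tensors, the batches can be handed to any framework. A minimal sketch of feeding them to a PyTorch model, assuming the batch attributes come back as numpy arrays as the project description suggests (attribute names follow the fields defined just above):

    >>> import torch
    >>> # continuing from the POS tagging example above
    >>> batch = next(iter(train_iter))
    >>> text_np = batch.text                        # numpy ndarray of token ids
    >>> text_pt = torch.from_numpy(text_np)         # hand off to PyTorch ...
    >>> labels_pt = torch.from_numpy(batch.labels)  # ... or any other framework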

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb

  • Question classification: TREC

  • Entailment: SNLI

  • Language modeling: abstract class + WikiText-2

  • Machine translation: abstract class + Multi30k, IWSLT, WMT14

  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS

Others are planned or in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.
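
Assuming the fork keeps torchtext's splits interface for its built-in datasets, loading one of them might look like the sketch below (the SST class and its arguments follow torchtext and are assumptions here):

    >>> TEXT = data.Field()
    >>> LABEL = data.Field(sequential=False)
    >>> # SST.splits downloads the corpus and returns train/val/test datasets
    >>> train, val, test = datasets.SST.splits(TEXT, LABEL)
    >>> TEXT.build_vocab(train)
    >>> LABEL.build_vocab(train)
    >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
    ...     (train, val, test), batch_size=32)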

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytexttool-0.3.1.tar.gz (48.0 kB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pytexttool-0.3.1-py3-none-any.whl (60.5 kB)

Uploaded Python 3

pytexttool-0.3.1-py2-none-any.whl (60.5 kB)

Uploaded Python 2

File details

Details for the file pytexttool-0.3.1.tar.gz.

File metadata

  • Download URL: pytexttool-0.3.1.tar.gz
  • Upload date:
  • Size: 48.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/36.4.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.1

File hashes

Hashes for pytexttool-0.3.1.tar.gz
  • SHA256: a0718c0a6ac5a63266ebaff5780eba63e5b01d137b8f740adb3aab593a853915
  • MD5: 36b1174cbf7129ff4b210cc4f2ebd6ee
  • BLAKE2b-256: 17a73dae3ef0f428183ccca084fd7a73747c50b6222be0fdbab4a17542a8d416

See more details on using hashes here.

File details

Details for the file pytexttool-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: pytexttool-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 60.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/36.4.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.1

File hashes

Hashes for pytexttool-0.3.1-py3-none-any.whl
  • SHA256: 5ba8bef67e8698b04374fed400fd4f64b6ce460737bf4228d04a11c3b2ec2c14
  • MD5: 1300de3e55bf1f90168f342989b83bf8
  • BLAKE2b-256: 84e2c7ea7b4b65acd7b63917699cf2a5a9dc6d6430148e0d799cf41b81dc9969

See more details on using hashes here.

File details

Details for the file pytexttool-0.3.1-py2-none-any.whl.

File metadata

  • Download URL: pytexttool-0.3.1-py2-none-any.whl
  • Upload date:
  • Size: 60.5 kB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/36.4.0 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.1

File hashes

Hashes for pytexttool-0.3.1-py2-none-any.whl
  • SHA256: abd399bb54948264fb9416048ea0c0fdf967dee98513e82e22862dbb6e3986cd
  • MD5: 46f272a1c6d492eae96273f1384f4b73
  • BLAKE2b-256: 861ee42c90663c9840425a6b640d4808c0507b137b1dcb7c3d2a73f362eb627e

See more details on using hashes here.
