Text utilities and datasets for generic deep learning. A fork of torchtext that uses numpy to store datasets, for more generic use.
Project description
This repository consists of:
pytext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
pytext.datasets: Pre-built loaders for common NLP datasets
It is a fork of torchtext that uses numpy ndarrays for datasets instead of torch.Tensor or Variable, making it a more generic toolbox for NLP users.
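As a minimal illustration of what numpy-backed storage means in practice (the function name here is hypothetical, not part of this library's API), variable-length token-id sequences can be padded into a single plain ndarray:

```python
import numpy as np

# Hypothetical sketch: pad variable-length token-id lists into one
# ndarray, the kind of storage this fork uses instead of torch.Tensor.
def pad_to_array(sequences, pad_id=1):
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
    return batch

batch = pad_to_array([[4, 7, 9], [4, 5]])
# batch.shape == (2, 3); the shorter row is padded with pad_id.
```

An ndarray like this can be consumed by PyTorch (via torch.from_numpy) or by any other framework, which is the point of dropping the torch.Tensor dependency.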
Installation
Make sure you have Python 2.7 or 3.5+ and PyTorch 0.4.0 or newer. You can then install pytexttool using pip:
pip install pytexttool
Optional requirements
If you want to use the English tokenizer from SpaCy, you need to install SpaCy and download its English model:
pip install spacy
python -m spacy download en
Alternatively, you might want to use the Moses tokenizer from NLTK. You have to install NLTK and download the data it needs:
pip install nltk
python -m nltk.downloader perluniprops nonbreaking_prefixes
Data
The data module provides the following:
Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:
>>> pos = data.TabularDataset(
...     path='data/pos/pos_wsj_train.tsv', format='tsv',
...     fields=[('text', data.Field()),
...             ('labels', data.Field())])
>>>
>>> sentiment = data.TabularDataset(
...     path='data/sentiment/train.json', format='json',
...     fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
...             'sentiment_gold': ('labels', data.Field(sequential=False))})
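Under the hood, loading a TSV dataset like the one above amounts to reading one example per row and mapping each column to a named field. A rough stdlib-only sketch of that idea (not the library's actual implementation; the data is inlined here instead of read from disk):

```python
import csv
import io

# Sketch: one example per TSV row, each column mapped to a named field.
tsv = "The cat sat\tDT NN VBD\nDogs bark\tNNS VBP\n"
examples = []
for row in csv.reader(io.StringIO(tsv), delimiter="\t"):
    examples.append({"text": row[0].split(), "labels": row[1].split()})

# examples[0] == {"text": ["The", "cat", "sat"], "labels": ["DT", "NN", "VBD"]}
```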
Ability to define a preprocessing pipeline:
>>> src = data.Field(tokenize=my_custom_tokenizer)
>>> trg = data.Field(tokenize=my_custom_tokenizer)
>>> mt_train = datasets.TranslationDataset(
...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
...     fields=(src, trg))
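Here my_custom_tokenizer is whatever callable you supply; a Field's tokenize argument only needs a function from a string to a list of tokens. A hypothetical stand-in:

```python
import re

def my_custom_tokenizer(text):
    # Lowercase, then split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = my_custom_tokenizer("Hello, world!")
# tokens == ["hello", ",", "world", "!"]
```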
Batching, padding, and numericalizing (including building a vocabulary object):
>>> # continuing from above
>>> mt_dev = datasets.TranslationDataset(
...     path='data/mt/newstest2014', exts=('.en', '.de'),
...     fields=(src, trg))
>>> src.build_vocab(mt_train, max_size=80000)
>>> trg.build_vocab(mt_train, max_size=40000)
>>> # mt_dev shares the fields, so it shares their vocab objects
>>>
>>> train_iter = data.BucketIterator(
...     dataset=mt_train, batch_size=32,
...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
>>> # usage
>>> next(iter(train_iter))
<data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
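The sort_key above exists so that BucketIterator can group examples of similar length and minimize padding inside each batch. The core idea can be sketched in plain Python (an illustration only, not the library's code):

```python
# Sketch of length bucketing: order examples by length, then slice into
# batches so each batch only pads up to its own maximum length.
def bucket_batches(examples, batch_size, key=len):
    ordered = sorted(examples, key=key)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

data = [[1, 2, 3], [1], [1, 2], [1, 2, 3, 4]]
batches = bucket_batches(data, batch_size=2)
# batches == [[[1], [1, 2]], [[1, 2, 3], [1, 2, 3, 4]]]
```

The real iterator additionally shuffles between buckets so training still sees the data in a randomized order; interleave_keys extends the same idea to sort by two lengths (source and target) at once.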
Wrapper for dataset splits (train, validation, test):
>>> TEXT = data.Field()
>>> LABELS = data.Field()
>>>
>>> train, val, test = data.TabularDataset.splits(
...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
...     validation='_dev.tsv', test='_test.tsv', format='tsv',
...     fields=[('text', TEXT), ('labels', LABELS)])
>>>
>>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
...     (train, val, test), batch_sizes=(16, 256, 256),
...     sort_key=lambda x: len(x.text), device=0)
>>>
>>> TEXT.build_vocab(train)
>>> LABELS.build_vocab(train)
Datasets
The datasets module currently contains:
Sentiment analysis: SST and IMDb
Question classification: TREC
Entailment: SNLI
Language modeling: abstract class + WikiText-2
Machine translation: abstract class + Multi30k, IWSLT, WMT14
Sequence tagging (e.g. POS/NER): abstract class + UDPOS
Others are planned or a work in progress:
Question answering: SQuAD
See the test directory for examples of dataset usage.
Download files
Built Distributions
Hashes for pytexttool-0.3.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 5ba8bef67e8698b04374fed400fd4f64b6ce460737bf4228d04a11c3b2ec2c14
MD5 | 1300de3e55bf1f90168f342989b83bf8
BLAKE2b-256 | 84e2c7ea7b4b65acd7b63917699cf2a5a9dc6d6430148e0d799cf41b81dc9969
Hashes for pytexttool-0.3.1-py2-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | abd399bb54948264fb9416048ea0c0fdf967dee98513e82e22862dbb6e3986cd
MD5 | 46f272a1c6d492eae96273f1384f4b73
BLAKE2b-256 | 861ee42c90663c9840425a6b640d4808c0507b137b1dcb7c3d2a73f362eb627e