datasets
A dataset utils repository based on tf.data. For tensorflow>=2.0 only!
Requirements
- python 3.6
- tensorflow>=2.0
Installation
```bash
pip install nlp-datasets
```
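Since the package targets TF 2.x only, it can be worth verifying the environment you install into. A quick sanity check using nothing beyond plain tensorflow:

```python
# Sanity check: nlp-datasets targets tensorflow>=2.0 only.
import tensorflow as tf

assert tf.__version__.startswith('2'), 'tensorflow>=2.0 is required'
print('tensorflow', tf.__version__)
```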
Usage
seq2seq models
These models have a source sequence x and a target sequence y.
```python
from nlp_datasets import Seq2SeqDataset
from nlp_datasets import SpaceTokenizer
from nlp_datasets.utils import data_dir_utils as utils

files = [
    utils.get_data_file('iwslt15.tst2013.100.envi'),
    utils.get_data_file('iwslt15.tst2013.100.envi'),
]
# Build vocabularies for the source (en) and target (vi) sides.
x_tokenizer = SpaceTokenizer()
x_tokenizer.build_from_corpus([utils.get_data_file('iwslt15.tst2013.100.en')])
y_tokenizer = SpaceTokenizer()
y_tokenizer.build_from_corpus([utils.get_data_file('iwslt15.tst2013.100.vi')])

config = {
    'train_batch_size': 2,
    'predict_batch_size': 2,
    'eval_batch_size': 2,
    'buffer_size': 100
}
dataset = Seq2SeqDataset(x_tokenizer, y_tokenizer, config)

train_dataset = dataset.build_train_dataset(files)
print(next(iter(train_dataset)))
print('=' * 120)

eval_dataset = dataset.build_eval_dataset(files)
print(next(iter(eval_dataset)))
print('=' * 120)

predict_files = [utils.get_data_file('iwslt15.tst2013.100.envi')]
predict_dataset = dataset.build_predict_dataset(predict_files)
print(next(iter(predict_dataset)))
print('=' * 120)
```
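The `build_*` methods return ordinary `tf.data.Dataset` objects, so standard tf.data operations apply. A minimal sketch, assuming the training dataset yields `(x, y)` batches of token ids (the print output above shows the actual structure):

```python
# Peek at the first two batches without exhausting the dataset.
# NOTE: the (x, y) unpacking is an assumption about the element layout.
for x, y in train_dataset.take(2):
    print('x:', x.shape, 'y:', y.shape)
```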
sequence match models
These models take two sequences, x and y, as input, and have a label z.
```python
from nlp_datasets import SeqMatchDataset
from nlp_datasets import SpaceTokenizer
from nlp_datasets.utils import data_dir_utils as utils

files = [
    utils.get_data_file('dssm.query.doc.label.txt'),
    utils.get_data_file('dssm.query.doc.label.txt'),
]
# Both sequences share the same vocabulary file here.
x_tokenizer = SpaceTokenizer()
x_tokenizer.build_from_vocab(utils.get_data_file('dssm.vocab.txt'))
y_tokenizer = SpaceTokenizer()
y_tokenizer.build_from_vocab(utils.get_data_file('dssm.vocab.txt'))

config = {
    'train_batch_size': 2,
    'eval_batch_size': 2,
    'predict_batch_size': 2,
    'buffer_size': 100,
}
dataset = SeqMatchDataset(x_tokenizer, y_tokenizer, config)

train_dataset = dataset.build_train_dataset(files)
print(next(iter(train_dataset)))
print('=' * 120)

eval_dataset = dataset.build_eval_dataset(files)
print(next(iter(eval_dataset)))
print('=' * 120)

predict_files = [utils.get_data_file('dssm.query.doc.label.txt')]
predict_dataset = dataset.build_predict_dataset(predict_files)
print(next(iter(predict_dataset)))
print('=' * 120)
```
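With two sequences plus a label, the element layout is less obvious than in the seq2seq case. Rather than guessing, standard tf.data introspection reports it directly:

```python
# element_spec is standard tf.data (TF 2.x); it describes each element's
# nested structure and dtypes without pulling any data.
print(train_dataset.element_spec)
```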
sequence classify model
These models have an input sequence x and an output label y.
```python
from nlp_datasets import SeqClassifyDataset
from nlp_datasets import SpaceTokenizer
from nlp_datasets.utils import data_dir_utils as utils

files = [
    utils.get_data_file('classify.seq.label.txt')
]
x_tokenizer = SpaceTokenizer()
x_tokenizer.build_from_corpus([utils.get_data_file('classify.seq.txt')])

config = {
    'train_batch_size': 2,
    'eval_batch_size': 2,
    'predict_batch_size': 2,
    'buffer_size': 100
}
dataset = SeqClassifyDataset(x_tokenizer, config)

train_dataset = dataset.build_train_dataset(files)
print(next(iter(train_dataset)))
print('=' * 120)

eval_dataset = dataset.build_eval_dataset(files)
print(next(iter(eval_dataset)))
print('=' * 120)

predict_files = [utils.get_data_file('classify.seq.txt')]
predict_dataset = dataset.build_predict_dataset(predict_files)
print(next(iter(predict_dataset)))
print('=' * 120)
```
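Because the result is a plain tf.data pipeline, it can be passed straight to Keras. A minimal sketch, assuming the dataset yields `(sequence, label)` batches; `vocab_size` and `num_classes` are hypothetical placeholders, not values from this library:

```python
import tensorflow as tf

vocab_size = 10000   # hypothetical; use the tokenizer's real vocab size
num_classes = 2      # hypothetical; match your label set

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(train_dataset, epochs=1)  # assumes (sequence, label) batches
```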