Project description
datasets
A dataset utils repository based on tf.data. For tensorflow>=2.0 only!
Requirements
- python 3.6
- tensorflow>=2.0
Installation
```bash
pip install nlp-datasets
```
Usage
seq2seq models
These models take a source sequence x and a target sequence y.
```python
from nlp_datasets import Seq2SeqDataset
from nlp_datasets import SpaceTokenizer
from nlp_datasets.utils import data_dir_utils as utils

files = [
    utils.get_data_file('iwslt15.tst2013.100.envi'),
    utils.get_data_file('iwslt15.tst2013.100.envi'),
]
x_tokenizer = SpaceTokenizer()
x_tokenizer.build_from_corpus([utils.get_data_file('iwslt15.tst2013.100.en')])
y_tokenizer = SpaceTokenizer()
y_tokenizer.build_from_corpus([utils.get_data_file('iwslt15.tst2013.100.vi')])

config = {
    'train_batch_size': 2,
    'predict_batch_size': 2,
    'eval_batch_size': 2,
    'buffer_size': 100
}

dataset = Seq2SeqDataset(x_tokenizer, y_tokenizer, config)

train_dataset = dataset.build_train_dataset(files)
print(next(iter(train_dataset)))
print('=' * 120)

eval_dataset = dataset.build_eval_dataset(files)
print(next(iter(eval_dataset)))
print('=' * 120)

predict_files = [utils.get_data_file('iwslt15.tst2013.100.envi')]
predict_dataset = dataset.build_predict_dataset(predict_files)
print(next(iter(predict_dataset)))
print('=' * 120)
```
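The build_*_dataset methods return ordinary tf.data.Dataset objects, so the usual tf.data tooling applies. A minimal sketch for inspecting one batch's shapes and dtypes (the exact element structure is whatever Seq2SeqDataset emits, so tf.nest is used to stay structure-agnostic):

```python
import tensorflow as tf

# Take a single batch and print shape/dtype for every tensor in it,
# regardless of how the elements are nested (tuples, dicts, ...).
for batch in train_dataset.take(1):
    tf.nest.map_structure(lambda t: print(t.shape, t.dtype), batch)
```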
sequence match models
These models take two input sequences, x and y, and a label z.
```python
from nlp_datasets import SeqMatchDataset
from nlp_datasets import SpaceTokenizer
from nlp_datasets.utils import data_dir_utils as utils

files = [
    utils.get_data_file('dssm.query.doc.label.txt'),
    utils.get_data_file('dssm.query.doc.label.txt'),
]
x_tokenizer = SpaceTokenizer()
x_tokenizer.build_from_vocab(utils.get_data_file('dssm.vocab.txt'))
y_tokenizer = SpaceTokenizer()
y_tokenizer.build_from_vocab(utils.get_data_file('dssm.vocab.txt'))

config = {
    'train_batch_size': 2,
    'eval_batch_size': 2,
    'predict_batch_size': 2,
    'buffer_size': 100,
}

dataset = SeqMatchDataset(x_tokenizer, y_tokenizer, config)

train_dataset = dataset.build_train_dataset(files)
print(next(iter(train_dataset)))
print('=' * 120)

eval_dataset = dataset.build_eval_dataset(files)
print(next(iter(eval_dataset)))
print('=' * 120)

predict_files = [utils.get_data_file('dssm.query.doc.label.txt')]
predict_dataset = dataset.build_predict_dataset(predict_files)
print(next(iter(predict_dataset)))
print('=' * 120)
```
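SeqMatchDataset likewise yields tf.data pipelines, so you can ask how many batches a pipeline holds. A small sketch; note that cardinality is only known statically for some pipelines and may come back as UNKNOWN, in which case counting by iteration is the fallback:

```python
import tensorflow as tf

# Static cardinality when available, otherwise count batches by iterating.
card = tf.data.experimental.cardinality(train_dataset)
if card == tf.data.experimental.UNKNOWN_CARDINALITY:
    card = sum(1 for _ in train_dataset)
print('train batches:', int(card))
```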
sequence classification models
These models take an input sequence x and output a label y.
```python
from nlp_datasets import SeqClassifyDataset
from nlp_datasets import SpaceTokenizer
from nlp_datasets.utils import data_dir_utils as utils

files = [
    utils.get_data_file('classify.seq.label.txt')
]
x_tokenizer = SpaceTokenizer()
x_tokenizer.build_from_corpus([utils.get_data_file('classify.seq.txt')])

config = {
    'train_batch_size': 2,
    'eval_batch_size': 2,
    'predict_batch_size': 2,
    'buffer_size': 100
}

dataset = SeqClassifyDataset(x_tokenizer, config)

train_dataset = dataset.build_train_dataset(files)
print(next(iter(train_dataset)))
print('=' * 120)

eval_dataset = dataset.build_eval_dataset(files)
print(next(iter(eval_dataset)))
print('=' * 120)

predict_files = [utils.get_data_file('classify.seq.txt')]
predict_dataset = dataset.build_predict_dataset(predict_files)
print(next(iter(predict_dataset)))
print('=' * 120)
```
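Because the training pipeline is a batched tf.data.Dataset, it can be passed straight to Keras. A minimal sketch, assuming each element is a (token_ids, label) pair; vocab_size and num_classes are placeholders to fill in from your tokenizer and label set:

```python
import tensorflow as tf

vocab_size = 10000  # hypothetical; use the tokenizer's actual vocab size
num_classes = 2     # hypothetical; set to the number of labels in your data

# Tiny bag-of-embeddings classifier, just to show the dataset wiring.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_dataset, epochs=1)
```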
Download files
File details
Details for the file nlp_datasets-1.3.0.tar.gz.
File metadata
- Download URL: nlp_datasets-1.3.0.tar.gz
- Upload date:
- Size: 8.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c8f94180cd245587756ca55fa45d20093b82cc469e81dec4fa7cd4f443e48aa |
| MD5 | b6769c9b10c8d05ec8d9b57d437600ad |
| BLAKE2b-256 | fcf7516d1a01b83d15fc299842452d686c68d1d89dfd53747db260d22daa0bfa |
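To verify a downloaded archive against the digests above, the standard-library hashlib is enough; a minimal sketch (assumes the file sits in the working directory):

```python
import hashlib

expected = '3c8f94180cd245587756ca55fa45d20093b82cc469e81dec4fa7cd4f443e48aa'

# Stream the file in chunks so large archives aren't loaded into memory at once.
h = hashlib.sha256()
with open('nlp_datasets-1.3.0.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):
        h.update(chunk)
print('OK' if h.hexdigest() == expected else 'MISMATCH')
```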
File details
Details for the file nlp_datasets-1.3.0-py3-none-any.whl.
File metadata
- Download URL: nlp_datasets-1.3.0-py3-none-any.whl
- Upload date:
- Size: 20.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94763d7aed480fe08b613e65045622b6cb8be34f145dbdbd342389a5cb7cd4ad |
| MD5 | 5b325c9f771ad388eb93b49fd1a4014c |
| BLAKE2b-256 | 694077c7748eaa1a22f478bc0f86092a696708cb371420026637605d85bc92a9 |