# smile-datasets

La**S**t **mile** datasets: use `tf.data` to solve the last-mile data loading problem for TensorFlow.
If you want to load public datasets, try an existing library such as [tensorflow/datasets](https://github.com/tensorflow/datasets) or [huggingface/datasets](https://github.com/huggingface/datasets). If you want to load a local, personal dataset with minimal boilerplate, use smile-datasets!
## Support Matrix
| task | supported | core abstractions |
|---|---|---|
| question answering | [x] | `ExampleForQuestionAnswering`, `DatasetForQuestionAnswering`, `DatapipeForQuestionAnswering` |
| masked language model | [x] | `ExampleForMaskedLanguageModel`, `DatasetForMaskedLanguageModel`, `DatapipeForMaskedLanguageModel` |
| sequence classification | [x] | `ExampleForSequenceClassification`, `DatasetForSequenceClassification`, `DatapipeForSequenceClassification` |
| token classification | [x] | `ExampleForTokenClassification`, `DatasetForTokenClassification`, `DatapipeForTokenClassification` |
| unsupervised simcse | [x] | `ExampleForUnsupervisedSimCSE`, `DatasetForUnsupervisedSimCSE`, `DatapipeForUnsupervisedSimCSE` |
| supervised simcse | [x] | `ExampleForSupervisedSimCSE`, `DatasetForSupervisedSimCSE`, `DatapipeForSupervisedSimCSE` |
| hard negative simcse | [x] | `ExampleForHardNegativeSimCSE`, `DatasetForHardNegativeSimCSE`, `DatapipeForHardNegativeSimCSE` |
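Each task follows the same naming pattern: an `ExampleFor…` holds one parsed training instance, a `DatasetFor…` is an indexable collection of examples, and a `DatapipeFor…` builds a batched `tf.data.Dataset` from a dataset or from serialized files. As a minimal sketch (assuming the masked language model datapipe exposes the same `from_jsonl_files` constructor as the question answering one shown in the next section), switching tasks is mostly a matter of swapping class names:

```python
from smile_datasets import DatapipeForMaskedLanguageModel

# Assumed to mirror DatapipeForQuestionAnswering.from_jsonl_files;
# verify the exact signature against the library before relying on it.
train_dataset = DatapipeForMaskedLanguageModel.from_jsonl_files(
    input_files=["data/train.jsonl"],
    vocab_file="bert/vocab.txt",
    batch_size=32,
)
```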
## Usage
All datapipes for different tasks have the same interface. Here is an example for the question answering task; you can use the datapipes for other tasks the same way.
### Example for Question Answering
```python
from smile_datasets import DatasetForQuestionAnswering, DatapipeForQuestionAnswering

# each line is a JSON object, e.g. {"sequence": "我喜欢自然语言处理(NLP)"}
train_input_jsonl_files = ["data/train.jsonl"]
train_dataset = DatapipeForQuestionAnswering.from_jsonl_files(
    input_files=train_input_jsonl_files,
    vocab_file="bert/vocab.txt",
    batch_size=32,
)

# check dataset
print(next(iter(train_dataset)))

# model = build_keras_model(...)
# model.compile(...)
# train model
model.fit(train_dataset, callbacks=[...])
```
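The comment above shows a single-field JSONL layout. If you need to generate such a file yourself, here is a minimal sketch using only the standard library (the field name `sequence` is taken from the comment above; a real question answering corpus usually also carries question/answer fields, so check `ParserForQuestionAnswering` for the exact schema it expects):

```python
import json

# Hypothetical samples matching the single-field layout shown above.
samples = [{"sequence": "我喜欢自然语言处理(NLP)"}]

# Write one JSON object per line (JSONL); keep non-ASCII text readable.
with open("data/train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```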
For maximum flexibility, you can always subclass `DatasetForQuestionAnswering` to load your dataset, just like `torch.utils.data.Dataset`:
```python
from smile_datasets import (
    DatasetForQuestionAnswering,
    DatapipeForQuestionAnswering,
    ExampleForQuestionAnswering,
    ParserForQuestionAnswering,
)


class DuReaderDatasetForQuestionAnswering(DatasetForQuestionAnswering):
    """Dataset reader for DuReader dataset."""

    def __init__(self, input_files, vocab_file, subset="rubost", **kwargs) -> None:
        super().__init__()
        self.parser = ParserForQuestionAnswering(tokenizer=None, vocab_file=vocab_file, **kwargs)
        # `readers` is assumed to be your own module of DuReader file readers
        if subset == "rubost":
            self.instances = list(readers.read_dureader_rubost(input_files, **kwargs))
        else:
            self.instances = list(readers.read_dureader_checklist(input_files, **kwargs))
        # keep only instances the parser can convert into examples
        self.examples = []
        for instance in self.instances:
            e = self.parser.parse(instance)
            if not e:
                continue
            self.examples.append(e)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index) -> ExampleForQuestionAnswering:
        return self.examples[index]


dataset = DuReaderDatasetForQuestionAnswering(input_files=["data/train.jsonl"], vocab_file="bert/vocab.txt")
train_dataset = DatapipeForQuestionAnswering.from_dataset(dataset, batch_size=32)

# check dataset
print(next(iter(train_dataset)))

# model = build_keras_model(...)
# model.compile(...)
# train model
model.fit(train_dataset, callbacks=[...])
```
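Because the dataset is a plain indexable sequence (via `__len__` and `__getitem__`), you can inspect parsed examples directly, before any batching happens. A quick sanity check on the `dataset` built above:

```python
# `dataset` is the DuReaderDatasetForQuestionAnswering instance from above.
print(len(dataset))  # number of usable examples after parsing
print(dataset[0])    # a single ExampleForQuestionAnswering
```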
For better performance, you can convert the dataset to TFRecord files ahead of time, and then build the datapipe from the TFRecord files directly:
```python
# save dataset in tfrecord format
dataset.save_tfrecord(output_files="data/train.tfrecord")

# build datapipe from tfrecord files
train_dataset = DatapipeForQuestionAnswering.from_tfrecord_files(input_files="data/train.tfrecord", batch_size=32)

# check dataset
print(next(iter(train_dataset)))

# model = build_keras_model(...)
# model.compile(...)
# train model
model.fit(train_dataset, callbacks=[...])
```
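To sanity-check the written file, you can read it back with vanilla `tf.data`. This only verifies that records were serialized; the feature schema used to parse them back is internal to `DatapipeForQuestionAnswering.from_tfrecord_files`:

```python
import tensorflow as tf

# Count the raw serialized records in the TFRecord file.
raw = tf.data.TFRecordDataset(["data/train.tfrecord"])
num_records = sum(1 for _ in raw)
print(f"wrote {num_records} records")
```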