
smile-datasets

LaSt mile Datasets: Use tf.data to solve the last mile data loading problem for tensorflow.

If you want to load public datasets, try one of the existing public dataset libraries.

If you want to load local, personal datasets with minimal boilerplate, use Smile Dataset!

Support Matrix

| task | supported | core abstractions |
|------|-----------|--------------------|
| question answering | [x] | ExampleForQuestionAnswering, DatasetForQuestionAnswering, DatapipeForQuestionAnswering |
| masked language model | [x] | ExampleForMaskedLanguageModel, DatasetForMaskedLanguageModel, DatapipeForMaskedLanguageModel |
| sequence classification | [x] | ExampleForSequenceClassification, DatasetForSequenceClassification, DatapipeForSequenceClassification |
| token classification | [x] | ExampleForTokenClassification, DatasetForTokenClassification, DatapipeForTokenClassification |
| unsupervised simcse | [x] | ExampleForUnsupervisedSimCSE, DatasetForUnsupervisedSimCSE, DatapipeForUnsupervisedSimCSE |
| supervised simcse | [x] | ExampleForSupervisedSimCSE, DatasetForSupervisedSimCSE, DatapipeForSupervisedSimCSE |
| hard negative simcse | [x] | ExampleForHardNegativeSimCSE, DatasetForHardNegativeSimCSE, DatapipeForHardNegativeSimCSE |
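
Each row maps to the same three building blocks. Roughly, as inferred from the usage examples below (this assumes ExampleForQuestionAnswering is exported from the package top level like the other classes; it is not an official API reference):

from smile_datasets import (
    ExampleForQuestionAnswering,   # one parsed, tokenized training example
    DatasetForQuestionAnswering,   # an indexable collection of examples, like torch.utils.data.Dataset
    DatapipeForQuestionAnswering,  # builds a batched tf.data.Dataset to feed model.fit
)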

Usage

All datapipes for different tasks share the same interface.

Here is an example for the question answering task; datapipes for the other tasks are used in the same way (a sketch for a second task follows this example).

Example for Question Answering

from smile_datasets import DatasetForQuestionAnswering, DatapipeForQuestionAnswering

# each line is a JSON object, e.g. {"sequence": "我喜欢自然语言处理(NLP)"}
train_input_jsonl_files = ["data/train.jsonl"]
train_dataset = DatapipeForQuestionAnswering.from_jsonl_files(
    input_files=train_input_jsonl_files, 
    vocab_file="bert/vocab.txt",
    batch_size=32,
)

# check dataset
print(next(iter(train_dataset)))

# model = build_keras_model(...)
# model.compile(...)
# train model
model.fit(train_dataset, callbacks=[...])
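
Because every datapipe shares this interface, the same calls carry over to the other tasks in the support matrix. Here is a sketch for sequence classification, assuming from_jsonl_files takes the same arguments as above (the exact JSONL fields each task's parser expects will differ):

from smile_datasets import DatapipeForSequenceClassification

# assumed to mirror DatapipeForQuestionAnswering.from_jsonl_files above
train_dataset = DatapipeForSequenceClassification.from_jsonl_files(
    input_files=["data/train.jsonl"],
    vocab_file="bert/vocab.txt",
    batch_size=32,
)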

For maximum flexibility, you can always subclass DatasetForQuestionAnswering to load your dataset, just like torch.utils.data.Dataset:

from smile_datasets import DatasetForQuestionAnswering, DatapipeForQuestionAnswering, ExampleForQuestionAnswering, ParserForQuestionAnswering

class DuReaderDatasetForQuestionAnswering(DatasetForQuestionAnswering):
    """Dataset reader for DuReader dataset."""

    def __init__(self, input_files, vocab_file, subset="rubost", **kwargs) -> None:
        super().__init__()
        self.parser = ParserForQuestionAnswering(tokenizer=None, vocab_file=vocab_file, **kwargs)
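        # NOTE: `readers` is not imported in this snippet; it is assumed to be your own
        # module with DuReader reading utilities that yield raw instances.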
        if subset == "rubost":
            self.instances = list(readers.read_dureader_rubost(input_files, **kwargs))
        else:
            self.instances = list(readers.read_dureader_checklist(input_files, **kwargs))
        self.examples = []
        for instance in self.instances:
            e = self.parser.parse(instance)
            if not e:
                continue
            self.examples.append(e)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index) -> ExampleForQuestionAnswering:
        return self.examples[index]


dataset = DuReaderDatasetForQuestionAnswering(input_files=["data/train.jsonl"], vocab_file="bert/vocab.txt")
train_dataset = DatapipeForQuestionAnswering.from_dataset(dataset, batch_size=32)

# check dataset
print(next(iter(train_dataset)))

# model = build_keras_model(...)
# model.compile(...)
# train model
model.fit(train_dataset, callbacks=[...])
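
If you also have a validation split in the same format, it plugs in the same way. A sketch (the dev file name here is hypothetical):

# hypothetical dev split, same format as the training files
dev_dataset = DuReaderDatasetForQuestionAnswering(input_files=["data/dev.jsonl"], vocab_file="bert/vocab.txt")
valid_dataset = DatapipeForQuestionAnswering.from_dataset(dev_dataset, batch_size=32)

# standard Keras: pass the validation pipeline alongside the training one
model.fit(train_dataset, validation_data=valid_dataset, callbacks=[...])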

For better performance, you can convert the dataset to TFRecord format ahead of time, and then build the datapipe from the TFRecord files directly:

# save dataset in tfrecord format
dataset.save_tfrecord(output_files="data/train.tfrecord")

# build datapipe from tfrecord files
train_dataset = DatapipeForQuestionAnswering.from_tfrecord_files(input_files="data/train.tfrecord", batch_size=32)

# check dataset
print(next(iter(train_dataset)))

# model = build_keras_model(...)
# model.compile(...)
# train model
model.fit(train_dataset, callbacks=[...])
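
If the dataset is written out as several TFRecord shards, a standard glob can collect them. This sketch assumes from_tfrecord_files accepts a list of paths, just as from_jsonl_files does above:

import tensorflow as tf

# collect every shard matching the pattern (tf.io.gfile.glob is standard TensorFlow)
tfrecord_files = tf.io.gfile.glob("data/train-*.tfrecord")
train_dataset = DatapipeForQuestionAnswering.from_tfrecord_files(input_files=tfrecord_files, batch_size=32)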
