
rapidnlp-datasets


Data pipelines for both TensorFlow and PyTorch!

If you want to load public datasets, try:

If you want to load a local, personal dataset with minimal boilerplate, use rapidnlp-datasets!

Installation

pip install -U rapidnlp-datasets

If you work with PyTorch, you should install PyTorch first.

If you work with TensorFlow, you should install TensorFlow first.
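
For example, either backend can be installed with pip before installing this package:

pip install torch        # PyTorch backend
pip install tensorflow   # TensorFlow backend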

Usage

Here are a few examples showing how to use this library.

sequence-classification-quickstart

In PyTorch,

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForSequenceClassification
>>> dataset = DatasetForSequenceClassification.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 
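
Each line of the input JSONL file holds one labeled example. As a purely illustrative sketch (the "sequence" and "label" field names are assumptions, not taken from this page; check the project README for the real schema), a line of sequence_classification.jsonl might look like:

{"sequence": "this is a great movie", "label": 1}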

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForSequenceClassifiation
>>> dataset, d = TFDatasetForSequenceClassifiation.from_jsonl_files(
        input_files=["testdata/sequence_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

In particular, when working with TensorFlow you can save the dataset in tfrecord format, and then build a dataset from the tfrecord files directly!

>>> d.save_tfrecord("testdata/sequence_classification.tfrecord")
2021-12-08 14:52:41,295    INFO             utils.py  128] Finished to write 2 examples to tfrecords.
>>> dataset = TFDatasetForSequenceClassifiation.from_tfrecord_files("testdata/sequence_classification.tfrecord")
>>> for idx, batch in enumerate(iter(dataset)):
...     print("No.{} batch: \n{}".format(idx, batch))
... 

question-answering-quickstart

In PyTorch:

>>> import torch
>>> from rapidnlp_datasets.pt import DatasetForQuestionAnswering
>>>
>>> dataset = DatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
    )
>>> dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=32, collate_fn=dataset.batch_padding_collate)
>>> for idx, batch in enumerate(dataloader):
...     print("No.{} batch: \n{}".format(idx, batch))
... 
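
Again, the input is a JSON-lines file with one example per line. As an illustrative sketch only (the field names here are assumptions, not taken from this page), a line of qa.jsonl might look like:

{"context": "Harry Potter was written by J. K. Rowling.", "question": "Who wrote Harry Potter?", "answer": "J. K. Rowling"}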

In TensorFlow,

>>> from rapidnlp_datasets.tf import TFDatasetForQuestionAnswering
>>> dataset, d = TFDatasetForQuestionAnswering.from_jsonl_files(
        input_files="testdata/qa.jsonl",
        vocab_file="testdata/vocab.txt",
        return_self=True,
    )
2021-12-08 15:09:06,747    INFO question_answering_dataset.py  101] Read 3 examples in total.
>>> for idx, batch in enumerate(iter(dataset)):
...     print()
...     print("NO.{} batch: \n{}".format(idx, batch))
... 

In particular, when working with TensorFlow you can save the dataset in tfrecord format, and then build a dataset from the tfrecord files directly!

>>> d.save_tfrecord("testdata/qa.tfrecord")
2021-12-08 15:09:31,329    INFO             utils.py  128] Finished to write 3 examples to tfrecords.
>>> dataset = TFDatasetForQuestionAnswering.from_tfrecord_files(
        "testdata/qa.tfrecord",
        batch_size=32,
        padding="batch",
    )
>>> for idx, batch in enumerate(iter(dataset)):
...     print()
...     print("NO.{} batch: \n{}".format(idx, batch))
... 

token-classification-quickstart
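
No example is shown for this section on this page. As a hedged sketch, assuming the API mirrors the quickstarts above (the class name, the input file, and its schema are assumptions, not confirmed here), loading a token-classification dataset in PyTorch would look like:

>>> from rapidnlp_datasets.pt import DatasetForTokenClassification  # class name assumed by analogy with the quickstarts above
>>> dataset = DatasetForTokenClassification.from_jsonl_files(
        input_files=["testdata/token_classification.jsonl"],
        vocab_file="testdata/vocab.txt",
    )

Batching with torch.utils.data.DataLoader and dataset.batch_padding_collate would then follow the sequence-classification quickstart.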

masked-language-models-quickstart
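
No example is shown for this section either. Following the same pattern (the class name, the from_jsonl_files constructor, and the input file are assumptions, not confirmed here), a TensorFlow sketch might be:

>>> from rapidnlp_datasets.tf import TFDatasetForMaskedLanguageModel  # class name assumed by analogy
>>> dataset = TFDatasetForMaskedLanguageModel.from_jsonl_files(
        input_files=["testdata/mlm.jsonl"],
        vocab_file="testdata/vocab.txt",
    )

Iterating over the dataset, and saving/loading tfrecord files, would then follow the TensorFlow quickstarts above.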

simcse-quickstart
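
This section is also empty here. Assuming SimCSE follows the same pattern (the class name and input file are assumptions, not confirmed by this page), a PyTorch sketch might be:

>>> from rapidnlp_datasets.pt import DatasetForSimCSE  # class name assumed by analogy
>>> dataset = DatasetForSimCSE.from_jsonl_files(
        input_files=["testdata/simcse.jsonl"],
        vocab_file="testdata/vocab.txt",
    )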
