
numpy2tfrecord

Simple helper library to convert numpy data to tfrecord and build a tensorflow dataset.

Installation

$ git clone git@github.com:yonetaniryo/numpy2tfrecord.git
$ cd numpy2tfrecord
$ pip install .

or simply using pip:

$ pip install numpy2tfrecord

How to use

Convert a collection of numpy data to tfrecord

You can convert samples represented in the form of a dict to tf.train.Example and save them as a tfrecord.

import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    x = np.arange(100).reshape(10, 10).astype(np.float32)  # float array
    y = np.arange(100).reshape(10, 10).astype(np.int64)  # int array
    a = 5  # int
    b = 0.3  # float
    sample = {"x": x, "y": y, "a": a, "b": b}
    converter.convert_sample(sample)  # convert data sample

You can also convert a list of samples at once using convert_list.

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = [
        {
            "x": np.random.rand(64).astype(np.float32),
            "y": np.random.randint(0, 10),
        }
        for _ in range(32)
    ]  # list of 32 samples

    converter.convert_list(samples)

Or a batch of samples at once using convert_batch.

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = {
        "x": np.random.rand(32, 64).astype(np.float32),
        "y": np.random.randint(0, 10, size=32).astype(np.int64),
    }  # batch of 32 samples

    converter.convert_batch(samples)

So what are the advantages of Numpy2TFRecordConverter over tf.data.Dataset.from_tensor_slices? Simply put, with tf.data.Dataset.from_tensor_slices, all of the samples to be converted into a dataset must fit in memory at once. With Numpy2TFRecordConverter, you can instead append samples to the tfrecord sequentially, without reading them all into memory beforehand.

Build a tensorflow dataset from tfrecord

Once stored in a tfrecord, samples can be streamed back using tf.data.TFRecordDataset.

from numpy2tfrecord import build_dataset_from_tfrecord

dataset = build_dataset_from_tfrecord("test.tfrecord")

The dataset can then be used directly in a training loop.

for batch in dataset.as_numpy_iterator():
    x, y = batch.values()
    ...

Speeding up PyTorch data loading with numpy2tfrecord!

https://gist.github.com/yonetaniryo/c1780e58b841f30150c45233d3fe6d01

import os
import time

import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter, build_dataset_from_tfrecord
import torch
from torchvision import datasets, transforms

dataset = datasets.MNIST(".", download=True, transform=transforms.ToTensor())

# convert to tfrecord
with Numpy2TFRecordConverter("mnist.tfrecord") as converter:
    converter.convert_batch(
        {
            "x": dataset.data.numpy().astype(np.int64),
            "y": dataset.targets.numpy().astype(np.int64),
        }
    )

torch_loader = torch.utils.data.DataLoader(
    dataset, batch_size=32, pin_memory=True, num_workers=os.cpu_count()
)
tic = time.time()
for e in range(5):
    for batch in torch_loader:
        x, y = batch
elapsed = time.time() - tic
print(f"elapsed time with pytorch dataloader: {elapsed:0.2f} sec for 5 epochs")

tf_loader = build_dataset_from_tfrecord("mnist.tfrecord").batch(32).prefetch(1)
tic = time.time()
for e in range(5):
    for batch in tf_loader.as_numpy_iterator():
        x, y = batch.values()
elapsed = time.time() - tic
print(f"elapsed time with tf dataloader: {elapsed:0.2f} sec for 5 epochs")

⬇️

elapsed time with pytorch dataloader: 41.10 sec for 5 epochs
elapsed time with tf dataloader: 17.34 sec for 5 epochs

Project details


Download files

Download the file for your platform.

Source Distribution

numpy2tfrecord-0.0.3.tar.gz (5.1 kB)

Uploaded Source

Built Distribution

numpy2tfrecord-0.0.3-py3-none-any.whl (5.3 kB)

Uploaded Python 3

File details

Details for the file numpy2tfrecord-0.0.3.tar.gz.

File metadata

  • Download URL: numpy2tfrecord-0.0.3.tar.gz
  • Upload date:
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.8

File hashes

Hashes for numpy2tfrecord-0.0.3.tar.gz

  • SHA256: fa44db6cc26677f3886ef1c5dc0bda13f3cf390247907388ee62acf12035f111
  • MD5: 7a899e98c894a8c67703416498dfc375
  • BLAKE2b-256: 4e0b919950e84385fa697966ef54683b3b4d981a206f4063c478517477263c67

File details

Details for the file numpy2tfrecord-0.0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for numpy2tfrecord-0.0.3-py3-none-any.whl

  • SHA256: e21e3507f92c3c5e90633fe8483d466543c16b754f062269575aa460f2090ed7
  • MD5: 0fd4dc0b35258a617895d58e104a6f80
  • BLAKE2b-256: 8465265df14bfda999f279f34070b58a0f38df56cf2079206193082f29baf32d
