Large data storage for PyTorch

H5Record

A storage format for large datasets (> 100 GB, <= 1 TB) in PyTorch (work in progress)

Why?

  • Writing large datasets is still a wild west in PyTorch. Approaches seen in the wild include:

    • a large directory with lots of small files: slow I/O, since complex files are fetched and deserialized frequently
    • a database approach: depending on the database engine used, multi-process reads are usually not supported
    • both of the above scale non-linearly in storage size as the data grows
  • TFRecord solves the above problems well: multi-process fetching, (de)compression, and fast serialization (protobuf)

  • However, the TFRecord ports for PyTorch support neither dataset size evaluation (used frequently by DataLoader) nor index-level access (important for data evaluation or verification)

H5Record aims to tackle these problems by packing the dataset into an HDF5 file, with an easy-to-use interface built on predefined column types (String, Image, Sequence, Integer).
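For example, an image-classification dataset might be described with a schema like the following (a minimal sketch: Image and Integer are named above, and the name= keyword pattern is assumed from the String/Float example further down):

from h5record import Image, Integer

# Hypothetical schema: one image column and one integer label column,
# assuming Image and Integer follow the same name= pattern as String/Float.
schema = (
    Image(name='picture'),
    Integer(name='class_id'),
)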

Some advantages of using H5Record:

  • Supports multi-process reads

  • Relatively simple to use, with low technical debt

  • Supports compression/decompression on the fly

  • Quick to load into memory if required

Simple usage

pip install h5record

  1. Sentence Similarity
from h5record import H5Dataset, Float, String

# Each column is declared as a typed, named field.
schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)
data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

# Rows are supplied as dicts keyed by the schema's column names.
def pair_iter():
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

# Building the dataset writes the iterator's rows into the HDF5 file;
# afterwards the dataset supports len() and index-level access.
dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
for idx in range(len(dataset)):
    print(dataset[idx])
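
Because H5Dataset supports len() and index-level access, it can be fed straight into a standard PyTorch DataLoader. A minimal sketch, assuming the dataset built above (num_workers > 0 relies on the multi-process read support noted earlier):

from torch.utils.data import DataLoader

# Map-style dataset: the DataLoader batches and shuffles by index,
# and worker processes read rows from the HDF5 file in parallel.
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2)
for batch in loader:
    print(batch)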

Note

Since development is still in progress, this package should be used with care on storage with FAT or FAT-32 formatting.

Comparison between different compression algorithms

No chunking is used.

Compression type    File size    Read speed (rows/second)
no compression      2.0 GB       2084.55 it/s
lzf                 1.7 GB       1496.14 it/s
gzip                1.1 GB        843.78 it/s

Benchmarked on an i7-9700 with a 1 TB NVMe SSD.
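
For reference, compression in HDF5 is declared per dataset when it is created. The following sketches the trade-off benchmarked above using plain h5py (this is the underlying HDF5 library, not the h5record API):

import h5py
import numpy as np

data = np.random.rand(100_000, 64)
with h5py.File('compressed.h5', 'w') as f:
    # lzf: faster reads, lighter compression
    f.create_dataset('fast', data=data, compression='lzf')
    # gzip: smaller files, slower reads; level 4 is a middle ground
    f.create_dataset('small', data=data, compression='gzip', compression_opts=4)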

If you are interested in learning more, feel free to check out the note as well!
