Large data storage for PyTorch
Project description
H5Record
Large dataset (> 100 GB, <= 1 TB) storage format for PyTorch (work in progress)
Why?
Writing large datasets is still the wild west in PyTorch. Approaches seen in the wild include:

- a large directory with lots of small files: slow IO, since many files are fetched and deserialized frequently
- a database: depending on the database engine used, multi-process read is usually not supported
- both of the above scale non-linearly in data and storage size

TFRecord solves these problems well: multi-process fetch, on-the-fly (de)compression, and fast serialization via protobuf.

However, the TFRecord port for PyTorch does not support dataset size evaluation (used frequently by DataLoader), and no index-level access is available (important for data evaluation or verification).
H5Record aims to tackle these problems by compressing the dataset into an HDF5 file with an easy-to-use interface through predefined column types (String, Image, Sequence, Integer).
Some advantages of using H5Record:

- Supports multi-process read
- Relatively simple to use, with low technical debt
- Supports compression/decompression on the fly
- Quick load to memory if required
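These advantages largely come from HDF5 itself. A minimal h5py sketch (illustrative only, not H5Record's actual internals; the file and dataset names are made up) of transparent compression and index-level access:

```python
import h5py
import numpy as np

# Write: gzip compression is applied transparently, per dataset.
with h5py.File('demo.h5', 'w') as f:
    f.create_dataset('labels', data=np.arange(1000, dtype=np.float32),
                     compression='gzip')

# Read: no need to deserialize the whole file.
with h5py.File('demo.h5', 'r') as f:
    print(f['labels'][42])   # index-level access: reads a single row
    labels = f['labels'][:]  # quick full load to memory if required
```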
Simple usage
```
pip install h5record
```
- Sentence Similarity
```python
from h5record import H5Dataset, Float, String

schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)
data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

def pair_iter():
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
for idx in range(len(dataset)):
    print(dataset[idx])
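Because H5Dataset supports len() and index-level access, any map-style consumer (such as PyTorch's DataLoader) can drive it. A torch-free sketch of shuffled batch sampling over such a dataset (the helper name is ours, not part of h5record):

```python
import random

def shuffled_batches(dataset, batch_size, seed=0):
    # Index-level access (dataset[i]) is what makes shuffling cheap:
    # we permute indices, not the stored data itself.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]
```

Any object with `__len__` and `__getitem__` works here, which is exactly the contract DataLoader relies on.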
Note

Due to in-progress development, this package should be used with care on FAT/FAT-32 filesystems (FAT-32 caps individual files at 4 GB, which large HDF5 files can easily exceed).
Comparison between different compression algorithms

No chunking is used.

Compression type | File size | Read speed (rows/second)
---|---|---
no compression | 2.0G | 2084.55 it/s
lzf | 1.7G | 1496.14 it/s
gzip | 1.1G | 843.78 it/s

Benchmarked on an i7-9700 with a 1 TB NVMe SSD.
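A hypothetical helper for reproducing the read-speed column (not the script used for the table above): time sequential index access and report rows per second.

```python
import time

def rows_per_second(dataset, n=1000):
    """Time sequential index access over any object with __len__/__getitem__."""
    n = min(n, len(dataset))
    start = time.perf_counter()
    for idx in range(n):
        _ = dataset[idx]  # one row fetched and decompressed per access
    return n / (time.perf_counter() - start)
```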
If you are interested in learning more, feel free to check out the notes as well!
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file h5record-1.0.4.tar.gz.
File metadata
- Download URL: h5record-1.0.4.tar.gz
- Upload date:
- Size: 9.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.6.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | dbebef9aa9e8ba413bf6c5defedc1333748b4dbb716432708ec35dafbd7ce6fe
MD5 | 9b752eb1e8ebf20047473d49c67675e6
BLAKE2b-256 | a37f7c8728bb2c531d4859ffcc5c6a773832c9a3a9b78e5a009a6b6d4e7f35a0
File details
Details for the file h5record-1.0.4-py3-none-any.whl.
File metadata
- Download URL: h5record-1.0.4-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.6.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | d11d67b1d411c80b9dc95616de677c9f20da2fe82256e3d199ff64bdb8cdfd44
MD5 | 80b0806190dee9751edca79e835df5f8
BLAKE2b-256 | 2ac5e61bd5e1a3d2edc5248674f3d6f6017155d51d72999276e65094cd0ad9d8