
dareblopy

Project description



Framework-agnostic, faster data reading for deep learning.

A native extension for Python built with C++ and pybind11.

Installation • Why? • What is the performance gain? • Tutorial • License


DataReadingBlocks for Python is a Python module that provides a collection of C++-backed data reading primitives. It targets deep-learning needs but is framework agnostic.

Installation

Available as a PyPI package:

$ pip install dareblopy

To build from source, refer to the wiki page.

Why?

Development initially started to speed up reading from ZIP archives, reduce data copying, and increase the time during which the GIL is released, to improve concurrency.

But why read from a ZIP archive? Reading a large number of small files (which is often the case) can be slow, especially if the drive is network-attached, e.g. over NFS. However, the bottleneck here is rarely the disk speed; it is the overhead of the filesystem: name lookup, creating file descriptors, and additional network usage if NFS is used.

If all the small files are agglomerated into a larger file (or several large files), performance improves substantially. This is exactly the reason behind TFRecords in TensorFlow:

To read data efficiently it can be helpful to serialize your data and store it in a set of files (100-200MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. This can also be useful for caching any data-preprocessing.

The downside of TFRecords is that it's TensorFlow only.

A much simpler, yet still effective, solution is to store data in a ZIP archive with zero compression. However, using the zipfile package from the standard library can be slow, since it is implemented purely in Python and in certain cases can cause unnecessary data copying.
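As a minimal sketch of what "ZIP with zero compression" means in practice, the snippet below packs a few blobs with the standard library's zipfile module using ZIP_STORED. The helper name pack_uncompressed and the sample data are hypothetical and only serve the illustration; this shows how such an archive could be produced, not how DareBlopy reads it.

```python
import io
import zipfile


def pack_uncompressed(files):
    """Pack a {name: bytes} mapping into an in-memory ZIP without compression."""
    buffer = io.BytesIO()
    # ZIP_STORED writes each member verbatim, so readers need no inflate step.
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Hypothetical sample data standing in for many small dataset files.
samples = {"%d.bin" % i: bytes([i]) * 1024 for i in range(4)}
blob = pack_uncompressed(samples)

# Re-open the archive and check that members are stored, not deflated.
with zipfile.ZipFile(io.BytesIO(blob)) as archive:
    infos = archive.infolist()
    stored = all(info.compress_type == zipfile.ZIP_STORED for info in infos)
    same_size = all(info.compress_size == info.file_size for info in infos)
    roundtrip = {info.filename: archive.read(info.filename) for info in infos}
```

Because every member is stored verbatim, its compressed size equals its original size, and a reader can serve the bytes straight from the archive.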

That's precisely the reason behind the development of DareBlopy. In addition, it has the following features:

  • Reading JPEG images directly to numpy arrays (from ZIP and from filesystem), to reduce memory usage and unnecessary data copies.
  • Two JPEG backends selectable at run-time: libjpeg and libjpeg-turbo. Both backends are embedded into DareBlopy and do not depend on any system package.
  • Reading of TFRecords (not all features are supported, though) without a dependency on TensorFlow, which enables using datasets stored as TFRecords with ML frameworks other than TensorFlow, e.g. PyTorch.
  • Random yielders, iterators, and dataloaders to simplify doing deep learning with TFRecords in other ML frameworks.
  • No dependency on system packages. You install it from pip - it works.
  • Support for compressed ZIP archives, including LZ4 compression.
  • Virtual filesystem. Allows mounting of zip archives.

What is the performance gain?

Well, it depends a lot on the particular use case. Let's consider several. You can find all the details of the benchmarks in run_benchmark.py. You can also run it on your machine and compare the results to the ones reported here.

Reading files to bytes

Python's bytes object can be a bit nasty. Generally speaking, you cannot return data from C/C++ land as a bytes object without making a copy. That's because the memory for a bytes object must be allocated as one chunk for both the header and the data itself. In DareBlopy this extra copy is eliminated; you can find details here.
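The general behavior can be observed from pure Python. The sketch below uses a bytearray as a stand-in for memory owned by native code (an assumption made for illustration only) and shows that wrapping it in a memoryview shares the memory, while converting it to bytes produces an independent copy. It illustrates the constraint, not how DareBlopy avoids it.

```python
# A buffer owned by "someone else", standing in for memory allocated in C/C++.
native_buffer = bytearray(b"raw file contents")

view = memoryview(native_buffer)   # zero-copy: shares the buffer's memory
copied = bytes(view)               # a new bytes object: the contents are copied

# Mutate the original buffer in place (same length, so no reallocation).
native_buffer[0:3] = b"RAW"

view_sees_change = bytes(view[0:3]) == b"RAW"   # the view reflects the mutation
copy_unchanged = copied[0:3] == b"raw"          # the bytes copy does not
```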

In this test scenario, we read 200 files, each ~30 KB. Reading is done from the local filesystem and from a ZIP archive.
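For reference, a comparable baseline can be sketched with the standard library only. This is not the project's run_benchmark.py; the file names, the random payloads, and the helper functions are assumptions made for illustration, and the timings it prints depend entirely on your machine.

```python
import os
import tempfile
import time
import zipfile

FILE_COUNT = 200        # matches the scenario above
FILE_SIZE = 30 * 1024   # ~30 KB per file


def read_from_filesystem(directory, names):
    """Read every file individually from a directory."""
    out = []
    for name in names:
        with open(os.path.join(directory, name), "rb") as f:
            out.append(f.read())
    return out


def read_from_zip(zip_path, names):
    """Read every member from a single ZIP archive using zipfile."""
    with zipfile.ZipFile(zip_path) as archive:
        return [archive.read(name) for name in names]


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    names = ["%d.bin" % i for i in range(FILE_COUNT)]
    zip_path = os.path.join(tmp, "data.zip")
    # Write the same payload both as a loose file and as a stored ZIP member.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for name in names:
            payload = os.urandom(FILE_SIZE)
            with open(os.path.join(tmp, name), "wb") as f:
                f.write(payload)
            archive.writestr(name, payload)

    fs_data, fs_seconds = timed(read_from_filesystem, tmp, names)
    zip_data, zip_seconds = timed(read_from_zip, zip_path, names)

print("filesystem: %.4f s, zipfile: %.4f s" % (fs_seconds, zip_seconds))
```

On a warm local disk the two numbers may be close; the gap the text describes is expected to show up mainly on slow or network-attached filesystems.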

Reading files using DareBlopy is faster even when reading from the filesystem, but when reading from ZIP it provides a substantial improvement.

Reading JPEGs to numpy's ndarray

This is where DareBlopy's feature of reading directly to a numpy array is demonstrated. When the file is read, it is decompressed directly into a preallocated numpy array, and all of that happens in C++ land while the GIL is released.

Note: here PIL v.7.0.0 is used, on Ubuntu 18. In my installation, it does not use libjpeg-turbo.

In this case, the difference between ZIP and filesystem is quite insignificant, but things change dramatically if the filesystem is streamed over a network:

Reading TFRecords

DareBlopy can read TensorFlow records. This functionality was originally developed for reading the FFHQ dataset from TFRecords.

It introduces an alias to the string type, uint8, which allows returning a numpy array directly if the shape is known beforehand.

For example, code like:

        features = {
            'data': db.FixedLenFeature([], db.string)
        }

Can be replaced with:

        features = {
            'data': db.FixedLenFeature([3, 32, 32], db.uint8)
        }
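Conceptually, declaring a shape for a byte-string feature amounts to reinterpreting the record's raw bytes as a fixed-shape array of unsigned 8-bit values. The sketch below illustrates that idea with the standard library's memoryview; the payload is made up, and this is not DareBlopy's implementation. With numpy, the equivalent would be np.frombuffer(raw, dtype=np.uint8).reshape(shape).

```python
# Hypothetical payload: 3 * 32 * 32 raw bytes, as a 'data' feature might hold.
shape = [3, 32, 32]
raw = bytes(range(256)) * 12   # 3072 bytes == 3 * 32 * 32

# Reinterpret the flat buffer as a 3-D array of unsigned bytes, without copying.
tensor = memoryview(raw).cast("B", shape=shape)

first = tensor[0, 0, 0]    # first byte of the buffer
last = tensor[2, 31, 31]   # last byte of the buffer
```

No bytes are moved here: the shaped view and the original buffer share the same memory.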

This decoding to a numpy array comes at zero cost, which is demonstrated below:

Tutorial

Import DareBlopy

import dareblopy as db
from IPython.display import Image, display
import PIL.Image

Open zip archive:

archive = db.open_zip_archive("test_utils/test_image_archive.zip")

Read image to bytes and display:

b = archive.open_as_bytes('0.jpg')
Image(b)


Alternatively, read image to numpy:

img = archive.read_jpg_as_numpy('0.jpg')
img.shape  # (256, 256, 3)
display(PIL.Image.fromarray(img))


For more advanced usage please refer to:

License

Apache License 2.0

Download files


Source Distributions

No source distribution files available for this release.

Built Distributions

dareblopy-0.0.5-cp38-cp38-win_amd64.whl (1.4 MB view details)

Uploaded CPython 3.8 Windows x86-64

dareblopy-0.0.5-cp38-cp38-win32.whl (1.2 MB view details)

Uploaded CPython 3.8 Windows x86

dareblopy-0.0.5-cp38-cp38-manylinux2010_x86_64.whl (12.3 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

dareblopy-0.0.5-cp37-cp37m-win_amd64.whl (1.4 MB view details)

Uploaded CPython 3.7m Windows x86-64

dareblopy-0.0.5-cp37-cp37m-win32.whl (1.2 MB view details)

Uploaded CPython 3.7m Windows x86

dareblopy-0.0.5-cp37-cp37m-manylinux2010_x86_64.whl (12.3 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

dareblopy-0.0.5-cp36-cp36m-win_amd64.whl (1.4 MB view details)

Uploaded CPython 3.6m Windows x86-64

dareblopy-0.0.5-cp36-cp36m-win32.whl (1.2 MB view details)

Uploaded CPython 3.6m Windows x86

dareblopy-0.0.5-cp36-cp36m-manylinux2010_x86_64.whl (12.3 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

dareblopy-0.0.5-cp35-cp35m-manylinux2010_x86_64.whl (12.3 MB view details)

Uploaded CPython 3.5m manylinux: glibc 2.12+ x86-64

dareblopy-0.0.5-cp34-cp34m-manylinux2010_x86_64.whl (12.3 MB view details)

Uploaded CPython 3.4m manylinux: glibc 2.12+ x86-64

dareblopy-0.0.5-cp27-cp27mu-manylinux2010_x86_64.whl (12.3 MB view details)

Uploaded CPython 2.7mu manylinux: glibc 2.12+ x86-64

dareblopy-0.0.5-cp27-cp27m-win_amd64.whl (1.3 MB view details)

Uploaded CPython 2.7m Windows x86-64

dareblopy-0.0.5-cp27-cp27m-win32.whl (1.1 MB view details)

Uploaded CPython 2.7m Windows x86

dareblopy-0.0.5-cp27-cp27m-manylinux2010_x86_64.whl (12.3 MB view details)

Uploaded CPython 2.7m manylinux: glibc 2.12+ x86-64

File details

Details for the file dareblopy-0.0.5-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 1b3c724f0f7bc83e8f58c3ff8d19e7b5093bab3da9c74289a3b0066b34602495
MD5 16b315f6384b4e108b55cbeb06225f6b
BLAKE2b-256 47bdd5aa6ff64be048b8519d8d42461b2f003dcaf8bfdeeecd7a8177708a603a

See more details on using hashes here.

File details

Details for the file dareblopy-0.0.5-cp38-cp38-win32.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp38-cp38-win32.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.8, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp38-cp38-win32.whl
Algorithm Hash digest
SHA256 4738c620ec4829396d26e82749607f962e3f9ad42cf1ae4d51503909d086e900
MD5 3e62f1b7dda38a14c250fab75e8bf667
BLAKE2b-256 0733d449f6eb56f913407f2e4c29af10e2153354d8c6d21dcc19088c0e7dfe18


File details

Details for the file dareblopy-0.0.5-cp38-cp38-manylinux2010_x86_64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp38-cp38-manylinux2010_x86_64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.8, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/49.3.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9

File hashes

Hashes for dareblopy-0.0.5-cp38-cp38-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 9fbfe0f06d8fb68887225ab4b2f7aa831f6b26a9405f266ec50bcae33794c329
MD5 990885dbbf8c6d7898f575c2f5f4216a
BLAKE2b-256 85e6105445a3b2e303a78d166a8678ae8dbeacde5c3b53b4542afa50bc766573


File details

Details for the file dareblopy-0.0.5-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 199d11e9ef62df5374c91d549ac8116145eb86d22605f16447e04284f2812d87
MD5 da20415852d31daafe30040c12b4a69a
BLAKE2b-256 6491fbbd7c47936e13d97609c423c7cb17cd5e2e77ea9a47684c1f7474eb034a


File details

Details for the file dareblopy-0.0.5-cp37-cp37m-win32.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp37-cp37m-win32.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp37-cp37m-win32.whl
Algorithm Hash digest
SHA256 1e5ab6a2deb41e10ff65509e82692d74ceaf9b7d837f1bfdb6e2827dfc8b9a2c
MD5 40e1b98f1a56bcec893d220ab854d923
BLAKE2b-256 445397bcda59ad0a7bc089c793fad90231589d2dfe008fce61d6d37dcfb52441


File details

Details for the file dareblopy-0.0.5-cp37-cp37m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp37-cp37m-manylinux2010_x86_64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.7m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/49.3.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9

File hashes

Hashes for dareblopy-0.0.5-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 0302b0d4db187872f36401dc9c8342ceaf03dfa4654bbf7041f46300af8e6e51
MD5 b383b26619ea1180bb6a8ecc6f3ca05e
BLAKE2b-256 1497f2cec7c0f2d8c0be734d7b0b7203c54c602bcc2a29f1763fe350059ba69c


File details

Details for the file dareblopy-0.0.5-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 4d4cb248ab209e5a73ba291208ed5e37e299f3fd991b84e99708272b9f1ce805
MD5 7a23a203c46ad46dad825a5b260e6dbe
BLAKE2b-256 34c1bd34f1baf87a65674050fd00bdb07513545d77b9f296ba9c1dc4e817f7d9


File details

Details for the file dareblopy-0.0.5-cp36-cp36m-win32.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp36-cp36m-win32.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.6m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp36-cp36m-win32.whl
Algorithm Hash digest
SHA256 fca4c28aafd53fcd2fbe2f99cec60b2866fcc4429e2a2be85b8ecd227de9cd23
MD5 df0c6a2aa7fcc8a38ac3e18e10cfb9d5
BLAKE2b-256 963467f316e43f1d7c0fd83f23cef0cd4678022c74d9679fe55cc7f76379d814


File details

Details for the file dareblopy-0.0.5-cp36-cp36m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp36-cp36m-manylinux2010_x86_64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.6m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/49.3.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9

File hashes

Hashes for dareblopy-0.0.5-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 c361291c55f28569c3c0d36e47a50d0d35c30e1d4eb9aafdfe79faa79e3c6970
MD5 3439fd65635d1d2fd76297e2c8424ef5
BLAKE2b-256 ddaf5248bb058c558886bd369a5d53d8dd87c77bbddf44c088b2b677e5de9ee2


File details

Details for the file dareblopy-0.0.5-cp35-cp35m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp35-cp35m-manylinux2010_x86_64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.5m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/49.3.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9

File hashes

Hashes for dareblopy-0.0.5-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 0aaada1378ac6decbbc86359c0c67ca3e7fc512a49a82e3c61b411ad2cfb42d7
MD5 ddbdac6b64b76acab0c7023ff401107a
BLAKE2b-256 346433c88923e06d186330bc2867bb6d87ce3ec9b2a0c46ed25fd7ca287d001b


File details

Details for the file dareblopy-0.0.5-cp34-cp34m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp34-cp34m-manylinux2010_x86_64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.4m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/49.3.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9

File hashes

Hashes for dareblopy-0.0.5-cp34-cp34m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 58ca250a12dfcb687845842f67d86f6f9973806b7d6400215c1125ce90034e13
MD5 3a6963e945cd8d432e06bee730167a44
BLAKE2b-256 6b6cc7d9d1a1b29de4ce64873f8cc812fde6c782a7fc0b3e89f2f413d3bd37e8


File details

Details for the file dareblopy-0.0.5-cp27-cp27mu-manylinux2010_x86_64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp27-cp27mu-manylinux2010_x86_64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 2.7mu, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/49.3.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9

File hashes

Hashes for dareblopy-0.0.5-cp27-cp27mu-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 f3ac40253b9fdafa6e55002be6701426bc3c15c9abe331db2c389413f0fadc31
MD5 a141d8e835bfa60b60c8f1e9bf5b6cf0
BLAKE2b-256 6284121577c1234f9cd6dbdbe8a2cec3fc09532dd1261954187df8ecf78fe923


File details

Details for the file dareblopy-0.0.5-cp27-cp27m-win_amd64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp27-cp27m-win_amd64.whl
  • Upload date:
  • Size: 1.3 MB
  • Tags: CPython 2.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp27-cp27m-win_amd64.whl
Algorithm Hash digest
SHA256 336c22cd738c9c9baf9358c07098976affeaeb1719e56c15bf8c9b0ddd2b8b41
MD5 dae7525e39c1dd8c224a7875060be0b1
BLAKE2b-256 01b32457d79f767dff36cbf26b4e313ef0f0fd8d9ac0027ff9d70c94e577f771


File details

Details for the file dareblopy-0.0.5-cp27-cp27m-win32.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp27-cp27m-win32.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 2.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.6.6

File hashes

Hashes for dareblopy-0.0.5-cp27-cp27m-win32.whl
Algorithm Hash digest
SHA256 cc1c0dfaf49ab98d0f90c5deae7b549709f7975a6c75f063bbf02d01c3b9848a
MD5 2d0a7d84f80a47faf92362d92644bbe9
BLAKE2b-256 995eb49e24dc0974db2da62ce82c62794a71bf8e6338ed7283662c0fc863e57f


File details

Details for the file dareblopy-0.0.5-cp27-cp27m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: dareblopy-0.0.5-cp27-cp27m-manylinux2010_x86_64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 2.7m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/49.3.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9

File hashes

Hashes for dareblopy-0.0.5-cp27-cp27m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 799d58823e00e7930af7da43fe2c56bd32d2a81cec67121d6b95dd3e3704919b
MD5 a3440f719920f7b0661b81523bbd6330
BLAKE2b-256 15267d73fb5b225e297cf4091dd30838b818ccf953872ee8ec9002ea2b8f0115

