Fast and scalable array for machine learning and artificial intelligence

bigarray

Fast and scalable numpy array using Memory-mapped I/O

Stable build:

pip install bigarray

Nightly build from GitHub:

pip install git+https://github.com/trungnt13/bigarray@master


The three principles

  • Transparency: everything is a numpy.array; metadata and support for extra features (e.g. multiprocessing, indexing, etc.) are handled quietly in the background.
  • Pragmatism: fast yet easy to use, with a simplified API for common use cases.
  • Focus: "Do one thing and do it well."
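
The transparency principle rests on a property of memory-mapped arrays that can be checked with NumPy alone. The sketch below uses plain numpy.memmap rather than bigarray's own classes, and the file path is a hypothetical temporary location. It only illustrates that a memory-mapped array behaves like an ordinary numpy array.

```python
import os
import tempfile

import numpy as np

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.mmap")

# Create a memory-mapped array on disk and fill it through normal slicing.
mm = np.memmap(path, dtype="int32", mode="w+", shape=(1000,))
mm[:] = np.arange(1000, dtype="int32")
mm.flush()  # push pending changes to the file
del mm

# Re-open read-only: the file is mapped, not read into memory up front.
ro = np.memmap(path, dtype="int32", mode="r", shape=(1000,))
is_ndarray = isinstance(ro, np.ndarray)  # np.memmap subclasses np.ndarray
chunk_sum = int(ro[10:20].sum())         # ordinary slicing and reductions work
print(is_ndarray, chunk_sum)
```

Because the mapped object is an ndarray subclass, existing NumPy code can usually consume it without changes, which is presumably what lets the library keep its extra bookkeeping out of sight.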

The benchmarks

About 535 times faster than HDF5 data (using h5py) and 223 times faster than a normal numpy.array, measured as the total time to open and iterate over the data (see the detailed benchmark code).

Array size: 1220.70 (MB)

Create HDF5   in: 0.0005580571014434099 s
Create Memmap in: 0.000615391880273819 s

Numpy save in: 0.5713834380730987 s
Writing data to HDF5  : 0.5530977640300989 s
Writing data to Memmap: 0.7038380969315767 s

Numpy saved size: 1220.70 (MB)
HDF5 saved size: 1220.71 (MB)
Mmap saved size: 1220.70 (MB)

Load Numpy array: 0.3723734531085938 s
Load HDF5 data  : 0.00041177100501954556 s
Load Memmap data: 0.00017150305211544037 s

Test correctness of stored data
Numpy : True
HDF5  : True
Memmap: True

Iterate Numpy data   : 0.00020254682749509811 s
Iterate HDF5 data    : 0.8945782391820103 s
Iterate Memmap data  : 0.0014937107916921377 s
Iterate Memmap (2nd) : 0.0011746759992092848 s

Numpy total time (open+iter): 0.3725759999360889 s
H5py  total time (open+iter): 0.8949900101870298 s
Mmap  total time (open+iter): 0.001665213843807578 s
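
The headline speed-ups appear to come from the three total times (open + iterate) reported above. As a quick check, the short calculation below recomputes the ratios from those figures. The variable names are introduced here for illustration only.

```python
# Total times (open + iterate) copied from the benchmark output, in seconds.
numpy_total = 0.3725759999360889
h5py_total = 0.8949900101870298
mmap_total = 0.001665213843807578

# Speed-up of the memory-mapped array relative to each alternative.
speedup_vs_h5py = h5py_total / mmap_total
speedup_vs_numpy = numpy_total / mmap_total

print(f"Memmap vs h5py : {speedup_vs_h5py:.1f}x")
print(f"Memmap vs numpy: {speedup_vs_numpy:.1f}x")
```

The ratios land close to the quoted figures. Note that most of the NumPy total is load time, since iterating an array already in memory is the fastest of the three.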

Example

from multiprocessing import Pool

import numpy as np

from bigarray import PointerArray, PointerArrayWriter

n = 80 * 10  # total number of samples
jobs = [(i, i + 10) for i in range(0, n // 10, 10)]
path = '/tmp/array'

# ====== Multiprocessing writing ====== #
writer = PointerArrayWriter(path, shape=(n,), dtype='int32', remove_exist=True)


def fn_write(job):
  start, end = job
  # each process must write to its own, non-overlapping position
  writer.write(
      {"name%i" % i: np.arange(i * 10, i * 10 + 10) for i in range(start, end)},
      start_position=start * 10)


# using 2 processes to generate and write data
with Pool(2) as p:
  p.map(fn_write, jobs)
writer.flush()
writer.close()

# ====== Multiprocessing reading ====== #
x = PointerArray(path)
print(x['name0'])
print(x['name66'])
print(x['name78'])

# normal indexing
for name, (s, e) in x.indices.items():
  data = x[s:e]
# fast indexing
for name in x.indices:
  data = x[name]


# multiprocess indexing
def fn_read(job):
  start, end = job
  total = 0
  for i in range(start, end):
    total += np.sum(x['name%d' % i])
  return total

# use multiprocessing to calculate the sum of all arrays
with Pool(2) as p:
  total_sum = sum(p.map(fn_read, jobs))
print(np.sum(x), total_sum)

Output:

[0 1 2 3 4 5 6 7 8 9]
[660 661 662 663 664 665 666 667 668 669]
[780 781 782 783 784 785 786 787 788 789]
319600 319600
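
To make the name-based lookup in the example more concrete, here is a minimal sketch of the general idea: one flat memory-mapped buffer plus a dictionary mapping each name to a (start, end) range. It is written with plain numpy.memmap and is not bigarray's implementation. The path, the `indices` dictionary, and the `lookup` helper are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical location and layout, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "flat.mmap")
n_arrays, length = 80, 10
n = n_arrays * length

# One flat on-disk buffer holding every named array back to back.
flat = np.memmap(path, dtype="int32", mode="w+", shape=(n,))
indices = {}  # name -> (start, end) offsets into the flat buffer
for i in range(n_arrays):
    start, end = i * length, (i + 1) * length
    flat[start:end] = np.arange(start, end, dtype="int32")
    indices["name%d" % i] = (start, end)
flat.flush()


def lookup(name):
    """Return the slice of the flat buffer registered under `name`."""
    start, end = indices[name]
    return flat[start:end]


name66 = lookup("name66")
total = int(flat.sum())
print(name66)
print(total)
```

With the same data layout as the example (80 arrays of 10 consecutive integers), the lookup for name66 and the overall sum agree with the output shown above.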


Source Distribution

bigarray-0.2.2.tar.gz (11.6 kB)

Built Distribution

bigarray-0.2.2-py3-none-any.whl (17.3 kB)
