bigarray
Fast and scalable numpy array for machine learning and artificial intelligence, using memory-mapped I/O
Stable build:
pip install bigarray
Nightly build from GitHub:
pip install git+https://github.com/trungnt13/bigarray@master
The three principles
- Transparency: everything is a numpy.array; metadata and extra features (e.g. multiprocessing, indexing) are handled seamlessly in the background, as the sketch after this list shows.
- Pragmatism: fast but easy; a simplified API for the common use cases.
- Focus: "Do one thing and do it well."
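A minimal single-process sketch of that transparency, using the same PointerArray API demonstrated in the Example section below (the path and values here are illustrative):

import numpy as np
from bigarray import PointerArray, PointerArrayWriter

# write two named arrays into a single memory-mapped file;
# arrays in the dict are laid out consecutively from start_position
writer = PointerArrayWriter('/tmp/demo', shape=(20,), dtype='int32',
                            remove_exist=True)
writer.write({'a': np.arange(10), 'b': np.arange(10, 20)}, start_position=0)
writer.flush()
writer.close()

# the loaded object behaves like a plain numpy.array
x = PointerArray('/tmp/demo')
print(np.sum(x))   # 190, i.e. sum of 0..19
print(x['a'] + 1)  # named access still returns numpy data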
The benchmarks
About 535 times faster than HDF5 (via h5py) and 223 times faster than a plain numpy.array saved with numpy.save, measured as the total time to open and iterate a stored array.
Output of the detailed benchmark:
Array size: 1220.70 (MB)
Create HDF5 in: 0.0005580571014434099 s
Create Memmap in: 0.000615391880273819 s
Numpy save in: 0.5713834380730987 s
Writing data to HDF5 : 0.5530977640300989 s
Writing data to Memmap: 0.7038380969315767 s
Numpy saved size: 1220.70 (MB)
HDF5 saved size: 1220.71 (MB)
Mmap saved size: 1220.70 (MB)
Load Numpy array: 0.3723734531085938 s
Load HDF5 data : 0.00041177100501954556 s
Load Memmap data: 0.00017150305211544037 s
Test correctness of stored data
Numpy : True
HDF5 : True
Memmap: True
Iterate Numpy data : 0.00020254682749509811 s
Iterate HDF5 data : 0.8945782391820103 s
Iterate Memmap data : 0.0014937107916921377 s
Iterate Memmap (2nd) : 0.0011746759992092848 s
Numpy total time (open+iter): 0.3725759999360889 s
H5py total time (open+iter): 0.8949900101870298 s
**Mmap total time (open+iter): 0.001665213843807578 s**
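For reference, a comparison of this kind can be sketched with plain numpy and h5py. This is an illustrative reconstruction, not the package's actual benchmark script; the array size, file paths, and measured operations are assumptions:

import time
import numpy as np
import h5py

x = np.random.rand(1000, 1000)  # ~8 MB; scale up to reproduce large-array behavior

# numpy: np.save writes the whole array, np.load reads it back eagerly
t = time.perf_counter()
np.save('/tmp/x.npy', x)
print('numpy save:', time.perf_counter() - t)
t = time.perf_counter()
y = np.load('/tmp/x.npy')
print('numpy load:', time.perf_counter() - t)

# HDF5 via h5py
t = time.perf_counter()
with h5py.File('/tmp/x.h5', 'w') as f:
    f.create_dataset('x', data=x)
print('hdf5 write:', time.perf_counter() - t)
t = time.perf_counter()
with h5py.File('/tmp/x.h5', 'r') as f:
    z = f['x'][:]
print('hdf5 read :', time.perf_counter() - t)

# memory-mapped I/O: writing goes through the OS page cache,
# and reopening only maps pages, so data is loaded lazily on access
mm = np.memmap('/tmp/x.mm', dtype='float64', mode='w+', shape=x.shape)
t = time.perf_counter()
mm[:] = x
mm.flush()
print('mmap write:', time.perf_counter() - t)
t = time.perf_counter()
mm = np.memmap('/tmp/x.mm', dtype='float64', mode='r', shape=x.shape)
print('mmap open :', time.perf_counter() - t)

The near-instant "mmap open" is what drives the headline numbers above: opening a memory-mapped file defers all actual reading to the moment pages are touched.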
Example
from multiprocessing import Pool

import numpy as np

from bigarray import PointerArray, PointerArrayWriter

n = 80 * 10  # total number of samples
# each job is a (start, end) range of sample names; every name owns 10 values
jobs = [(i, i + 10) for i in range(0, n // 10, 10)]
path = '/tmp/array'

# ====== Multiprocessing writing ====== #
writer = PointerArrayWriter(path, shape=(n,), dtype='int32', remove_exist=True)

def fn_write(job):
    start, end = job
    # it is crucial that each process writes to a different position
    writer.write(
        {"name%i" % i: np.arange(i * 10, i * 10 + 10) for i in range(start, end)},
        start_position=start * 10)

# use 2 processes to generate and write the data
# (on Unix, the workers inherit the writer via fork)
with Pool(2) as p:
    p.map(fn_write, jobs)
writer.flush()
writer.close()

# ====== Multiprocessing reading ====== #
x = PointerArray(path)
print(x['name0'])
print(x['name66'])
print(x['name78'])

# normal indexing: look up the (start, end) positions, then slice
for name, (s, e) in x.indices.items():
    data = x[s:e]

# fast indexing: index directly by name
for name in x.indices:
    data = x[name]

# multiprocess indexing
def fn_read(job):
    start, end = job
    total = 0
    for i in range(start, end):
        total += np.sum(x['name%d' % i])
    return total

# use multiprocessing to calculate the sum of all arrays
with Pool(2) as p:
    total_sum = sum(p.map(fn_read, jobs))
print(np.sum(x), total_sum)
Output:
[0 1 2 3 4 5 6 7 8 9]
[660 661 662 663 664 665 666 667 668 669]
[780 781 782 783 784 785 786 787 788 789]
319600 319600
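A note on the multiprocessing pattern above: because start_position assigns every job a disjoint region of the file, the worker processes never touch the same bytes, so no locking is needed; concurrent reads are likewise safe, as the matching totals (319600 == 319600) confirm.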