No-copy parallelized bincount returning dict
Motivation
As of Nov 2018, np.bincount is unusable with large memmaps:
>>> import numpy as np
>>> np.bincount(np.memmap('some-5gb-file.txt', mode='r'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
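A likely cause is that np.bincount converts its input to a wider integer type before counting, so a byte array needs several times its own size in memory. The following is a minimal sketch of a pure-NumPy workaround that counts one chunk at a time, assuming a uint8 input. The names chunked_bincount and chunk_size are hypothetical and not part of this package.

```python
import numpy as np


def chunked_bincount(data, chunk_size=1 << 24):
    """Histogram a uint8 array chunk by chunk to bound peak memory."""
    totals = np.zeros(256, dtype=np.int64)
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # minlength=256 gives every partial histogram the same shape,
        # so the partial results can simply be added up.
        totals += np.bincount(chunk, minlength=256)
    return totals


# Tiny in-memory stand-in for a large memmap.
sample = np.frombuffer(b"a\nbb\nccc\n", dtype=np.uint8)
counts = chunked_bincount(sample, chunk_size=4)
print(counts[10])  # number of newline bytes in the sample
```

This bounds memory but stays single-threaded, so it does not address the speed concern discussed next.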
The most efficient pure-Python equivalent of wc -l is rather slow:
In [6]: %%time
...: sum(1 for i in open('some-5gb-file.txt', mode='rb'))
...:
CPU times: user 3.5 s, sys: 878 ms, total: 4.38 s
Wall time: 4.38 s
Out[6]: 58941384
That is about 3.5 times slower than wc -l:
In [1]: %%time
...: !wc -l some-5gb-file.txt
...:
58941384 some-5gb-file.txt
CPU times: user 1.48 ms, sys: 3.48 ms, total: 4.96 ms
Wall time: 1.24 s
Counting newline bytes (byte value 10) should be faster than that on modern multicore SMP systems:
In [1]: import numpy as np
In [2]: from bincount import bincount
In [3]: %%time
...: bincount(np.memmap('some-5gb-file.txt', mode='r'))[10]
...:
CPU times: user 6.83 s, sys: 354 ms, total: 7.19 s
Wall time: 705 ms
Out[3]: 58941384
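The idea behind a parallel count can be sketched as split, count, merge: cut the buffer into slices without copying, histogram each slice independently, then add the partial histograms. The sketch below uses only the standard library and is not the package's implementation; parallel_byte_counts and workers are hypothetical names. Because of the GIL, this pure-Python version illustrates the structure only and is not expected to run faster than a serial loop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parallel_byte_counts(data, workers=4):
    """Count byte values by histogramming slices and merging the results."""
    view = memoryview(data)  # slicing a memoryview does not copy the bytes
    step = max(1, -(-len(view) // workers))  # ceiling division
    chunks = [view[i:i + step] for i in range(0, len(view), step)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker builds a partial histogram for its own slice.
        for partial in pool.map(Counter, chunks):
            total.update(partial)  # merge step: add the partial counts
    return dict(sorted(total.items()))


result = parallel_byte_counts(b"a\nbb\nccc\n", workers=3)
print(result)
```

The merge step is cheap because each partial histogram has at most 256 entries, regardless of how large the input is.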
Install
Prerequisites: a C compiler with OpenMP support.
Install with pip:
pip install bincount
Usage
The package provides two functions: bincount, which runs in parallel, and bincount_single, which does not parallelize the calculation. Both return a dict containing the number of occurrences of each byte value in the passed bytes-like object:
>>> from bincount import bincount
>>> bincount(open('a-tiny-file.txt', 'rb').read())
{59: 2, 65: 5, 66: 1, 67: 3, 68: 2, 69: 3, 73: 4, 76: 7, 84: 3, 86: 1, 95: 4}
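The returned dict can be post-processed with plain Python. The sketch below reuses the output shown above as a literal, so it runs without the package installed. It assumes, as that output suggests, that byte values with no occurrences are absent from the dict, hence the use of .get with a default.

```python
# The dict from the example above, copied as a literal.
counts = {59: 2, 65: 5, 66: 1, 67: 3, 68: 2, 69: 3,
          73: 4, 76: 7, 84: 3, 86: 1, 95: 4}

# Line count: byte value 10 is "\n"; default to 0 when the key is absent.
newline_count = counts.get(ord("\n"), 0)

# Total number of bytes that were counted.
total_bytes = sum(counts.values())

# Same counts keyed by character instead of byte value.
by_char = {chr(value): n for value, n in counts.items()}

# Dense 256-entry histogram, similar in shape to np.bincount output.
dense = [counts.get(value, 0) for value in range(256)]

print(newline_count, total_bytes, by_char["L"])
```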
Download files
Source Distribution
File details
Details for the file bincount-0.0.5.tar.gz.
File metadata
- Download URL: bincount-0.0.5.tar.gz
- Upload date:
- Size: 105.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | a898625ce9d90430d54283572ca179319720a1d0364a0998b9924ac9c65be350
MD5 | b8bd6bebbaed35b2ce94171873eac35a
BLAKE2b-256 | 10c461c04d56271fce78d6c6744f04b23bb124fafac0954ab3fd475732983834