No-copy parallelized bincount returning dict


As of Nov 2018, np.bincount is unusable with large memmaps:

>>> import numpy as np
>>> np.bincount(np.memmap('some-5gb-file.txt', mode='r'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

The fastest pure-Python equivalent of wc -l is a bit slow:

In [6]: %%time
   ...: sum(1 for i in open('some-5gb-file.txt', mode='rb'))
CPU times: user 3.5 s, sys: 878 ms, total: 4.38 s
Wall time: 4.38 s
Out[6]: 58941384

It is about 3.5 times slower than wc -l:

In [1]: %%time
   ...: !wc -l some-5gb-file.txt
58941384 some-5gb-file.txt
CPU times: user 1.48 ms, sys: 3.48 ms, total: 4.96 ms
Wall time: 1.24 s

Counting newline bytes (value 10) with bincount is faster still on modern multicore SMP systems:

In [1]: import numpy as np

In [2]: from bincount import bincount

In [3]: %%time
   ...: bincount(np.memmap('some-5gb-file.txt', mode='r'))[10]
CPU times: user 6.83 s, sys: 354 ms, total: 7.19 s
Wall time: 705 ms
Out[4]: 58941384
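The speedup above comes from counting different parts of the buffer at the same time and then merging the partial counts. The following is only a rough Python sketch of that split-count-merge idea, not the package's implementation (which is compiled C with OpenMP). The helper name chunked_bincount and the n_chunks parameter are made up for illustration, and a thread pool in Python is not guaranteed to give the same speedup.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunked_bincount(data, n_chunks=4):
    """Hypothetical helper: count byte values chunk by chunk, then merge."""
    # View the bytes-like object (bytes, memoryview, np.memmap, ...) as uint8
    # without copying it.
    arr = np.frombuffer(data, dtype=np.uint8)
    # array_split returns views, so no chunk is copied either.
    chunks = np.array_split(arr, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Each worker produces a 256-slot histogram for its own chunk.
        partials = list(pool.map(lambda c: np.bincount(c, minlength=256), chunks))
    # Merging is just an element-wise sum of the per-chunk histograms.
    return np.sum(partials, axis=0)


counts = chunked_bincount(b"a\nb\nc\n" * 1000, n_chunks=3)
newlines = int(counts[10])  # byte value 10 is "\n", hence the [10] lookup above
```

Because each chunk gets its own histogram, the workers never write to shared state; the only sequential step is the final sum over 256 slots.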


Prerequisites: a C compiler with OpenMP support.

Install with pip:

pip install bincount


The package provides two functions: bincount (the parallel version) and bincount_single (which does not parallelize the calculation). Both return a dict containing the number of occurrences of each byte value in the passed bytes-like object:

>>> from bincount import bincount
>>> bincount(open('a-tiny-file.txt', 'rb').read())
{59: 2, 65: 5, 66: 1, 67: 3, 68: 2, 69: 3, 73: 4, 76: 7, 84: 3, 86: 1, 95: 4}
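To make the shape of the returned dict concrete, here is a small pure-NumPy sketch that builds the same kind of mapping for an in-memory buffer. It is an illustration only: bincount_reference is a hypothetical name, not part of the package, and it assumes (judging from the example output above) that byte values with a zero count are left out of the dict.

```python
import numpy as np


def bincount_reference(data):
    """Hypothetical reference: {byte value: occurrences} for a bytes-like object."""
    # Interpret the buffer as unsigned bytes without copying it.
    arr = np.frombuffer(data, dtype=np.uint8)
    # One slot per possible byte value, 0 through 255.
    counts = np.bincount(arr, minlength=256)
    # Keep only the byte values that actually occur (assumed behavior).
    return {value: int(n) for value, n in enumerate(counts) if n}


result = bincount_reference(b"ABCA\n")
```

Keys are integer byte values, so a character is looked up by its code, for example result[ord("A")].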


Files for bincount, version 0.0.5: bincount-0.0.5.tar.gz (105.3 kB), source distribution.
