Skip to main content

No-copy parallelized bincount returning dict

Project description

No-copy parallelized bincount returning dict.

Motivation

As of Nov 2018, np.bincount is unusable with large memmaps:

>>> import numpy as np
>>> np.bincount(np.memmap('some-5gb-file.txt', mode='r'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

The most effective pure-python solution for wc -l is a bit slow:

In [6]: %%time
   ...: sum(1 for i in open('some-5gb-file.txt', mode='rb'))
   ...:
CPU times: user 3.5 s, sys: 878 ms, total: 4.38 s
Wall time: 4.38 s
Out[6]: 58941384

It is 3x times slower than wc -l:

In [1]: %%time
   ...: !wc -l some-5gb-file.txt
   ...:
58941384 some-5gb-file.txt
CPU times: user 1.48 ms, sys: 3.48 ms, total: 4.96 ms
Wall time: 1.24 s

While it should be faster on modern multicore SMP systems:

In [1]: import numpy as np

In [2]: from bincount import bincount

In [3]: %%time
   ...: bincount(np.memmap('some-5gb-file.txt', mode='r'))[10]
   ...:
CPU times: user 6.83 s, sys: 354 ms, total: 7.19 s
Wall time: 705 ms
Out[4]: 58941384

Install

Prequirements: C-compiler with OpenMP support.

Install with pip:

pip install bincount

Usage

There is a bincount (a parallel version) and a bincount_single (which don’t parallelize the calculation) functions, both returning the dict containing the number of occurrences of each byte value in the passed bytes-like object:

>>> from bincount import bincount
>>> bincount(open('a-tiny-file.txt', 'rb').read())
{59: 2, 65: 5, 66: 1, 67: 3, 68: 2, 69: 3, 73: 4, 76: 7, 84: 3, 86: 1, 95: 4}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bincount-0.0.5.tar.gz (105.3 kB view details)

Uploaded Source

File details

Details for the file bincount-0.0.5.tar.gz.

File metadata

  • Download URL: bincount-0.0.5.tar.gz
  • Upload date:
  • Size: 105.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for bincount-0.0.5.tar.gz
Algorithm Hash digest
SHA256 a898625ce9d90430d54283572ca179319720a1d0364a0998b9924ac9c65be350
MD5 b8bd6bebbaed35b2ce94171873eac35a
BLAKE2b-256 10c461c04d56271fce78d6c6744f04b23bb124fafac0954ab3fd475732983834

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page