femto·PyPI

What's the fastest way to sum a NumPy array?

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Operating System
- OS Independent
Programming Language
- C
- Python
- Python :: 3
Topic
- Scientific/Engineering

Project description

What’s the fastest way to sum a NumPy array? I don’t know—that’s why I created femto.

femto, written in C, contains several implementations of a sum function. To keep things simple the input array must be at least 2d and the axis of summation cannot be None. Limiting ourselves to the 1d case would have be even simpler. But I am interested in both summing along an axis where the array elements are closely packed in memory (e.g. axis=-1 of a C contiguous array) and where they are widely spaced (axis=0). Both cases require different optimizations.

My goal is to find fast ways to implement reduction functions (sum, mean, std, max, nansum, etc.) that are bound by memory I/O. I chose summation as a test case because very little time is spent with arithmetic which makes it easier to measure improvements from things like manual loop unrolling, SIMD, and OpenMP.

femto, based on code from bottleneck, comes with several benchmark suites:

>>> ss.bench_axis0()
femto performance benchmark
    femto 0.0.1.dev0; Numpy 1.11.2
    Speed is NumPy time divided by femto time
    Score is harmonic mean of speeds

      (1000,1000)(1000,1000)(1000,1000)(1000,1000)
        float64    float32     int64      int32
         axis=0     axis=0     axis=0     axis=0     score
sum00     0.34       0.22       0.63       1.19       0.40
sum01     0.35       0.22       0.65       1.20       0.41
sum02     0.47       0.27       0.69       1.19       0.49
sum03     0.56       0.45       0.87       1.54       0.69
sum04     0.81       0.45       0.56       1.55       0.68
sum10     0.80       0.45       1.20       1.90       0.83
sum11     1.19       1.34       1.20       1.88       1.35
sum12     1.19       0.45       1.21       1.89       0.91
p_sum01   1.30       0.22       2.02       1.36       0.62
p_sum02   1.38       0.23       2.41       1.36       0.63
p_sum03   1.90       1.13       2.73       4.28       1.99
p_sum04   3.22       1.07       2.71       4.33       2.16

>>> ss.bench_axis1()
femto performance benchmark
    femto 0.0.1.dev0; Numpy 1.11.2
    Speed is NumPy time divided by femto time
    Score is harmonic mean of speeds

      (1000,1000)(1000,1000)(1000,1000)(1000,1000)
        float64    float32     int64      int32
         axis=1     axis=1     axis=1     axis=1     score
sum00     0.48       0.44       0.84       1.32       0.63
sum01     0.47       0.45       1.05       1.85       0.68
sum02     1.13       1.29       1.38       3.07       1.47
sum03     1.13       1.27       1.37       3.07       1.47
sum04     1.12       1.29       1.36       3.11       1.47
sum10     1.12       1.28       1.37       3.06       1.47
sum11     1.42       3.04       1.38       3.07       1.92
sum12     1.45       1.29       1.39       3.03       1.59
p_sum01   3.11       3.04       4.16       6.80       3.85
p_sum02   4.37       4.37       5.48      10.65       5.45
p_sum03   4.20       4.35       5.52      10.48       5.37
p_sum04   4.43       4.30       5.49      10.36       5.43

I chose numpy.sum as a benchmark because it is fast and convenient. It should be possible to beat NumPy’s performance. That’s because femto has an unfair advantage. We will not duplicate the pairwise summation NumPy uses to deal with the accumulated round-off error in floating point arrays.

The overall fastest function is the one with the highest benchmark score. Let’s consider the case where we benchmark each function with two arrays (five are used by default in the benchmark) and the speeds are 0.5 (half as fast as NumPy) and 2.0 (twice as fast). What should the overall score be? Some possibilities are the mean (1.25, which is faster than NumPy), the geometric mean (1.0, same as NumPy), or the harmonic mean (0.8, slower). I chose the harmonic mean. If a NumPy program spends equal time summing the two benchmark arrays, each 1 unit of time, then it will take 1/2 + 2 units of time with femto, which is a speed of 2/2.5 = 0.8.

Let’s look at function call overhead by benchmarking with small input arrays:

>>> ss.bench_overhead()
femto performance benchmark
    femto 0.0.1.dev0; Numpy 1.11.2
    Speed is NumPy time divided by femto time
    Score is harmonic mean of speeds

        (10,10)    (10,10)   (100,100)  (100,100)
        float64    float64    float64    float64
         axis=0     axis=1     axis=0     axis=1     score
sum00    16.11      16.11       1.12       0.98       1.96
sum01    13.82      13.22       1.17       1.03       2.03
sum02    15.83      15.91       2.76       2.51       4.51
sum03    16.94      13.38       2.81       2.40       4.42
sum04    17.85      13.45       4.53       2.38       5.19
sum10    16.67      18.16       2.01       2.52       3.96
sum11    18.33      18.62       3.33       3.52       5.78
sum12    18.29      16.14       3.27       3.01       5.30
p_sum01   1.75       1.73       2.59       2.20       2.01
p_sum02   1.80       1.77       2.83       2.58       2.15
p_sum03   1.83       1.87       2.94       2.55       2.21
p_sum04   1.90       1.85       3.30       2.46       2.25

Please help me avoid over optimizing for my particular operating system, CPU, and compiler. Let me know the benchmark results on your system. If you have ideas on how to speed up the code then share them.

License

femto is distributed under the GPL v3+. See the LICENSE file for details.

Requirements

Currently femto only compiles on GNU/Linux. Please help us with getting it to compile on OSX and Windows.

SSE3, AVX, x86intrin.h, OpenMP
Python 2.7, 3.4, 3.5
NumPy 1.11
gcc
nose

Project details

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Operating System
- OS Independent
Programming Language
- C
- Python
- Python :: 3
Topic
- Scientific/Engineering

Release history Release notifications | RSS feed

This version

0.0.1.dev0 pre-release

Jan 30, 2017

femto 0.0.1.dev0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

License

Requirements

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed