What's the fastest way to sum a NumPy array?

## Project description

What’s the fastest way to sum a NumPy array? I don’t know—that’s why I created femto.

femto, written in C, contains several implementations of a sum function. To keep things simple the input array must be at least 2d and the axis of summation cannot be None. Limiting ourselves to the 1d case would have be even simpler. But I am interested in both summing along an axis where the array elements are closely packed in memory (e.g. axis=-1 of a C contiguous array) and where they are widely spaced (axis=0). Both cases require different optimizations.

My goal is to find fast ways to implement reduction functions (sum, mean, std, max, nansum, etc.) that are bound by memory I/O. I chose summation as a test case because very little time is spent with arithmetic which makes it easier to measure improvements from things like manual loop unrolling, SIMD, and OpenMP.

femto, based on code from bottleneck, comes with several benchmark suites:

>>> ss.bench_axis0()
femto performance benchmark
femto 0.0.1.dev0; Numpy 1.11.2
Speed is NumPy time divided by femto time
Score is harmonic mean of speeds

(1000,1000)(1000,1000)(1000,1000)(1000,1000)
float64    float32     int64      int32
axis=0     axis=0     axis=0     axis=0     score
sum00     0.34       0.22       0.63       1.19       0.40
sum01     0.35       0.22       0.65       1.20       0.41
sum02     0.47       0.27       0.69       1.19       0.49
sum03     0.56       0.45       0.87       1.54       0.69
sum04     0.81       0.45       0.56       1.55       0.68
sum10     0.80       0.45       1.20       1.90       0.83
sum11     1.19       1.34       1.20       1.88       1.35
sum12     1.19       0.45       1.21       1.89       0.91
p_sum01   1.30       0.22       2.02       1.36       0.62
p_sum02   1.38       0.23       2.41       1.36       0.63
p_sum03   1.90       1.13       2.73       4.28       1.99
p_sum04   3.22       1.07       2.71       4.33       2.16

>>> ss.bench_axis1()
femto performance benchmark
femto 0.0.1.dev0; Numpy 1.11.2
Speed is NumPy time divided by femto time
Score is harmonic mean of speeds

(1000,1000)(1000,1000)(1000,1000)(1000,1000)
float64    float32     int64      int32
axis=1     axis=1     axis=1     axis=1     score
sum00     0.48       0.44       0.84       1.32       0.63
sum01     0.47       0.45       1.05       1.85       0.68
sum02     1.13       1.29       1.38       3.07       1.47
sum03     1.13       1.27       1.37       3.07       1.47
sum04     1.12       1.29       1.36       3.11       1.47
sum10     1.12       1.28       1.37       3.06       1.47
sum11     1.42       3.04       1.38       3.07       1.92
sum12     1.45       1.29       1.39       3.03       1.59
p_sum01   3.11       3.04       4.16       6.80       3.85
p_sum02   4.37       4.37       5.48      10.65       5.45
p_sum03   4.20       4.35       5.52      10.48       5.37
p_sum04   4.43       4.30       5.49      10.36       5.43

I chose numpy.sum as a benchmark because it is fast and convenient. It should be possible to beat NumPy’s performance. That’s because femto has an unfair advantage. We will not duplicate the pairwise summation NumPy uses to deal with the accumulated round-off error in floating point arrays.

The overall fastest function is the one with the highest benchmark score. Let’s consider the case where we benchmark each function with two arrays (five are used by default in the benchmark) and the speeds are 0.5 (half as fast as NumPy) and 2.0 (twice as fast). What should the overall score be? Some possibilities are the mean (1.25, which is faster than NumPy), the geometric mean (1.0, same as NumPy), or the harmonic mean (0.8, slower). I chose the harmonic mean. If a NumPy program spends equal time summing the two benchmark arrays, each 1 unit of time, then it will take 1/2 + 2 units of time with femto, which is a speed of 2/2.5 = 0.8.

Let’s look at function call overhead by benchmarking with small input arrays:

>>> ss.bench_overhead()
femto performance benchmark
femto 0.0.1.dev0; Numpy 1.11.2
Speed is NumPy time divided by femto time
Score is harmonic mean of speeds

(10,10)    (10,10)   (100,100)  (100,100)
float64    float64    float64    float64
axis=0     axis=1     axis=0     axis=1     score
sum00    16.11      16.11       1.12       0.98       1.96
sum01    13.82      13.22       1.17       1.03       2.03
sum02    15.83      15.91       2.76       2.51       4.51
sum03    16.94      13.38       2.81       2.40       4.42
sum04    17.85      13.45       4.53       2.38       5.19
sum10    16.67      18.16       2.01       2.52       3.96
sum11    18.33      18.62       3.33       3.52       5.78
sum12    18.29      16.14       3.27       3.01       5.30
p_sum01   1.75       1.73       2.59       2.20       2.01
p_sum02   1.80       1.77       2.83       2.58       2.15
p_sum03   1.83       1.87       2.94       2.55       2.21
p_sum04   1.90       1.85       3.30       2.46       2.25

Please help me avoid over optimizing for my particular operating system, CPU, and compiler. Let me know the benchmark results on your system. If you have ideas on how to speed up the code then share them.

femto is distributed under the GPL v3+. See the LICENSE file for details.

## Requirements

Currently femto only compiles on GNU/Linux. Please help us with getting it to compile on OSX and Windows.

• SSE3, AVX, x86intrin.h, OpenMP

• Python 2.7, 3.4, 3.5

• NumPy 1.11

• gcc

• nose