Summation algorithms

In bug 61143, which was originally about overflow in mean() when averaging a large number of singles, a discussion evolved about which algorithm to use for summation (by default, for single and/or double inputs). Perhaps this is a more appropriate place to continue that discussion and hopefully arrive at a solution.

Octave is written for its users, who have a broad range of experience with numerical software, and the question is how we can be most helpful to this community. The three criteria are performance, accuracy, and the avoidance of surprises (the last with respect to which algorithm is used by default); memory consumption is insignificant in all cases considered here.

For simplicity, in the following discussion I will always talk only about summing a single vector; summing over dimensions of multi-dimensional arrays is done as if they were sliced into such vectors beforehand. At present, summation is by default performed by the naive algorithm: starting from the beginning of the vector, the entries are sequentially added to an accumulant, and at the end the value of this accumulant is returned. For floating-point inputs, the accumulant by default has the same type as the input, while specifying the option “double” makes single-precision inputs be summed with a double-precision accumulant, and the returned value then also has type double (the other way round, summing doubles with single precision, is not supported). Further, specifying “extra” for double inputs uses a compensated algorithm, which for the present purposes can be seen as an improved variant of Kahan summation.
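To make the options concrete, here is a minimal m-code sketch of the naive loop with a selectable accumulant type (purely illustrative; the real sum() is compiled code, and the function name and argument are made up for this sketch):

function s = naive_sum (x, use_double_accum = false)
  % Illustration of the naive left-to-right loop; not the actual built-in code.
  if (use_double_accum)
    acc = 0;                     % double-precision accumulant (the "double" option)
    for k = 1:numel (x)
      acc += double (x(k));      % widen each element before adding
    endfor
  else
    acc = cast (0, class (x));   % accumulant has the same type as the input
    for k = 1:numel (x)
      acc += x(k);
    endfor
  endif
  s = acc;                       % with the "double" option the result stays double
endfunction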

So to discuss the three criteria: quick testing (on 32-bit and 64-bit Octave, both running on amd64 Intel hardware) shows that, performance-wise, the three naive variants now available are about equal (summing singles with native precision, summing singles with double precision, and summing doubles with native precision); only summing doubles with “extra” accuracy takes a bit more than twice as long.
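For anyone who wants to reproduce such timings, a rough sketch (not the exact commands I used; reduce n if memory is tight) is:

n = 1e8;                      % large enough that timing differences would show
s = rand (n, 1, "single");
d = rand (n, 1);
tic; sum (s);           toc   % single input, single accumulant
tic; sum (s, "double"); toc   % single input, double accumulant
tic; sum (d);           toc   % double input, double accumulant
tic; sum (d, "extra");  toc   % double input, compensated ("extra") summation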

With respect to accuracy, it is more complicated: summing with single precision quite quickly leads to obvious accuracy problems such as the one reported in the bug report. For double precision, my tests showed that my 32-bit Octave actually seems to use an accumulant of 80-bit extended precision, which in most cases is good enough, but my 64-bit system quite certainly uses only 64-bit doubles, so accuracy problems are not hard to find even when summing doubles with the default algorithm (see the end of this post). Summing with “extra” will be good enough apart from cases constructed to make it fail.
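A minimal example of the single-precision problem (essentially the situation of the bug report): with a strictly sequential single-precision accumulant, as in the current default, adding 1 to 2^24 = 16777216 has no effect any more, so the naive single sum saturates there:

x = ones (2e7, 1, "single");
sum (x)            % naive single accumulant: 16777216 instead of 2e7
sum (x, "double")  % double accumulant: 20000000, exact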

And with respect to the avoidance of surprises: here opinions seem to diverge quite a bit. In the discussion of bug 61143, there are those who argue that the limitations of floating-point arithmetic should be known, implying that users should be aware of the consequences of summing a large vector of inputs, and that using a more accurate algorithm would be counterproductive because it would confuse those who do expect discrepancies due to finite accuracy. On the other hand, the very existence of the bug report shows that there are also users who did not expect such accuracy issues.

In the bug discussion, I have attached a sample implementation of pair-wise summation. As I argue there, performance-wise it is as good as (in my implementation even better than) all the naive summations, and accuracy-wise it is as good as “extra” summation. Thus, I propose to use it by default in all cases apart from an explicit “extra”, specifically also for single-precision inputs, converting the returned value back to single precision at the end in that case. I do think that users should be aware of the limitations of floating-point representations (double-precision numbers are only double precision, not more), but with respect to the algorithms used, they can rightly expect better algorithms to be used by default if those are easy to implement (where better means more accurate at equal performance).
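The attached oct-file is compiled code that works on strided data; purely to illustrate the recursion, pair-wise summation could be sketched in m-code as follows (function name and block size are made up for the illustration):

function s = pairwise_sum (x)
  % Recursive pair-wise summation sketch; the real implementation works on a
  % pointer plus stride and uses a tuned block size for the base case.
  n = numel (x);
  if (n <= 128)              % small blocks: plain naive summation
    s = 0;
    for k = 1:n
      s += double (x(k));    % accumulate in double also for single input
    endfor
  else                       % otherwise split in half and recurse
    m = floor (n / 2);
    s = pairwise_sum (x(1:m)) + pairwise_sum (x(m+1:end));
  endif
endfunction

For single-precision input, the final double result would then be converted back to single before it is returned.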

For reference, Matlab, too, has recently recognized that naive summation is not so good in terms of accuracy and has switched to an algorithm that is somewhat better (without specifying it in detail): https://blogs.mathworks.com/loren/2021/09/07/a-faster-and-more-accurate-sum/. Thus, there is no reason to feel bound to the current behaviour, but for the future we should of course specify the algorithm used in the documentation.

Finally, to show that the accuracy issues when summing doubles with default options (in the case of 64-bit accumulant precision, as I see here on 64-bit Octave) can be significant also for typical problems, I attach test_sum_integral.m, a script that integrates exp(-x) between 0 and 1 by the midpoint rule (to be used with the oct-file sum_pair3.oct as attached to the bug discussion). There are rigorous error bounds, but starting at around N = 2000000 function evaluations, the error of the numerical integral with default summation is larger than the analytical error bound, decreases only slowly in this range of N, and even increases again at very large N. The error with pair-wise summation, on the other hand, is always smaller than the analytical bound, and in particular reaches zero once the analytical bound drops below eps.
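For those who do not want to download the attachment, the idea of the test is roughly the following (this is just a sketch of the comparison for the default summation, not the attached script itself):

N = 2e6;                        % number of midpoint evaluations
h = 1 / N;
x = ((1:N) - 0.5) * h;          % midpoints of the subintervals of [0,1]
f = exp (-x);
I_exact = 1 - exp (-1);
I_naive = h * sum (f);          % default (naive) summation
err_bound = h^2 / 24;           % midpoint-rule bound, since |f''| <= 1 on [0,1]
printf ("error = %g, analytical bound = %g\n", abs (I_naive - I_exact), err_bound);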

I haven’t done any testing yet or even looked at the proposed changes. Just one question for clarification:
Are you proposing to always convert single array input to double for sum and convert back to single at the end?

In effect, yes. But to be specific, the only thing to change would be the type of the accumulant (in the compiled code): you take the single-precision values as they come, but add them to a double-precision value, which promotes the operation to double precision (converting the whole input array would need additional memory), and at the end you convert the result back to single. This means additional effort for the floating-point unit, because the singles have to be widened to doubles and the elementary additions have to be performed in the wider type, but interestingly there is nearly no difference in performance. You can try that yourself by timing the operations
sum(d)
sum(d,"double")
sum(double(d))
where d is a single-precision vector – the first two take equally long (the third takes longer), and the last two give the same result.

Thanks for the clarification. I was worried about the memory footprint. But if it is only the accumulant that has the wider type, that is probably no big deal.
I haven’t looked at the implementation. Does the size of the accumulant depend on the size of the result? E.g. would something like sum(rand(1e3,1e3,2, 'single'), 3) create a large intermediate accumulant?

It’s interesting that sum(d) and sum(d, "double") take roughly the same time for single-precision input d. I can confirm that observation with Octave 6.3.0 (32-bit) on Windows 10 (64-bit).

Up to now, the implementation exists only as an .oct-file that provides an additional function sum_pair3() (never mind the name). It does not do any input validation and has no notion of multi-dimensional arrays; it just sums all the values in the array. It is meant as a basis for discussing achievable performance and accuracy. In the end, it should replace the part of the present internal sum() where the summation is actually performed; the surrounding code would still take care of all the input validation etc., and in particular the new routine would only see the vectors handed to it.

So to answer your question: no, there is no large intermediate accumulant, because also now, in your example, the sums are computed sequentially over two-element vectors only: once one sum is computed, its result is put into the 1000x1000 result array, and the next two-element vector is processed. By the way, such a vector wouldn’t be extracted in the sense of allocating a copy of it; rather, a pointer to the first element to be summed and a stride length would be passed to the fundamental summation routine.
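A quick check of what that means for your example (each output element is the sum of just two values, so the result equals the elementwise sum of the two slices and keeps the input type):

A = rand (1e3, 1e3, 2, "single");
S = sum (A, 3);                   % one addition per output element
isequal (S, A(:,:,1) + A(:,:,2))  % should be true with the default single accumulant
class (S)                         % "single"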