In bug 61143, which was originally about overflow in mean() when averaging a large number of singles, a discussion evolved about which summation algorithm to use by default for single and/or double inputs. This is perhaps a more appropriate place to continue that discussion and hopefully arrive at a solution.
Octave is written for its users, who show a broad range of experience with numerical software, and the question is how we can be most helpful to this community. The three criteria are performance, accuracy, and the avoidance of surprises (the last with respect to which algorithm is used by default); memory consumption is insignificant in all cases.
For simplicity, in the following I will only talk about summing a single vector; summing over dimensions of multi-dimensional arrays is done as if they were sliced into such vectors beforehand. As of now, summation is by default performed by the naive algorithm: starting from the beginning of the vector, the entries are sequentially added to an accumulator, and at the end the value of this accumulator is returned. For floating-point inputs, the type of the accumulator is by default the same as that of the input, while specifying the option “double” allows single-precision inputs to be summed with a double-precision accumulator, in which case the returned value also has type double (the other way round, summing doubles with single precision, is not supported). Further, specifying “extra” for double inputs uses a compensated algorithm, which for the present purposes can be seen as an improved variant of Kahan summation.
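To make the two families of algorithms concrete, here is a minimal Python sketch of naive summation with a selectable accumulator type, and of plain Kahan summation (the actual “extra” algorithm in Octave is an improved compensated variant; this is only the simple relative mentioned above, and the function names are mine):

```python
import numpy as np

def naive_sum(x, acc_dtype):
    """Naive sequential summation: add each entry to a single accumulator
    of type acc_dtype (e.g. np.float32 or np.float64) and return it."""
    acc = acc_dtype(0)
    for v in x:
        acc = acc_dtype(acc + acc_dtype(v))
    return acc

def kahan_sum(x):
    """Plain Kahan compensated summation in double precision: a second
    variable carries the low-order bits lost in each addition."""
    s = 0.0
    c = 0.0  # running compensation
    for v in x:
        y = float(v) - c
        t = s + y
        c = (t - s) - y  # what was lost when adding y to s
        s = t
    return s
```

For instance, `naive_sum([0.1] * 10, float)` yields 0.9999999999999999, while `kahan_sum([0.1] * 10)` returns the correctly rounded 1.0.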
So to discuss the three criteria: quick testing (on 32-bit and 64-bit Octave, both running on amd64 Intel hardware) shows that, performance-wise, the three naive variants now available (summing singles with native precision, summing singles with double precision, and summing doubles with native precision) are essentially equal; only summing doubles with “extra” accuracy takes slightly more than twice as long.
With respect to accuracy, it is more complicated: summing with single precision quite quickly leads to obvious accuracy problems such as the one reported in the bug. For double precision, my tests showed that my 32-bit Octave actually seems to use an accumulator of 80-bit extended precision, which in most cases is good enough, but my 64-bit system quite certainly uses only 64-bit doubles, so accuracy problems are not hard to find even when summing doubles with the default algorithm (see the end of this post). Summing with “extra” will be good enough apart from cases constructed to make it fail.
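The root cause of the single-precision failures is easy to demonstrate: a float32 significand has 24 bits, so once the accumulator reaches 2^24 the spacing between representable values is already 1, and small addends are dropped entirely. Doubles only push the same cliff out to 2^53. A quick illustration:

```python
import numpy as np

# float32 spacing at 2**24 is 1, so adding 1 to the accumulator is a no-op:
a = np.float32(2**24)            # 16777216.0
print(a + np.float32(1.0) == a)  # True -- the added 1 is lost entirely
# Hence a naive single-precision sum of ones stalls at 16777216
# no matter how many more entries follow.

# The same effect exists for doubles, just much later, at 2**53:
b = np.float64(2**53)
print(b + 1.0 == b)              # True
```

Long before the accumulator stalls completely, partial cancellation of low-order bits in every addition already degrades the result, which is what the default double summation suffers from in the integration example at the end of this post.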
And with respect to the avoidance of surprises: here the opinions seem to diverge considerably. In the discussion of bug 61143, some argue that the limitations of floating-point arithmetic should be known, implying that users should be aware of the consequences of summing a large vector of inputs, and that using a more accurate algorithm would be counterproductive by confusing those who do expect discrepancies due to finite accuracy. On the other hand, the very existence of the bug report shows that there are also users who did not expect such accuracy issues.
In the bug discussion, I have attached a sample implementation of pairwise summation. As I argue there, performance-wise it is as good as (in my implementation, even better than) all the naive summations, and accuracy-wise it is as good as “extra” summation. Thus, I propose to use it by default in all cases apart from an explicit “extra”, specifically also for single-precision inputs, converting the returned value back to single precision at the end in that case. I do think that users should be aware of the limitations of floating-point representations (double-precision numbers are only double precision, not more), but with respect to the algorithms used, they can rightly expect better ones to be used by default when they are easy to implement (where better means more accurate at equal performance).
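For readers who have not followed the bug discussion, the idea of pairwise summation can be sketched in a few lines (this is not my attached C++ implementation, only an illustrative Python version; the block-size cutoff, below which real implementations fall back to a fast naive loop for performance, is a typical but arbitrary choice here):

```python
def pairwise_sum(x, block=128):
    """Recursive pairwise summation: split the vector in half, sum each
    half, and add the two partial sums.  The rounding error grows only
    like O(log n) instead of the O(n) of naive sequential summation."""
    n = len(x)
    if n <= block:
        # small base case: a plain naive loop, which is fast and whose
        # error over at most `block` terms is negligible
        s = 0.0
        for v in x:
            s += float(v)
        return s
    mid = n // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)
```

Because the recursion only touches partial sums of comparable magnitude, the intermediate results stay well balanced, which is where the accuracy gain comes from; and since the base case is the same tight loop as naive summation, there is no measurable performance penalty.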
For reference, Matlab, too, has recently recognized that naive summation is not so good in terms of accuracy, and has switched to an algorithm that is somewhat better (without specifying it in detail): https://blogs.mathworks.com/loren/2021/09/07/a-faster-and-more-accurate-sum/. Thus, there is no reason to feel bound to the current behaviour, but for the future we should of course document the algorithm that is used.
Finally, to show that the accuracy issues when summing doubles with default options (in the case of 64-bit accumulator precision, as I see here on 64-bit Octave) can be significant also for typical problems, I attach test_sum_integral.m (581 Bytes), a script that integrates exp(-x) between 0 and 1 by the midpoint rule (to be used with the oct-file sum_pair3.oct as attached to the bug discussion). There are rigorous error bounds for this rule, but starting at around N=2000000 function evaluations, the error of the numerical integral (with default summation) becomes larger than the analytical error bound, decreases only slowly in this range of N, and even increases again at very large N. By contrast, the error with pairwise summation is always smaller than the analytical bound, and in particular even reaches zero when this analytical bound is smaller than eps.
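For those who want to reproduce the setup without the attached files, the following Python stand-in shows the same computation (the function name is mine, not the script's; the analytical bound is the standard composite midpoint-rule estimate |E| <= (b-a) h^2 max|f''| / 24, which for f(x) = exp(-x) on [0,1] gives h^2/24):

```python
import math

def midpoint_exp_neg(n):
    """Midpoint-rule approximation of the integral of exp(-x) over [0, 1]
    with n subintervals, summed naively in double precision."""
    h = 1.0 / n
    s = 0.0
    for k in range(n):
        s += math.exp(-(k + 0.5) * h)  # midpoint of the k-th subinterval
    return h * s

exact = 1.0 - math.exp(-1.0)   # analytical value of the integral
n = 1000
err = abs(midpoint_exp_neg(n) - exact)
bound = (1.0 / n) ** 2 / 24    # rigorous midpoint-rule error bound
print(err <= bound)            # holds at this moderate n
```

At moderate n the bound holds comfortably; the point of the attached script is that with naive double summation it starts to be violated around N=2000000, where the accumulated rounding error of the summation overtakes the shrinking discretization error, while pairwise summation keeps the total error below the bound throughout.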