Why is sparse matrix exponentiation slower than iterated multiplication?

It is usually faster to compute the matrix power A ^ n than the repeated multiplication A * A * ... * A with n factors, where the exponent n is a positive integer. This is indeed the case for full matrices, but sparse matrix exponentiation is unexpectedly slower than repeated multiplication.

I have an adjacency matrix representing a large and sparse state transition graph:

>> whos adj
Variables visible from the current scope:

variables in scope: top scope

  Attr   Name        Size                     Bytes  Class
  ====   ====        ====                     =====  ===== 
   s     adj     25515x25515                1047824  double

Total is 651015225 elements using 1047824 bytes


>> numel (adj)
ans =        651015225

>> nnz (adj)
ans =            52731
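
That is a density of only about 8e-05:

>> nnz (adj) / numel (adj)
ans = 8.0998e-05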

>> unique (adj(:))
ans =
                       0
                       1

(It is represented as double instead of logical because it will be multiplied soon.)

It is a well-known property of 0/1 adjacency matrices that the (i,j)-th element of adj ^ n gives the number of paths of length n from i to j.
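
For instance, on a toy 3-node graph (a hypothetical illustration, not the actual data):

>> A = sparse ([0 1 1; 0 0 1; 0 0 0]);   # edges 1->2, 1->3, 2->3
>> full (A ^ 2)                          # exactly one path of length 2: 1->2->3
ans =

   0   0   1
   0   0   0
   0   0   0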

If we call adj ^ n and measure the time:

>> t = zeros(1,15); for i = 1:15, tic; tmp = adj ^ i; t(i) = toc; disp([i t(i)]); end
                       1     0.07019686698913574
                       2    0.003432989120483398
                       3      0.0135948657989502
                       4     0.03022098541259766
                       5      0.1219630241394043
                       6      0.2582740783691406
                       7      0.5015640258789062
                       8       0.906404972076416
                       9       3.269675016403198
                      10       5.212830066680908
                      11       7.401607036590576
                      12       11.35441994667053
                      13       16.15549778938293
                      14       27.15614080429077
                      15       51.56035304069519

If we instead build the product adj * adj * ... * adj with n factors and measure the time:

>> u = zeros(1,15); for i = 1:15, tic; tmp = eval(["adj" repmat(" * adj", 1, i-1)]); u(i) = toc; disp([i u(i)]); end
                       1     0.07631683349609375
                       2    0.003623008728027344
                       3     0.01030302047729492
                       4     0.02698016166687012
                       5     0.06217288970947266
                       6      0.1419768333435059
                       7      0.3206641674041748
                       8      0.7241461277008057
                       9       1.282343149185181
                      10       2.265716075897217
                      11       3.962852001190186
                      12       6.107408046722412
                      13       7.560194969177246
                      14       9.875128030776978
                      15       14.05766201019287

Comparing the times t and u, repeated multiplication is at least 3 times faster than exponentiation, even with the overhead of eval. For example, at the largest exponent tested:
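
>> t(15) / u(15)
ans = 3.6678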

Questions:

  1. Why does this happen?

  2. Should we intercept the case where a sparse matrix is being exponentiated to a positive integer and do repeated multiplication behind the scenes for performance?

Profiling only localizes the problem to the ^ operator, which is not new information.

>> profile on; tmp = adj ^ 15; profile off; T = profile ("info"); profshow(T)
   #            Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------
   1            binary ^            49.100     100.00            1
   2             profile             0.000       0.00            1
   4            binary <             0.000       0.00            1
   3              nargin             0.000       0.00            1
   5               false             0.000       0.00            1
   6 __profiler_enable__             0.000       0.00            1
>> profile on; tmp = adj*adj*adj*adj*adj*adj*adj*adj*adj*adj*adj*adj*adj*adj*adj; profile off; T = profile ("info"); profshow(T)
   #            Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------
   1            binary *            12.007     100.00           14
   2             profile             0.000       0.00            1
   5               false             0.000       0.00            1
   3              nargin             0.000       0.00            1
   4            binary <             0.000       0.00            1
   6 __profiler_enable__             0.000       0.00            1

Is the slower runtime of mpower because it calls matrix decomposition routines for all inputs, even when the exponent is just a positive integer? If so, why are only sparse matrices affected and not all matrices?

IIUC, the power operator for sparse matrices with integer exponent comes down to this implementation:
octave: 94488ab70e12 libinterp/corefcn/sparse-xpow.cc (gnu.org)

// Safer pow functions.  Only two make sense for sparse matrices, the
// others should all promote to full matrices.

octave_value
xpow (const SparseMatrix& a, double b)
{
  octave_value retval;

  octave_idx_type nr = a.rows ();
  octave_idx_type nc = a.cols ();

  if (nr == 0 || nc == 0 || nr != nc)
    error ("for A^b, A must be a square matrix.  Use .^ for elementwise power.");

  if (! xisint (b))
    error ("use full(a) ^ full(b)");

  int btmp = static_cast<int> (b);
  if (btmp == 0)
    {
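      // Exponent 0: build a sparse identity matrix of order nr.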
      SparseMatrix tmp = SparseMatrix (nr, nr, nr);
      for (octave_idx_type i = 0; i < nr; i++)
        {
          tmp.data (i) = 1.0;
          tmp.ridx (i) = i;
        }
      for (octave_idx_type i = 0; i < nr + 1; i++)
        tmp.cidx (i) = i;

      retval = tmp;
    }
  else
    {
      SparseMatrix atmp;
      if (btmp < 0)
        {
          btmp = -btmp;

          octave_idx_type info;
          double rcond = 0.0;
          MatrixType mattyp (a);

          atmp = a.inverse (mattyp, info, rcond, 1);

          if (info == -1)
            warning ("inverse: matrix singular to machine precision, rcond = %g", rcond);
        }
      else
        atmp = a;

      SparseMatrix result (atmp);

      btmp--;

      while (btmp > 0)
        {
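          // Exponentiation by squaring: square atmp once per bit of the
          // exponent, multiplying it into result whenever that bit is set.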
          if (btmp & 1)
            result = result * atmp;

          btmp >>= 1;

          if (btmp > 0)
            atmp = atmp * atmp;
        }

      retval = result;
    }

  return retval;
}

I haven’t dug any further.

Thanks @mmuetzel.

I added timing statements to the xpow function in sparse-xpow.cc to measure the speed. The instrumented function is here:
new.cc (2.9 KB)

If I ask it to compute adj ^ 15, it uses internal variables to compute this sequence:

atmp   = adj;
result = adj;

atmp   =   atmp * atmp;   # == adj^2
result = result * atmp;   # == adj^3
atmp   =   atmp * atmp;   # == adj^4
result = result * atmp;   # == adj^7
atmp   =   atmp * atmp;   # == adj^8
result = result * atmp;   # == adj^15

It takes only 1.3 seconds to compute everything up to the penultimate line, at which point result == adj^7 and atmp == adj^8. The last line, which is effectively adj^7 * adj^8, then takes 48 seconds by itself, for a total of about 49.4 seconds.

The observation is that the time to multiply two sparse matrices together depends strongly on the number of non-zero elements in each operand (their nnz values).

The way the loop inside xpow divides the work, it needs only six matrix multiplications to compute adj ^ 15, but the last of them is adj^7 * adj^8, whose operands are already quite dense and therefore expensive to multiply. That last step’s matrices have nnz values of 10084689 and 20417240. The product of those two counts, about 2.05e14, is an ad hoc measure of the difficulty of the multiplication.

By contrast, the following linear sequence takes 14 multiplications instead of six, but is 3X faster because the nnz values involved stay much lower:

>> tic; tmp = adj; s = t = toc; for i = 2:15, tic; tmp = tmp*adj; t = toc; s += t; disp([i nnz(tmp) nnz(adj) t s]); end
       2                  123287                   52731        0.00360107421875      0.0803229808807373
       3                  293563                   52731    0.007605075836181641     0.08792805671691895
       4                  723532                   52731     0.01813507080078125      0.1060631275177002
       5                 1821997                   52731     0.03706502914428711      0.1431281566619873
       6                 4462682                   52731     0.08295392990112305      0.2260820865631104
       7                10084689                   52731      0.1829390525817871      0.4090211391448975
       8                20417240                   52731      0.3539738655090332      0.7629950046539307
       9                36348384                   52731      0.6704421043395996        1.43343710899353
      10                57837218                   52731       1.032677888870239        2.46611499786377
      11                84296754                   52731       1.517933130264282       3.984048128128052
      12               111963528                   52731       1.812459945678711       5.796508073806763
      13               135810390                   52731       2.357160091400146       8.153668165206909
      14               154540204                   52731       2.447211027145386       10.60087919235229
      15               167744705                   52731       2.792289972305298       13.39316916465759

The last step’s matrices have nnz values 154540204 and 52731, whose product is only 8.15e12. Even summed over all 14 multiplications, that measure comes to only about 32e12, still much cheaper than the single dense-dense multiplication in the squaring approach.
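
To put rough numbers on that, here is a back-of-the-envelope check using the nnz values from the table above (p(k) stands for nnz(adj^k); this is just the ad hoc measure, not a real cost model):

>> p = [52731 123287 293563 723532 1821997 4462682 10084689 20417240];
>> p(1)^2 + p(1)*p(2) + p(2)^2 + p(3)*p(4) + p(4)^2 + p(7)*p(8)   # six squaring steps, ~2.07e14
>> 52731 * sum ([p 36348384 57837218 84296754 111963528 135810390 154540204])   # 14 chain steps, ~3.3e13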

The tradeoff is therefore:

  • The current xpow uses a logarithmic number of multiplications but its matrices get rapidly denser.

  • The linear chain uses a linear number of multiplications but one of the matrices is always sparse.

  • For sufficiently large exponents, the logarithmic technique will outperform the linear chain, but based on a few experiments that threshold is somewhere around 40-50. For exponents below it, a simple linear chain is faster.

  • The sparser the original matrix, the higher the threshold will be: sparser matrices favor linear multiplication.

I’ll prepare a test patch in a day or two that uses linear multiplication for exponents less than some estimated threshold, and uses the current technique for larger ones. We could use that to measure performance.
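
In m-code terms, the plan amounts to something like this (only a sketch of the idea, not the actual C++ patch; the function name and the threshold of 40 are placeholders based on the experiments above):

function B = hybrid_pow (A, n, threshold = 40)
  # Sketch: linear chain below the threshold, squaring above it.
  # n is assumed to be a positive integer.
  if (n < threshold)
    B = A;
    for k = 2:n
      B = B * A;         # one operand always stays maximally sparse
    endfor
  else
    B = speye (size (A));
    while (n > 0)
      if (mod (n, 2) == 1)
        B = B * A;       # this bit of the exponent is set
      endif
      n = floor (n / 2);
      if (n > 0)
        A = A * A;       # square for the next bit
      endif
    endwhile
  endif
endfunction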

BTW, if you want to try the original adj I used, it’s here in Octave text format:
adj.txt (671.8 KB)
(use load adj.txt, or you can use any other test matrix of course).

Here’s the patch:
sparse-xpow.patch (2.2 KB)

It consistently speeds things up by 1X to 4X for me on real-life data. It could use more testing with other matrices to get a better value for the threshold.
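
If you want to try it on your own data, a quick ad hoc comparison of the two strategies on a sparse test matrix M looks like this:

>> n = 15; tic; P = M ^ n; toc
>> tic; Q = M; for k = 2:n, Q = Q * M; end; toc
>> isequal (P, Q)   # true for 0/1 matrices as long as path counts stay below flintmax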

Managed to obtain a comparison overnight. All times are averaged over 10 runs.

Matrix order     NNZ   A^15 Unpatched (s)   A^15 Patched (s)   A^27 Unpatched (s)   A^27 Patched (s)
          37      55           4.8685e-05          5.095e-05           6.1679e-05          5.9891e-05
          80     128            0.0002455         0.00024111           0.00048048          0.00048935
          96     148            0.0001714         0.00017078           0.00024743          0.00024734
         135     256           0.00077558         0.00074873            0.0011355           0.0011659
         576     944            0.0034613          0.0041948             0.010721            0.010716
         513    1005              0.01279          0.0053149             0.026162            0.027873
         621    1242             0.070274           0.015379              0.20087              0.2205
        1024    1728             0.018186           0.019714              0.16838             0.16702
        1345    2243             0.015245           0.016588             0.083618            0.080235
        6913   11779              0.29151            0.21983               3.9252             0.77014
        6561   13311               2.8144            0.51711               17.991               1.454
       11264   19456               1.0456            0.95784                37.53               4.938
       15360   26368               1.0272            0.89489                24.84              3.7154
       22599   46332               26.809             5.1304               386.89              15.547
       25515   52731                 53.4             12.527               2167.7              47.985
       TOTAL                       85.509             20.304               2639.4              74.918

Since the patched version is so much faster than the unpatched one, I’ll push the patch out later today to get a wider test audience.

EDIT: Pushed to octave: 45984c799215
Ready for test.

I won’t be able to test for a week or so. But should something similar be done also for complex sparse matrix exponentiation? (A few lines down in the same file.)

Yes, the tests for the complex case are in progress. They are running much more slowly than the real case, so it could be a few days before we know whether it needs the same change.

Finished the run for the complex values. All times are averaged over 10 runs.

Matrix order     NNZ   A^15 Unpatched (s)   A^15 Patched (s)   A^27 Unpatched (s)   A^27 Patched (s)
          37      55           6.2656e-05           6.33e-05           7.7558e-05          6.7997e-05
          80     128           0.00028791         0.00029867           0.00060687          0.00061529
          96     148            0.0001914         0.00019517           0.00029328          0.00029461
         135     256           0.00096793         0.00097463            0.0014704           0.0014707
         576     944            0.0041785          0.0050792             0.013573            0.013875
         513    1005             0.016765          0.0070391             0.035326             0.03474
         621    1242             0.082263           0.020143              0.26953             0.27005
        1024    1728             0.021313           0.025106              0.21333              0.2071
        1345    2243             0.018935           0.021108             0.097467            0.098959
        6913   11779              0.34025            0.32853               4.8104              1.1531
        6561   13311               3.2174            0.80879                21.14              2.0761
       11264   19456               1.3037             1.3903               46.802              7.0871
       15360   26368               1.2126             1.2745               30.039              5.3255
       22599   46332               30.587             8.2568               468.62               22.03
       25515   52731               62.252             20.678               2668.3              65.522
       TOTAL                       99.058             32.817               3240.3              103.82

The data files are too large to upload to Discourse, but I obtained them by taking the real matrices from the previous series of tests, adding complex (0, randn) to each of the nonzero elements, and then normalizing each nonzero to an absolute magnitude of 1.
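
In Octave terms, the construction was roughly this (a reconstruction from the description above, with A one of the real test matrices):

>> [i, j, v] = find (A);
>> w = v + complex (0, randn (size (v)));   # add complex (0, randn) to each nonzero
>> w ./= abs (w);                           # normalize to absolute magnitude 1
>> Ac = sparse (i, j, w, rows (A), columns (A));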

The behavior trend is the same as before for the real case.

I’ll prepare and push a patch soon so this too can be tested widely.

EDIT: Pushed here: octave: 3eab70385569

On a related topic, there is this helper function at the top of the file:

static inline int
xisint (double x)
{
  return (octave::math::x_nint (x) == x
          && ((x >= 0 && x < std::numeric_limits<int>::max ())
              || (x <= 0 && x > std::numeric_limits<int>::min ())));
}
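
Incidentally, this is the check that rejects non-integer exponents on the sparse path; the message comes from the error call in xpow shown earlier:

>> adj ^ 2.5
error: use full(a) ^ full(b)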

Is there any objection to changing the return type to bool instead of int? (Why does it return an int anyway?)

Afaics, that function is only used in a Boolean context.
Additionally, it is a static function in a .cc file, so it is probably safe to change the return type to bool.
Since the function is probably inlined everywhere it is used, it might not matter for the generated code, but it’s probably clearer for a programmer who looks at the file.

Thanks. I pushed the int to bool change here: octave: 44cf6bbeeca9

I’d guess that it was defined to return int instead of bool because it was written before people were in the habit of using bool. If I remember correctly, Octave has been around longer than bool has been a fundamental data type in C++.

I thought the same, that it predates the bool type in C++. I recall having to write my own enum boolean in that era. But as @mmuetzel said, changing it to bool only helps future programmers, not the code.