A small benchmark of interpreted code on three versions of Octave

I recently had the opportunity to revisit some old Octave code whose runtime is dominated by a for-loop containing an if-else tree, with very few library function calls and no opportunity for vectorization or broadcasting: hilb_bench.m (3.1 KB)

I used it as a benchmark on different versions of Octave and got the following results:

  • Octave 5.2.0: 281 ± 3 seconds.
  • Octave 7.1.0: 243 ± 3 seconds.
  • Octave 8.0.0: 234 ± 3 seconds.

Anyway, I thought it was a small but nice demonstration of continued improvement over time for the part of Octave that does not benefit from compiled C++ functions, optimized libraries like OpenBLAS, or techniques like vectorization or broadcasting. If we get JIT compilation working again in the future, this sort of Octave code would be the first to benefit.

(I would have tested more versions, but old versions of Octave no longer build easily for me, and version 6.x does not build because of the SUNDIALS library API change. All versions were built afresh.)

BTW, if anyone is interested in creating a proper benchmark suite to test for performance regressions, I’m happy to contribute.

Does the profiler tell us anything interesting about where the time is being spent and which parts have improved?

Profiling was of limited use. The vast majority of the runtime was self-time in the benchmark function itself, not accounted for by the component operators, so it must be the time taken to interpret and execute the if-else ladder.

The profile command for all versions:

profile off; profile clear; profile on; hilb_bench; profile off; T = profile("info"); profshow(T)

This is for 5.2.0:

   #          Function Attr     Time (s)   Time (%)        Calls
----------------------------------------------------------------
   1        hilb_bench           263.576      86.87            1
   6         binary ==            17.770       5.86     78448317
  20               toc             7.249       2.39     34728321
  27          binary -             4.643       1.53     34728193
  29         prefix ++             3.800       1.25     16777216
  31         binary >=             3.563       1.17     34728187

This is for 7.1.0:

   #          Function Attr     Time (s)   Time (%)        Calls
----------------------------------------------------------------
   1        hilb_bench           229.939      83.20            1
   5         binary ==            26.486       9.58     78448317
  20               toc             5.564       2.01     34728311
  27          binary -             4.508       1.63     34728193
  29         prefix ++             3.882       1.40     16777216
  31         binary >=             3.054       1.10     34728187

This is for 8.0.0:

   #          Function Attr     Time (s)   Time (%)        Calls
----------------------------------------------------------------
   1        hilb_bench           219.797      82.26            1
   5         binary ==            26.946      10.08     78448317
  20               toc             6.225       2.33     34728307
  27          binary -             4.697       1.76     34728193
  31         binary >=             3.535       1.32     34728187
  29         prefix ++             3.122       1.17     16777216

I did notice that the following two things made a difference between different implementations of the benchmark (not to be confused with different versions of Octave), and the effect held for all Octave versions tested:

  • Calling code like foo = foo([front up down left right back]); inside the loop is about 10% to 15% slower than calling foo = foo (rollleft); where rollleft has been defined outside the loop as [front up down left right back]. Evidently parsing and constructing a new index vector on the fly takes more time than reusing the same identifier. The linked version above does use the faster pattern.

  • Changing the single characters "+", "<", etc. to named identifiers like yleft did not help, likely because they were already single characters, but converting all character data to a numeric type (double, single, or int8, all with largely the same performance) did reduce runtime by another 10% to 15%. I chose not to do that for the benchmark as linked above, but it indicates that character processing could be sped up by 10% to 15% by converting to a numeric representation.

But neither of those observations explains why 8.0.0 is faster than 5.2.0 by roughly 17%, or where the next improvement will come from, other than JIT-compiling small sections like if-else ladders or for-loops. Maybe using the coder package to convert a time-consuming loop into a separate function, and then that function into an oct file, would be a first step for an end user, though that is ahead-of-time rather than just-in-time compilation.
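
For anyone wanting to try the oct-file route, a minimal sketch of moving a per-element if-else ladder into C++ might look like the following (the function name and the toy computation are hypothetical, not taken from hilb_bench.m):

// classify_sign.cc -- build with: mkoctfile classify_sign.cc
#include <octave/oct.h>

DEFUN_DLD (classify_sign, args, ,
           "c = classify_sign (x): return -1, 0, or +1 per element")
{
  if (args.length () != 1)
    print_usage ();

  NDArray x = args(0).array_value ();
  NDArray c (x.dims ());

  // The per-element if-else ladder that the interpreter executes slowly
  // runs at compiled speed here.
  octave_idx_type n = x.numel ();
  for (octave_idx_type i = 0; i < n; i++)
    {
      if (x(i) > 0)
        c(i) = 1;
      else if (x(i) < 0)
        c(i) = -1;
      else
        c(i) = 0;
    }

  return ovl (c);
}

Once compiled, classify_sign (x) is callable from Octave like any other function.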

Octave has slowed down quite a bit over the years, and it's not entirely clear why. Attempts to run profiling tools on the C++ code base have not been particularly successful. There is a lot of memory allocation/de-allocation, which might be one root cause, but it hasn't been definitively identified.

I had thought for a while that it might be related to the construction of temporary octave_value() and octave_value_list() objects that are returned from Octave DEFUN functions, but I couldn’t pinpoint it to that in the end.

I keep compiled versions of Octave going quite far back. I ran the hilb_bench.m code using

octave-XXX -f hilb_bench.m

where XXX was the version number; the -f switch was used so that no user configuration could change the test results. Timings in seconds were:

octave-dev: 293.7
octave-6.4: 281.4
octave-5.2: 392.1
octave-4.4: 403.4
octave-4.0: 335.7
octave-3.8: 267.7
octave-3.6: 196.1
octave-3.4: 164.6

I would love to sort this out. It probably needs to be a specific project, along the lines of a Google Summer of Code idea. Could also bring it up at the next Octave Developer’s Meeting.

@rik thank you for the longer data from older versions. We both see the speedup from 5.2 to current, but I had no idea that 3.x was so fast. Of course, back then I used a slow 2011-era computer. I'll try to get 3.4 built if I can. (Edit: giving up on 3.4 – the build keeps failing with one gnulib error after another.)

I'm in favor of the debottlenecking operation you describe. It'll need more detailed C++ profile information than what we currently have, and is a project in its own right.

I ran a “poor man’s profiler” on Octave while it was running hilb_bench.m. Out of 1000 samples, the call stacks below appeared in every sample; everything else appeared in 21 samples or fewer.

   1000 poll,??,g_main_context_iteration,QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>),QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>),QCoreApplication::exec(),octave::base_qobject::exec(),octave::qt_application::execute(),main

   1000 poll,??,g_main_context_iteration,QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>),QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>),QThread::exec(),??,??,start_thread,clone

   1000 poll,??,xcb_wait_for_event,??,??,start_thread,clone

   1000 sigtimedwait,sigwait,signal_watcher,start_thread,clone

Looks like it spends the majority of its time polling for new events in the queue, or waiting for events – is that an accurate interpretation? What is that “sigtimedwait” / “sigwait” for? Are they connected to the sleep statements in parse-tree/pt-eval.cc?
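
If I had to guess, the sigtimedwait/sigwait stack is the classic dedicated signal-watcher thread pattern, where every thread blocks the signals of interest and a single thread receives them synchronously instead of through asynchronous handlers. A generic sketch of that pattern (my assumption of what such a thread does, not Octave's actual code):

#include <csignal>
#include <cstdio>
#include <pthread.h>

// One thread waits synchronously for signals that all threads have blocked.
static void *signal_watcher (void *)
{
  sigset_t set;
  sigemptyset (&set);
  sigaddset (&set, SIGINT);

  for (;;)
    {
      int sig = 0;
      if (sigwait (&set, &sig) == 0)
        std::printf ("caught signal %d\n", sig);  // dispatch to a handler
    }

  return nullptr;
}

int main ()
{
  sigset_t set;
  sigemptyset (&set);
  sigaddset (&set, SIGINT);
  pthread_sigmask (SIG_BLOCK, &set, nullptr);  // block before spawning threads

  pthread_t tid;
  pthread_create (&tid, nullptr, signal_watcher, nullptr);
  pthread_join (tid, nullptr);  // runs until the process is killed
}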

The “poor man’s profiler” I ran was this:

#!/bin/bash
## Repeatedly snapshot all thread backtraces of a running Octave with gdb.
nsamples=1000
sleeptime=0.1
pid=$(pidof lt-octave-gui)   ## (pidof octave-gui) is OK too

for x in $(seq 1 $nsamples)
  do
    gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid
    sleep $sleeptime
  done >tmp.tmp

## Collapse each backtrace into one comma-separated line of frame names,
## then count and rank the unique stacks.
cat tmp.tmp | awk '
  BEGIN { s = ""; }
  /^Thread/ { print s; s = ""; }
  /^\#/ { if (s != "" ) { s = s "," $4} else { s = $4 } }
  END { print s }' | sort | uniq -c | sort -k1nr

@rik, I’ve added this topic to the developers meeting next week. Online Developer Meeting (2022-04-26) - Octave

A comment on this experiment: I changed the function toc (in libinterp/corefcn/data.cc, near line 6890) from this:

  octave_value retval;
  if (nargout > 0)
    retval = etime;
  else
    octave_stdout << "Elapsed time is " << etime << " seconds.\n";

  return retval;

to this:

  if (nargout > 0)
    return octave_value (etime);
  else
    {
      octave_stdout << "Elapsed time is " << etime << " seconds.\n";
      return octave_value ();
    }

The thinking was to see if constructing an empty octave_value and then assigning etime to it was slower than directly initializing it with etime.

Testing it with this code:

octave:1> f = 0; tic; for i = 1:1e7, f = toc; end; toc

several times indicates that the new code is about 1.4% faster than the old code. That may seem small, but the pattern shows up very frequently in the whole codebase: creating an empty retval using the default constructor and then assigning the value to it in a second step.
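
To illustrate why the two-step pattern costs more, here is a stripped-down, hypothetical stand-in (not actual Octave source): the two-step version runs a default constructor plus a conversion and an assignment, while direct initialization runs a single constructor and is eligible for copy elision.

#include <iostream>

struct Value  // hypothetical stand-in for octave_value
{
  Value () { std::cout << "default ctor\n"; }
  Value (double) { std::cout << "double ctor\n"; }
  Value (const Value&) { std::cout << "copy ctor\n"; }
  Value& operator= (const Value&) { std::cout << "assignment\n"; return *this; }
};

Value two_step (double d)
{
  Value retval;      // default ctor
  retval = d;        // temporary from double ctor, then assignment
  return retval;     // possible copy, unless NRVO applies
}

Value direct (double d)
{
  return Value (d);  // double ctor only; the copy is elided (guaranteed in C++17)
}

int main ()
{
  std::cout << "two_step:\n";
  two_step (1.0);
  std::cout << "direct:\n";
  direct (1.0);
}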

I am contemplating changing as many instances of that pattern as I can to the “direct construction with the final value” pattern throughout the codebase. Acknowledging that it'll be a slow and tedious project, are there any blockers to such a change?

We would have to sift through past bug reports and Discourse conversations, but I think it would be worthwhile to start collecting the past performance-related discussions of other issues. I recall other findings similar to your note above about assigning data to a variable before passing it to a function, such as unexpectedly large overhead for same-file subfunctions versus “flat” code all in the main function, all of which had an unexpectedly heavy impact on performance.

It might be worth collecting such issues, even if we don't yet have a clue how to address them, in order to put a solid benchmark set together. We've always expected the looping-versus-vectorizing difference to be big, but the other effects are sometimes surprising.

(Wiki post, edit as needed)

I took @nrjank’s advice and compiled a short list of related threads. This is in approximate order from wide-ranging to narrow and focused:

I was looking through the warning options available in GCC, and there's one called -Weffc++ that flags several patterns which can unintentionally cost performance. Several of those warnings read “X should be initialized in the member initialization list”, which refers to a member being initialized twice: once by its default constructor and then again by assignment with the actual value. I'm not sure yet whether they're false positives, but they're worth looking into.
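
For example, a hypothetical class that would trigger that warning, along with the fix, looks like this:

#include <string>

class labeled_value
{
  std::string m_label;
  double m_value;

public:
  // -Weffc++ warns that m_label "should be initialized in the member
  // initialization list": it is default-constructed first and then
  // assigned, i.e., effectively initialized twice.
  labeled_value (const std::string& label, double value)
  {
    m_label = label;
    m_value = value;
  }

  // The preferred form initializes each member exactly once:
  //
  // labeled_value (const std::string& label, double value)
  //   : m_label (label), m_value (value)
  // { }
};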
