WIP stack VM for Octave

Hi!

Concerning another discussion about VMs for Octave, I thought I should share a partial implementation of one that I made during sleepy time on my paternity leave.

Essentially it is a custom stack-ish VM that calls into Octave's C++ code for the different opcodes. There is one stack for data.
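
For readers who have not seen one before, the core of such a VM is a dispatch loop over the opcodes. The following is only a rough sketch of the general shape, with made-up opcode names and a plain double stack; the actual code in pt-bytecode-vm.cc pushes octave_value objects and calls into the existing C++ evaluator code instead.

#include <cstddef>
#include <cstdint>
#include <vector>

enum class opcode : std::uint8_t { push_const, add, ret };

// Hypothetical minimal interpreter: one data stack and a switch per opcode.
double run (const std::vector<std::uint8_t>& code, const std::vector<double>& constants)
{
  std::vector<double> stack;    // the single data stack
  std::size_t ip = 0;           // instruction pointer into the bytecode

  while (true)
    {
      switch (static_cast<opcode> (code[ip++]))
        {
        case opcode::push_const:        // operand byte: index into the constant pool
          stack.push_back (constants[code[ip++]]);
          break;

        case opcode::add:               // pop two operands, push their sum
          {
            double b = stack.back (); stack.pop_back ();
            double a = stack.back (); stack.pop_back ();
            stack.push_back (a + b);
          }
          break;

        case opcode::ret:               // return the value on top of the stack
          return stack.back ();
        }
    }
}

// Hypothetical usage: bytecode for "2 + 3" with constant pool {2.0, 3.0}
//   run ({0, 0,  0, 1,  1,  2}, {2.0, 3.0});   // -> 5.0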

There is still some functionality missing, e.g.:

  • Catching exceptions (it just unwinds now)
  • eval, assignin, clear, etc.
  • switch
  • classes
  • Stack overflow check
  • Only 256 slots for variables (needs “far slots”)
  • Automatic compilation
    … and more. E.g. it can’t unwind in a for-loop yet.

My approach might need some rework to make stack fiddling with eval and friends work – that’s probably the main annoying todo part missing.

The speed gains from using the VM depend on what code is executed. For scalar for-loops it is about 5-10x and for recursive functions it is about 15-20x. Roughly the square root of the speed improvement comes from other changes in the hooked-in code, and those apply to the tree-walking interpreter too. There is probably some more low-hanging fruit to pick to speed up the VM.

See the attached patches. They are squashed into two patches from many.

octave_31191.patch (839.9 KB)
octave_31192.patch (7.5 KB)

The interesting files are:
pt-bytecode-vm.cc
pt-bytecode-walk.cc

and maybe the test/compile folder.

Usage:

compile foo % Compiles function foo in foo.m to bytecode
compile foo clear % Deletes bytecode
compile foo print % Print opcodes

Example:

function i = testspeed_bc ()
  tic; 
  sortperf(50000);
  toc; % 3.9s -> 0.27s
  tic;
  for j = 1:10000000
    b = 1*2*3*4*5*6*7*8;
  end
  toc; %8.5s -> 1.35s
  tic;
  for j = 1:10000000
    sin (j);
  end
  toc; % 22.2s -> 2.07s
end

function b = qsort(a)
    b = qsort_kernel(a, 1, length(a));
end

function a = qsort_kernel(a, lo, hi)
    i = lo;
    j = hi;
    while i < hi
        pivot = a(floor((lo+hi)/2));
        while i <= j
            while a(i) < pivot, i = i + 1; end
            while a(j) > pivot, j = j - 1; end
            if i <= j
                t = a(i);
                a(i) = a(j);
                a(j) = t;
                i = i + 1;
                j = j - 1;
            end
        end
        if lo < j; a = qsort_kernel(a, lo, j); end
        lo = i;
        j = hi;
    end
end

function v = sortperf(n)
    v = rand(n,1);
    v = qsort(v);
end

Recording my experiences so far.

I made a local clone of the octave source tree for this VM. It gave me a warning on applying the patches:

$ hg import octave_31191.patch
applying octave_31191.patch
warning: import the patch as a normal revision
(use --exact to import the patch as a merge)

$ hg import octave_31192.patch
applying octave_31192.patch

But using --exact caused an error about unknown revisions, so I returned to the above. The patches applied to the default branch:

$ hg heads
changeset:   31102:38f8ea8994d5
bookmark:    @
tag:         tip
user:        Petter Tomner <tomner@kth.se>
date:        Tue Jun 21 12:42:25 2022 +0200
summary:     WIP: Fixing merge problems

changeset:   31097:b47bf773c508
branch:      stable
parent:      31094:a30f6c08d7b5
user:        Markus Mützel <markus.muetzel@gmx.de>
date:        Mon Jun 20 19:31:21 2022 +0200
summary:     canonicalize_file_path: Do not translate mapped network drive to UNC path (bug #62576).

After that, bootstrap and configure were uneventful, which was good. The configure flags I used were the same as for my primary Octave tree:

../configure --enable-std-pmr-polymorphic-allocator --enable-largefile --enable-year2038 --enable-threads=posix --enable-64 --enable-openmp --enable-readline --disable-java --enable-docs

Building was successful with Clang 13.0.1 and GCC 12.1.0, but each gave many instances of the following warnings. Warnings from Clang:

../liboctave/util/tiny_vector.h:130:35: warning: unknown attribute 'noclone' ignored [-Wunknown-attributes]

../libinterp/parse-tree/pt-bytecode-vm.h:229:21: warning: unknown attribute 'optimize' ignored [-Wunknown-attributes]
    __attribute__ ((optimize("no-gcse","no-crossjumping")));
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Warnings from GCC:

../libinterp/octave-value/ov.h:1784:9: warning: 'void operator delete(void*, std::size_t)' called on unallocated object 'octave_value::nil_ov_base' [-Wfree-nonheap-object]
 1784 |   class octave_nil : public octave_base_value {
      |         ^~~~~~~~~~


../liboctave/array/Array.cc:1889:19: warning: '*(octave::tiny_vector<long int, 3>*)((char*)&dv + offsetof(dim_vector, dim_vector::m_data)).octave::tiny_vector<long int>::m_data.octave::tiny_vector<long int>::vec_or_arr::m_arr[<unknown>]' may be used uninitialized [-Wmaybe-uninitialized]
 1889 |   octave_idx_type ns = dv(dim);
      |                   ^~

Results with the Clang version:

Baseline: [EDIT: This is with the patches but without function compilation.]

octave:3> testspeed_bc
Elapsed time is 1.88821 seconds.
Elapsed time is 2.64858 seconds.
Elapsed time is 4.3992 seconds.

With the VM:

octave:4> compile testspeed_bc
ans = 1

octave:5> testspeed_bc
Elapsed time is 0.33287 seconds.
Elapsed time is 1.51787 seconds.
Elapsed time is 1.93973 seconds.

Results with the GCC version:
Baseline: [EDIT: This is with the patches but without function compilation.]

octave:3> testspeed_bc
Elapsed time is 1.93343 seconds.
Elapsed time is 2.77571 seconds.
Elapsed time is 4.82576 seconds.

With the VM:

octave:4> compile testspeed_bc
ans = 1

octave:5> testspeed_bc
Elapsed time is 0.335549 seconds.
Elapsed time is 1.57483 seconds.
Elapsed time is 2.39864 seconds.

Overall very promising! Will be testing more.

Question to @Petter:
Is there a way to apply compile to script files or do they have to be Octave functions?

Nice to see it compiles and works with Clang. I’m new to hg so I might have messed up the patch exporting.

I figure “baseline” is with the patches applied? It would be interesting to see the runtime on stock Octave on someone else’s computer.

No, there is only support for compiling functions. To compile scripts, there needs to be support for accessing the base workspace and global variables. I see no big obstacle to implementing that.

In general I think the main hurdle would be functions that mess with the VM stack (clear, eval, assignin) and maybe debugger support.

Yes, “Baseline” in my previous comment was referring to the patches applied but the function not compiled. Here is the “Original baseline” for Octave without the patches applied. This was built with Clang 13. The version with GCC 12.1 is slightly slower.

octave:1> testspeed_bc
Elapsed time is 3.66793 seconds.
Elapsed time is 6.94847 seconds.
Elapsed time is 23.1186 seconds.

I will edit the previous comment for clarity.

EDIT: Now in table form:

Test \ Context                          | Original Octave (not patched) | Patched only | Patched and compiled | Patch speedup | Compile speedup
sortperf(5e4)                           | 3.67s                         | 1.89s        | 0.33s                | 1.9x          | 11x
for j = 1:1e7, b = 1*2*3*4*5*6*7*8; end | 6.95s                         | 2.65s        | 1.52s                | 2.6x          | 4.6x
for j = 1:1e7, sin (j); end             | 23.12s                        | 4.40s        | 1.94s                | 5.3x          | 12x

What is the difference between “patched only” and “patched and compiled”?

In @Petter’s code, there is a special command called compile that converts an Octave m-function into byte code. To use that, we do need to apply his patches (patched & compiled), but one can also apply his patches and still execute uncompiled m-functions as interpreted code, not as bytecode (patched only, not compiled).

So what changes in the “patched only” code contribute most to the performance improvement? Maybe we could start there.

@Petter would be more authoritative in answering that. The things I could spot were:

  • Replacing dynamic_cast with static_cast in several places through a macro DYNORSTAT_CAST

  • A memory pool for local objects. As with all memory pools, it cuts down the overhead of dynamic allocation. The new and delete operators are overloaded appropriately.

  • Changing certain return-by-value functions to return-a-reference or return-a-const-reference patterns. Generally more use of const foo& fcn instead of foo fcn.

  • Avoiding repeated calls to typeid by replacing them with a fixed enum? (This also replaces string comparison with integer comparison.)

The cast change is here: GNU Octave - Patches: patch #10139, interpreter: static instead of... [Savannah]. There was some discussion.
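
For context, the usual shape of such a cast helper, and my guess at the idea behind a macro like DYNORSTAT_CAST (this is not the actual definition from the patch), is to keep the dynamic_cast only as a debug-build check:

#include <cassert>

// Hypothetical helper: verify the downcast with dynamic_cast in debug builds,
// but pay only for a static_cast when NDEBUG is defined.  Assumes From is a
// polymorphic base class and To is a pointer to a derived type.
template <typename To, typename From>
To checked_static_cast (From *p)
{
  assert (p == nullptr || dynamic_cast<To> (p) != nullptr);
  return static_cast<To> (p);
}

// Hypothetical usage, with placeholder type names:
//   derived_value *d = checked_static_cast<derived_value *> (base_ptr);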

The memory pool change is here: GNU Octave - Patches: patch #10160, interpreter: Object pool for... [Savannah]. There is no discussion in the tracker, but I recall a brief mention in a developer’s meeting.
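
As background, a class-specific pool of this kind is typically a free list behind overloaded operator new and operator delete. A generic sketch of the technique (not the code from the patch) could look like this:

#include <cstddef>
#include <new>

// Generic free-list pool sketch: recycle fixed-size blocks instead of hitting
// the global allocator for every short-lived object.  (Hypothetical class,
// not the pool from the patch.)
class pooled
{
public:
  static void * operator new (std::size_t size)
  {
    if (size == sizeof (pooled) && s_free_list)
      {
        node *n = s_free_list;      // reuse a previously freed block
        s_free_list = n->next;
        return n;
      }
    return ::operator new (size);   // pool empty (or derived class): fall back
  }

  static void operator delete (void *p, std::size_t size) noexcept
  {
    if (p && size == sizeof (pooled))
      {
        // Push the block onto the free list for reuse; it is never returned
        // to the general-purpose allocator while the program runs.
        node *n = static_cast<node *> (p);
        n->next = s_free_list;
        s_free_list = n;
      }
    else
      ::operator delete (p);
  }

private:
  struct node { node *next; };
  static inline node *s_free_list = nullptr;

  double m_payload = 0;   // placeholder member, keeps sizeof (pooled) >= sizeof (node)
};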

Returning const reference is here: GNU Octave - Patches: patch #10147, interpreter: Avoid string... [Savannah]. Would move assignment operators also possibly solve this problem?

The typeid → enum change is part of GNU Octave - Patches: patch #10167, interpreter: Speedup patchset [Savannah] in the file octave_30618.patch from the attached zip file. There are 24 separate changes in that one patch tracker item. It would probably be helpful to address them separately as I think discussing all of those changes in a single tracker item will become confusing.

@jwe

I got the “non-VM” patches split up here instead of in the messy merge blob:
speedpatches 20220621.zip (115.8 KB)

It is somewhat more than the patchset I dropped on Savannah earlier. Sorry for the Swedish commit messages on some of them =)

It is hard to say what contributes most to performance, since the speed improvements tend to block each other unless they are all done.

But in general I think the major performance improvement comes from:

  • Less use of malloc (memory pools, the in-place “tiny_vector”, less copying and more use of references, const static strings)
  • Caching of function lookups
  • dynamic_cast → static_cast
  • Non-atomic octave_value reference counter (see the sketch after this list)
  • Nudging the C++ compiler to do the right thing …
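
To illustrate what the reference-counter item above is about, here is a standalone sketch with made-up class names (not Octave's actual counter): the non-atomic version is plain integer arithmetic, but it is only safe if the counted objects never cross thread boundaries.

#include <atomic>
#include <cstddef>

// Thread-safe counter: every retain/release is an atomic read-modify-write,
// which is noticeably more expensive on hot paths.
struct atomic_refcount
{
  std::atomic<std::size_t> m_count {1};

  void retain () { m_count.fetch_add (1, std::memory_order_relaxed); }
  bool release () { return m_count.fetch_sub (1, std::memory_order_acq_rel) == 1; }
};

// Single-threaded counter: plain integer arithmetic, but only correct if the
// counted objects never move between threads (e.g. interpreter <-> GUI).
struct plain_refcount
{
  std::size_t m_count = 1;

  void retain () { ++m_count; }
  bool release () { return --m_count == 0; }
};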

I don’t think any of those changes is noticeable to the user. Or at least that is how I tried to do it.

Some patches, like 30632, were dead ends and probably not worth the complexity.

I thought we needed atomic reference counts to allow passing these objects between the interpreter and GUI threads. Is that not needed?

Ye. Dropping such a huge patch set like this is not optimal. I could start drip-feeding speed-related patches if you want. My main goal was to try to make a VM for the fun of it, and the speed patches were a bonus I had to do to not get stuck performance-wise.

Ye I am not too sure about that particular change. I have not looked into the GUI code. Uses of liboctave could be an issue too.

Almost. If I remember correctly, the difference is that the id becomes an “immediate literal” instead of some value in data memory, i.e. it is part of the instruction and does not have to be loaded separately.
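
In other words, something like the following sketch (made-up class names, not Octave's actual type hierarchy): the enum constant can be compared as an immediate operand of the compare instruction, while the typeid-based check has to load and compare type_info data at run time.

#include <typeinfo>

// Made-up value hierarchy, just to show the difference.
struct base_value
{
  enum class type_id : int { matrix, scalar, char_matrix };

  virtual ~base_value () = default;
  virtual type_id id () const = 0;
};

struct scalar_value : public base_value
{
  type_id id () const override { return type_id::scalar; }
};

// Before: the check goes through typeid, which loads and compares
// type_info data at run time.
bool is_scalar_slow (const base_value& v)
{
  return typeid (v) == typeid (scalar_value);
}

// After: the enum constant is part of the compare instruction itself
// (an immediate operand), so nothing extra is loaded from data memory.
bool is_scalar_fast (const base_value& v)
{
  return v.id () == base_value::type_id::scalar;
}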

I’m willing to help work through these changes, but would prefer that each is a separate item on the patch or bug tracker. I can insert them all into the patch tracker with a link back to this thread. Then we can address them individually.

I submitted all of the changesets with “interpreter: …” comments to the patch tracker.

I also accidentally submitted one of the “WIP” changesets. Let me know if you would prefer to wait until later for the rest of them. But I think it would be fine to discuss them now and I have them ready to upload.

I skipped the octave_30621.patch file since it was already uploaded for bug #60620.

Sounds great. It might take a while to churn through them, but I believe it will be worth it. There is no rush on my part. We could begin with the non-“WIP” ones and take it slowly, to make sure I didn’t introduce any memory leaks or other problems.

Have you looked at the VM? I would mainly see it as a proof of concept at this point. The main caveat in my view is whether it is possible to implement support for eval () etc. without messing up the execution speed.

I only took a quick look at the VM.

If it is OK with you, I will also upload most of the WIP changesets. It might be good to look at them as well, even if they are not completely finished.

I don’t think it would be slower to generate bytecode and evaluate that vs. evaluating the parse tree recursively. So I wouldn’t worry too much about that.

The last change that removed the string argument from interpreter functions seems to have caused compilation errors for at least one Octave package in MXE Octave, e.g.:
http://buildbot.octave.org:8010/#/builders/21/builds/1364/steps/9/logs/stdio

The displayed tail isn’t long enough to determine which function is causing the error afaict. (Maybe __get_type_info__?)
But should we add deprecated overloads of those functions that still accept that string argument for a couple of versions?

The file interpreter-private.h is not installed and those functions are supposed to be private, so I don’t think we need to deprecate and preserve the old signatures. However, there is a declaration of __get_type_info__ in ov-base.h because it is used in a macro that is also defined in ov-base.h to help with defining new octave_value types. The usage should be consistent with the declaration. But it looks like sparsersb is not using the macros from ov-base.h. Since this problem should only affect __get_type_info__, I can preserve the old signature for that function but I don’t think we need to worry about the rest.

EDIT: For now, I did this: octave: d98abdd15d40

@Petter: I notice an odd repeatable crash with the VM patches. Is this expected for now based on the WIP status?

First, with your patches 31191 and 31192 applied but without calling compile on any function, it works properly:

octave:1> n = 12345; tic, f = factor(n), toc
f =

     3     5   823

Elapsed time is 0.00205708 seconds.

Then I restarted the same patched version of Octave and tried calling compile on factor:

octave:1> compile factor
Compilation failed
VM error, Not done yet 748: Only = assignment supported yet
ans = 0
octave:2> n = 12345; tic; f = factor(n), toc
fatal: caught signal Segmentation fault -- stopping myself...
Segmentation fault (core dumped)

If I’m reading this correctly, the failed compilation to bytecode does not roll back properly but leaves the function in an unusable state until Octave is restarted?

@arungiridhar

Oh ye, the “TODO()” macro just throws an exception without bothering to clean up properly. Those are the “+=” etc. assignment operators that are not done yet. Maybe I should add some user-friendliness for that =)

I ran your hilb benchmark with some small modifications to make it work.

                   | stock 6.2.0 | patched | compiled
Time               | 270s        | 171s    | 113s
Speedup vs stock   | 1x          | 1.57x   | 2.39x
Speedup vs patched |             | 1x      | 1.53x

hilb_bc.m (3.1 KB)

I believe the “==”-operator in

 if (i=="F") # step front

etc. is what slows it down the most. One idea could be to special-case char arrays of length one, so that a logical array does not have to be allocated for each “==”. It is probably quite a common idiom. There are a lot of malloc calls.
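
A rough sketch of that special case, using toy stand-in types rather than Octave's actual char matrix and octave_value classes: when the operand is a 1x1 char array compared against a single character, the result can be a plain bool with no allocation.

#include <cstddef>
#include <vector>

// Toy stand-in for a char matrix; not Octave's charMatrix.
struct char_array
{
  std::vector<char> data;
  bool is_scalar () const { return data.size () == 1; }
};

// General path: allocate an element-wise logical result on every comparison.
std::vector<bool> eq_elementwise (const char_array& a, char rhs)
{
  std::vector<bool> result (a.data.size ());
  for (std::size_t i = 0; i < a.data.size (); i++)
    result[i] = (a.data[i] == rhs);
  return result;
}

// Fast path: a 1x1 char compared to a single char needs no allocation,
// so an `if (i == "F")` condition could branch directly on the bool.
bool eq_scalar (const char_array& a, char rhs)
{
  return a.data[0] == rhs;
}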

The other culprit is probably

orientation = orientation(yawleft)

since there is no memory pool for vectors. Maybe there should be one.