This is not exactly an issue, but I would like to discuss OpenBLAS and the number of threads it uses. My goal is to benchmark matrix-matrix multiplication using the command:

>> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

The machine is a Core i3-3110M (2 cores, 4 threads) running Debian 10, with Octave 6.1.0 compiled from source.

OpenBLAS 0.3.10 was compiled from source.

If Octave is started without any arguments, I get this:

octave:1> version -blas

ans = OpenBLAS (config: OpenBLAS 0.3.10 NO_AFFINITY SANDYBRIDGE MAX_THREADS=4)

So it seems (am I wrong?) that 4 threads are being used.

First result:

octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.865209 seconds.

octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.856297 seconds.

octave:4> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.844332 seconds.

However, starting Octave with 1 thread gives me almost the same results:

$ OMP_NUM_THREADS=1 octave --no-gui

octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.915153 seconds.

octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.91491 seconds.

octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.899068 seconds.

But starting with 2 threads roughly halves the time:

$ OMP_NUM_THREADS=2 octave --no-gui

octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.477346 seconds.

octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.470379 seconds.

octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.467869 seconds.

And with 4 threads the time goes back up to about the 1-thread level:

$ OMP_NUM_THREADS=4 octave --no-gui

octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.927332 seconds.

octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.90404 seconds.

octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

Elapsed time is 0.961958 seconds.
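To summarize the four sets of timings above, here is a small Octave sketch that averages each set and converts it to GFLOP/s. It assumes the usual operation count of about 2*n^3 for an n-by-n matrix product; the variable names are only illustrative, and the speedup is taken relative to the OMP_NUM_THREADS=1 run.

```octave
% Timings in seconds, copied from the runs above.
% Rows are settings, columns are the three repetitions.
t = [0.865209 0.856297 0.844332;   % default (OMP_NUM_THREADS not set)
     0.915153 0.914910 0.899068;   % OMP_NUM_THREADS=1
     0.477346 0.470379 0.467869;   % OMP_NUM_THREADS=2
     0.927332 0.904040 0.961958];  % OMP_NUM_THREADS=4
labels = {"default", "1 thread", "2 threads", "4 threads"};

n = 2000;
flops = 2 * n^3;                  % assumed operation count for C = A*B

t_mean  = mean(t, 2);             % average time per setting
gflops  = flops ./ t_mean / 1e9;  % achieved rate
speedup = t_mean(2) ./ t_mean;    % relative to the 1-thread run

for k = 1:numel(labels)
  printf("%-10s  %.3f s  %5.1f GFLOP/s  speedup %.2f\n", ...
         labels{k}, t_mean(k), gflops(k), speedup(k));
end
```

With these numbers the 2-thread run comes out about 1.9 times faster than the 1-thread run, while the default and 4-thread runs are within a few percent of the 1-thread time.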

This is a very simple benchmark, but IMHO the best results should have come with 4 threads.
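A slightly more careful variant of the same measurement could look like the sketch below. It is not what produced the numbers above; it only adds an untimed warm-up run and several repetitions, so that one-off effects such as thread start-up are less likely to skew the result. The repetition count is an arbitrary choice.

```octave
n = 2000;
reps = 5;                 % arbitrary number of timed repetitions
A = randn(n);
B = A;

C = A * B;                % warm-up run, not timed

t = zeros(reps, 1);
for k = 1:reps
  tic;
  C = A * B;
  t(k) = toc;
end

% getenv returns an empty string if OMP_NUM_THREADS is not set.
printf("OMP_NUM_THREADS=%s  min %.3f s  median %.3f s\n", ...
       getenv("OMP_NUM_THREADS"), min(t), median(t));
```

Saved to a file, this can be run once per thread setting in the same way as the sessions above, for example by starting Octave with OMP_NUM_THREADS=2 and calling the script.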

Do you think these results are correct, i.e., is it expected that 2 threads give the best performance on this machine?

Kind regards,

Leonardo