This is not exactly an issue, but I would like to discuss OpenBLAS and the number of threads it uses. The goal is to benchmark matrix-matrix multiplication using the command
>> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
The machine is a Core i3 3110M (2 cores, 4 threads) running Debian 10, with Octave 6.1.0 compiled from source.
OpenBLAS 0.3.10 was compiled from source.
If Octave is started without any arguments, I get this:
octave:1> version -blas
ans = OpenBLAS (config: OpenBLAS 0.3.10 NO_AFFINITY SANDYBRIDGE MAX_THREADS=4)
So it seems (am I wrong?) that 4 threads are being used by default.
First result:
octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.865209 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.856297 seconds.
octave:4> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.844332 seconds.
However, starting Octave with 1 thread gives me almost the same results:
$ OMP_NUM_THREADS=1 octave --no-gui
octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.915153 seconds.
octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.91491 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.899068 seconds.
But starting with 2 threads roughly halves the time:
$ OMP_NUM_THREADS=2 octave --no-gui
octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.477346 seconds.
octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.470379 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.467869 seconds.
And with 4 threads the time goes back up:
$ OMP_NUM_THREADS=4 octave --no-gui
octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.927332 seconds.
octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.90404 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.961958 seconds.
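To put numbers on the comparison, here is a small sketch that averages the timings reported above and converts them to GFLOP/s. It assumes the usual estimate of 2*n^3 floating-point operations for a dense n-by-n product; the variable names are only illustrative.

```octave
% Timings in seconds, copied from the four sessions above.
t_default = [0.865209 0.856297 0.844332];   % no environment variable set
t1 = [0.915153 0.91491 0.899068];           % OMP_NUM_THREADS=1
t2 = [0.477346 0.470379 0.467869];          % OMP_NUM_THREADS=2
t4 = [0.927332 0.90404 0.961958];           % OMP_NUM_THREADS=4

n = 2000;
flops = 2 * n^3;                 % assumed flop count for one n-by-n product

m = [mean(t_default) mean(t1) mean(t2) mean(t4)];
gflops = flops ./ m / 1e9;       % throughput for each configuration
speedup = mean(t1) ./ m;         % relative to the 1-thread run

labels = {"default", "1 thread", "2 threads", "4 threads"};
for k = 1:numel(m)
  printf("%-10s mean %.3f s  %.1f GFLOP/s  speedup %.2f\n", ...
         labels{k}, m(k), gflops(k), speedup(k));
end
```

By this measure the 2-thread run is roughly 1.9 times faster than the 1-thread run, while the default and 4-thread runs are no faster than 1 thread.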
This was a very simple benchmark, but IMHO the best results should have come with 4 threads.
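A slightly more careful version of the benchmark could look like the sketch below. It does one untimed warm-up multiplication, repeats the timed multiplication several times, and reports the best and median times. The repeat count `nruns` is an arbitrary choice, and the GFLOP/s figure again assumes about 2*n^3 operations per product.

```octave
% Repeated-run benchmark sketch; nruns is an illustrative choice.
n = 2000;                 % same matrix size as in the runs above
nruns = 5;                % hypothetical number of timed repetitions

A = randn(n);
B = randn(n);
C = A * B;                % warm-up run, not timed

t = zeros(1, nruns);
for k = 1:nruns
  tic;
  C = A * B;
  t(k) = toc;             % elapsed seconds for this repetition
end

gflops = 2 * n^3 / min(t) / 1e9;   % throughput based on the best run
printf("best %.3f s, median %.3f s, %.1f GFLOP/s\n", min(t), median(t), gflops);
```

Running this once per thread setting (for example under `OMP_NUM_THREADS=1`, `2` and `4`) should make the comparison less sensitive to one-off fluctuations.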
Do you think these results are correct, i.e., that 2 threads really is the optimum on this machine?
Kind regards,
Leonardo