OpenBLAS and number of threads

This is not exactly an issue, but I would like to discuss OpenBLAS and the number of threads it uses. The main goal is to benchmark matrix-matrix multiplication using the command

>> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc

The machine is a Core i3 3110M (2 cores, 4 threads) running Debian 10, with Octave 6.1.0 compiled from source.

OpenBLAS 0.3.10 was compiled from source.

If Octave is started without any arguments, I get this:

octave:1> version -blas
ans = OpenBLAS (config: OpenBLAS 0.3.10 NO_AFFINITY SANDYBRIDGE MAX_THREADS=4)

So it seems (am I wrong?) that 4 threads are being used.

First result:

octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.865209 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.856297 seconds.
octave:4> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.844332 seconds.

However, starting Octave with 1 thread gives me almost the same results:

$ OMP_NUM_THREADS=1 octave --no-gui

octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.915153 seconds.
octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.91491 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.899068 seconds.

But starting with 2 threads roughly halves the time:

$ OMP_NUM_THREADS=2 octave --no-gui

octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.477346 seconds.
octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.470379 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.467869 seconds.

And with 4 threads:

$ OMP_NUM_THREADS=4 octave --no-gui

octave:1> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.927332 seconds.
octave:2> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.90404 seconds.
octave:3> A=randn(2000);B=A;C=zeros(size(A));tic;C=A*B;toc
Elapsed time is 0.961958 seconds.
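
To make the four runs above easier to compare, here is a small sketch that averages the three timings of each run and converts them to an approximate throughput. It assumes the usual 2*n^3 floating-point operation count for an n-by-n matrix product, and `summarize` is just a helper name made up for this example.

```shell
#!/bin/sh
# Read one timing (seconds) per line on stdin and print the mean time and
# the approximate GFLOP/s for an n-by-n product.
# Assumption: a dense n-by-n matrix product costs about 2*n^3 flops.
summarize() {
    awk -v n="$1" '{ s += $1 } END { t = s / NR; printf "%.3f s, %.1f GFLOP/s\n", t, 2 * n * n * n / t / 1e9 }'
}

# Timings copied from the runs above (n = 2000).
t0=$(printf '%s\n' 0.865209 0.856297 0.844332 | summarize 2000)   # no OMP_NUM_THREADS set
t1=$(printf '%s\n' 0.915153 0.91491 0.899068 | summarize 2000)    # OMP_NUM_THREADS=1
t2=$(printf '%s\n' 0.477346 0.470379 0.467869 | summarize 2000)   # OMP_NUM_THREADS=2
t4=$(printf '%s\n' 0.927332 0.90404 0.961958 | summarize 2000)    # OMP_NUM_THREADS=4

echo "default:   $t0"
echo "1 thread:  $t1"
echo "2 threads: $t2"
echo "4 threads: $t4"
```

With these numbers the 2-thread run is about 1.9 times faster than the 1-thread run, while the default and 4-thread runs land close to the 1-thread figure.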

This was a very simple benchmark, but I would have expected the best results with 4 threads.

Do you think these results are correct, i.e., that 2 threads really is the optimum here?

Kind regards,
Leonardo

It looks like hyperthreading might actually degrade the performance of OpenBLAS. Others seem to have observed poor performance with hyperthreading as well.
E.g.: Performance issue with many cores · Issue #1881 · xianyi/OpenBLAS (github.com)
Or: speed.pdf (wfu.edu)
For best performance, it is probably best to set the number of OpenBLAS threads to the number of physical cores.
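
One possible way to follow that advice on Linux is sketched below. It assumes `lscpu` from util-linux is available, and `count_physical_cores` is a hypothetical helper name. The sample input is illustrative (shaped like a 2-core, 4-thread CPU), not captured output from the machine in this thread.

```shell
#!/bin/sh
# Count physical cores from `lscpu -p=Core,Socket` style input on stdin.
# Each non-comment line is "core,socket" for one logical CPU. Hyperthreads
# that share a core repeat the same pair, so the number of unique pairs is
# the number of physical cores.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Illustrative input for a 2-core / 4-thread CPU (assumed, not real output).
sample='# Core,Socket
0,0
1,0
0,0
1,0'
cores=$(printf '%s\n' "$sample" | count_physical_cores)
echo "physical cores: $cores"

# On a real machine one might instead run:
#   cores=$(lscpu -p=Core,Socket | count_physical_cores)
#   OPENBLAS_NUM_THREADS=$cores octave --no-gui
```

`OPENBLAS_NUM_THREADS` is OpenBLAS's own runtime variable. The runs earlier in this thread used `OMP_NUM_THREADS`, which evidently also took effect with this build, so either should work here.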

@mmuetzel Thanks for those links.

Indeed, it seems that OpenBLAS works better when limited to the physical cores.

Kind regards,
Leonardo