FFTW( "threads", 3 ) give best performance for 28 cores machine?

Hi all,
Using fft right out of the box on my machine (28 physical cores), the FFT is extremely slow and keeps the CPU very busy. If I type fftw("threads") I get 56 by default.
After testing, FFTW("threads", 3) is the best setting and is 250x FASTER !!! with really low CPU usage. I had to set fftw("threads") in the main thread and in every thread I spawn.
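
For context, here is a rough sketch of what I mean (illustrative only; I use the parallel package's parcellfun here as a stand-in for how I actually spawn workers, and the data sizes are placeholders):

pkg load parallel                 # assumption: workers are spawned with parcellfun
fftw ("threads", 3);              # set in the main interpreter

function y = do_fft (x)
  fftw ("threads", 3);            # each worker process has its own FFTW state, so set it again
  y = fft (x);
endfunction

chunks = {randn(251, 4096), randn(251, 4096)};    # placeholder data
results = parcellfun (2, @do_fft, chunks, "UniformOutput", false);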

Can someone explain why that is?
If that gives the best performance, how come my CPU cores are not fully used?
Are there other settings I can use to make fft even faster?

Thank you !

My system

  • Mac OS 12.4
  • Octave 6.2.2 app

On Linux with a 16-core/32-thread Ryzen:

octave:1> a=randn(5000);
octave:2> fftw ("threads", 1)
octave:3> tic; ifft(fft(a)); toc
Elapsed time is 0.541074 seconds.
octave:4> fftw ("threads", 2)
octave:5> tic; ifft(fft(a)); toc
Elapsed time is 0.462468 seconds.
octave:6> fftw ("threads", 3)
octave:7> tic; ifft(fft(a)); toc
Elapsed time is 0.432886 seconds.
octave:8> fftw ("threads", 4)
octave:9> tic; ifft(fft(a)); toc
Elapsed time is 0.420183 seconds.
octave:10> fftw ("threads", 8)
octave:11> tic; ifft(fft(a)); toc
Elapsed time is 0.40262 seconds.
octave:12> fftw ("threads", 16)
octave:13> tic; ifft(fft(a)); toc
Elapsed time is 0.40099 seconds.
octave:14> fftw ("threads", 32)
octave:15> tic; ifft(fft(a)); toc
Elapsed time is 0.401687 seconds.

It is detrimental to spawn more threads than the number of actual CPU cores, but nothing too drastic.

Can you try with a 2D array as input? a = rand(251,1E5)

On Octave 7.1.0 (my app is busy):

>> fftw("threads",1)
>> a=randn(251,320^2);
>> tic; t=fft(a,251); toc
Elapsed time is 1.07847 seconds.
>> fftw("threads",2)
>> tic; t=fft(a,251); toc
Elapsed time is 0.640221 seconds.
>> fftw("threads",3)
>> tic; t=fft(a,251); toc
Elapsed time is 0.375045 seconds.
>> fftw("threads",4)
>> tic; t=fft(a,251); toc
Elapsed time is 15.6041 seconds.
>> fftw("threads",5)
>> tic; t=fft(a,251); toc
Elapsed time is 12.7182 seconds.

From the get-go, without setting fftw:

fftw("threads")
ans = 56
a=randn(251,320^2);
tic; t=fft(a,251); toc
Elapsed time is 272.163 seconds.

I can see similar behavior.
Changing the fft sample size to 256 speeds things up quite a bit
(it gets padded with zeros), i.e. do fft(a, 256).
If you do not want to do that, you may try to
make a "wisdom" file for your particular problem.
In any case, this is an FFTW peculiarity rather than Octave's.
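
For example, a rough sketch of generating and reusing wisdom from within Octave (the planner level and file name are arbitrary choices, and "patient" planning for a transform this size may itself take a while):

fftw ("planner", "patient");           # let FFTW search harder for a good plan
a = randn (251, 320^2);
fft (a, 251);                          # first call plans (slow); later calls reuse the plan

w = fftw ("dwisdom");                  # export the accumulated double-precision wisdom
fid = fopen ("fftw_wisdom.txt", "w");  # arbitrary file name
fputs (fid, w);
fclose (fid);

# In a later session, import it before the first transform:
w = fileread ("fftw_wisdom.txt");
fftw ("dwisdom", w);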

p.s.:

octave> tic; t=fft(a,251); toc
Elapsed time is 10.8668 seconds.
octave> whos t
Variables visible from the current scope:

variables in scope: top scope

  Attr Name        Size                     Bytes  Class
  ==== ====        ====                     =====  =====
     c t         251x102400            411238400  double

octave> tic; t=fft(a,256); toc
Elapsed time is 0.26807 seconds.
octave> whos t
Variables visible from the current scope:

variables in scope: top scope

  Attr Name        Size                     Bytes  Class
  ==== ====        ====                     =====  =====
     c t         256x102400            419430400  double

Total is 26214400 elements using 419430400 bytes

fftw("threads",3)
a=randn(256,320^2);
tic; t=fft(a,251); toc
Elapsed time is 0.683871 seconds.

a=randn(251,320^2);
tic; t=fft(a,251); toc
Elapsed time is 0.467405 seconds.

a=randn(256,320^2);
tic; t=fft(a,256); toc
Elapsed time is 0.242659 seconds.

a=randn(251,320^2);
tic; t=fft(a,256); toc
Elapsed time is 0.275021 seconds.

I meant for 4 threads.
In general fft performs best when the sample size is 2^N; worst if it is a prime number.
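
If zero-padding is acceptable for the application, one way to avoid the prime-size penalty without picking the padding by hand is to round the transform length up to the next power of two (a sketch):

a = randn (251, 320^2);
n = 2 ^ nextpow2 (rows (a));   # 251 -> 256
tic; t = fft (a, n); toc       # each column is zero-padded to length n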

I understand. But still, the default fftw("threads") value gives very bad performance unless the user sets it to a low value…

See also this section from the documentation of the FFTW library on multi-threading:
How Many Threads to Use? (FFTW 3.3.10)
It isn't very extensive, but I'm quoting it here in case that link goes stale:

There is a fair amount of overhead involved in synchronizing threads, so the optimal number of threads to use depends upon the size of the transform as well as on the number of processors you have.

As a general rule, you don’t want to use more threads than you have processors. (Using more threads will work, but there will be extra overhead with no benefit.) In fact, if the problem size is too small, you may want to use fewer threads than you have processors.

You will have to experiment with your system to see what level of parallelization is best for your problem size. Typically, the problem will have to involve at least a few thousand data points before threads become beneficial. If you plan with FFTW_PATIENT, it will automatically disable threads for sizes that don’t benefit from parallelization.

IIUC, that means that the “optimum” number of threads depends on the problem size.

As far as I can tell, Octave uses gnulib’s num_processors (NPROC_CURRENT_OVERRIDABLE) as the default number of threads. IIUC, that is the number of processors (or the value of the environment variable OMP_NUM_THREADS if it is set to an integer).
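
If I'm not mistaken, all three values can be inspected from within Octave:

nproc ("overridable")        # number of processors, or OMP_NUM_THREADS if that is set
getenv ("OMP_NUM_THREADS")   # empty if the variable is not set
fftw ("threads")             # the thread count Octave actually handed to FFTW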

On a hyperthreaded machine it returns the number of hardware threads, e.g. on my 16-core Ryzen it returns "32".

Maybe we should default to one. It looks to me that with modern CPUs (N >= 8) the current default is more often
wrong than right. People who care would investigate and set it appropriately.

I tested with the following script on an AMD Ryzen 7 5800X 8-Core Processor with 16 threads (on Windows 11):

a = randn (251, 320^2);
nProc = 16;
timing = zeros (1, nProc);
for nThreads = 1:nProc
  fprintf('setting number of threads for FFTW to %d\n', nThreads);
  fftw ("threads", nThreads)
  nAverages = 10;
  time_iter = zeros (1, nAverages);
  for ii = 1:nAverages
    tic;
    fft (a, 256);
    time_iter(ii) = toc;
  endfor
  timing(nThreads) = mean (time_iter);
  fprintf('average time: %.3f s\n', timing(nThreads));
end

plot(1:nProc, timing);
grid on
xlabel('number of threads');
ylabel('duration in s');

With it, I see these timings:

setting number of threads for FFTW to 1
average time: 0.198 s
setting number of threads for FFTW to 2
average time: 0.179 s
setting number of threads for FFTW to 3
average time: 0.173 s
setting number of threads for FFTW to 4
average time: 0.172 s
setting number of threads for FFTW to 5
average time: 0.170 s
setting number of threads for FFTW to 6
average time: 0.169 s
setting number of threads for FFTW to 7
average time: 0.170 s
setting number of threads for FFTW to 8
average time: 0.170 s
setting number of threads for FFTW to 9
average time: 0.171 s
setting number of threads for FFTW to 10
average time: 0.171 s
setting number of threads for FFTW to 11
average time: 0.171 s
setting number of threads for FFTW to 12
average time: 0.172 s
setting number of threads for FFTW to 13
average time: 0.173 s
setting number of threads for FFTW to 14
average time: 0.173 s
setting number of threads for FFTW to 15
average time: 0.173 s
setting number of threads for FFTW to 16
average time: 0.174 s

[plot: average fft time in seconds vs. number of FFTW threads]

There isn't much variation across the board, but higher thread counts seem to be faster than one or two threads.
Setting the default to 1 would not be a good idea, at least for this system…

Going from 1 to 16 improves time by ~15%
Change fft(a,256) to fft(a,251).

That'll take veeery long. But it is arguably "bad" user input…

Edit:

a = randn (251, 320^2);
nProc = 16;
timing = zeros (1, nProc);
for nThreads = 1:nProc
  fprintf('setting number of threads for FFTW to %d\n', nThreads);
  fftw ("threads", nThreads)
  nAverages = 1;
  time_iter = zeros (1, nAverages);
  for ii = 1:nAverages
    tic;
    fft (a, 251);
    time_iter(ii) = toc;
  endfor
  timing(nThreads) = mean (time_iter);
  fprintf('average time: %.3f s\n', timing(nThreads));
end

plot(1:nProc, timing);
grid on
xlabel('number of threads');
ylabel('duration in s');

setting number of threads for FFTW to 1
average time: 0.411 s
setting number of threads for FFTW to 2
average time: 0.261 s
setting number of threads for FFTW to 3
average time: 0.212 s
setting number of threads for FFTW to 4
average time: 13.315 s
setting number of threads for FFTW to 5
average time: 8.799 s
setting number of threads for FFTW to 6
average time: 25.128 s
setting number of threads for FFTW to 7
average time: 24.879 s
setting number of threads for FFTW to 8
average time: 25.025 s
setting number of threads for FFTW to 9
average time: 26.246 s
setting number of threads for FFTW to 10
average time: 25.623 s
setting number of threads for FFTW to 11
average time: 56.530 s
setting number of threads for FFTW to 12
average time: 54.886 s
setting number of threads for FFTW to 13
average time: 56.659 s
setting number of threads for FFTW to 14
average time: 58.205 s
setting number of threads for FFTW to 15
average time: 57.971 s
setting number of threads for FFTW to 16
average time: 83.776 s

[plot: average fft time in seconds vs. number of FFTW threads, for fft(a,251)]

That is my point. I think we want to minimize the really bad cases.

You're right. For "good" input (a power of 2), it hardly makes a difference. For "bad" input (a prime), using more threads seems to be detrimental.

Maybe setting the number of threads to min(2, nproc) or min(3, nproc) would be a good default value?

I tried a few bad cases, like large_prime x 1 (so it cannot parallelize across the 2nd dimension) or small_prime x large_number, and in all cases 3 threads were better than 1. In fact, I could not construct a case as bad as the original one. So yes, min(3, nproc) sounds like a good idea to me.
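
Until such a default lands, a user-side workaround (a sketch) is to cap the thread count in ~/.octaverc so every session starts that way:

# in ~/.octaverc: cap FFTW's thread count at 3 (the value that behaved well above)
fftw ("threads", min (3, nproc ("overridable")));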

The question is, why is there such a jump in duration from 3 to 4 threads?

This seems worth implementing, and not that hard. Do you want to file a bug report about this, or just implement it? Documentation for fftw should be updated with an explanation of the default.

Second question: shouldn't nproc return the number of processors and not the number of threads? The name of the function and its documentation ("Return the current number of available processors.") say that the function returns the number of processors, not the number of threads.

EDIT (8/10/22): The nproc command at the shell also returns the number of threads available on my PC, so I guess this is expected behavior.