Dear maintainers,
this is a question related to what might look like an unexpected result of inferior efficiency of the setdiff function in finding complimentry set of indices of a vector given one set, compared to using find function twice.
While setdiff seems to be improving efficiency compared to “2 * find()” for relatively small sets of data up to ~1000, it quickly looses comparative efficiency as data sets get larger.
It seems counterintuitive. Manipulating ordered sets compared to searching random vector - granted find function most likely got quite a bit more attention from developers.
Got an impression from a recent python refresh class that set operations were highly efficient granted memory penalty. Well Octave is not python, but algorythms should be similar ?
In a recent excercise for an ML classification problem (result of the model is either 0 = failure or 1 = success) I came up with an idea of using setdiff instead of using find function twice. The results were discouraging. Scalability of ML models is a serious consideration, so I hope this question would not be considered to be irrelevant.
Comments / clarifications / explanations of errors / misunderstanding on my part would be greatly appreciated. Formal explanation of comparative algorythm complexity of O(find()) vs O(setdiff) woud be awesome as well.
Thank you.
Demo code: ===== (edit: for binary selection complimentary index is not required - initialize answer with all 0, and find index for all 1’s and replace … what about multi-classifiers ? thinking …
% compare two finds and find and setdifff
order_size = 8; % 8 max size of the vector to search to finish in resonable time
size_diff = [order_size,2]; % storage of difference in performance for two methods
for i = 1:order_size
tmp = rand(10^i,1); % generate set of random numbers between 0 and 1
full_index = 1:length(tmp); % to avoid creation inside setdiff and associated time penalty
tic();
index = find(tmp < 0.5);
compl_index = find(tmp >=0.5);
find_twice = toc();
tic();
index = find(tmp < 0.5);
compl_index = setdiff(full_index, index); % find complimentary index
find_n_diff = toc();
% how much slower
find_twice_is_faster = find_n_diff / find_twice;
size_diff(i,1) = 10^i; % size of the vector to search
size_diff(i,2) = find_twice_is_faster; % how many times is setdiff slower
end
semilogx(size_diff(:,1),size_diff(:,2), ‘r-o’)