Set operation complexity setdiff vs using find twice to index / complimentary index data

Dear maintainers,

this is a question related to what might look like an unexpected result of inferior efficiency of the setdiff function in finding complimentry set of indices of a vector given one set, compared to using find function twice.
While setdiff seems to be improving efficiency compared to “2 * find()” for relatively small sets of data up to ~1000, it quickly looses comparative efficiency as data sets get larger.

It seems counterintuitive. Manipulating ordered sets compared to searching random vector - granted find function most likely got quite a bit more attention from developers.

Got an impression from a recent python refresh class that set operations were highly efficient granted memory penalty. Well Octave is not python, but algorythms should be similar ?

In a recent excercise for an ML classification problem (result of the model is either 0 = failure or 1 = success) I came up with an idea of using setdiff instead of using find function twice. The results were discouraging. Scalability of ML models is a serious consideration, so I hope this question would not be considered to be irrelevant.

Comments / clarifications / explanations of errors / misunderstanding on my part would be greatly appreciated. Formal explanation of comparative algorythm complexity of O(find()) vs O(setdiff) woud be awesome as well.

Thank you.

Demo code: ===== (edit: for binary selection complimentary index is not required - initialize answer with all 0, and find index for all 1’s and replace … what about multi-classifiers ? thinking … :nerd_face:

% compare two finds and find and setdifff

order_size = 8; % 8 max size of the vector to search to finish in resonable time
size_diff = [order_size,2]; % storage of difference in performance for two methods

for i = 1:order_size
tmp = rand(10^i,1); % generate set of random numbers between 0 and 1
full_index = 1:length(tmp); % to avoid creation inside setdiff and associated time penalty

    index = find(tmp < 0.5);
    compl_index = find(tmp >=0.5);
    find_twice = toc();

    index = find(tmp < 0.5);
    compl_index = setdiff(full_index, index); % find complimentary index
    find_n_diff = toc();

% how much slower
find_twice_is_faster = find_n_diff / find_twice;

size_diff(i,1) = 10^i; % size of the vector to search
size_diff(i,2) = find_twice_is_faster; % how many times is setdiff slower


semilogx(size_diff(:,1),size_diff(:,2), ‘r-o’)

I don’t think in most cases you would be calculating the same thing. The function setdiff not only produces the difference of two sets, it also lists unique elements just once, and the values are returned in sorted order. It isn’t clear to me that two calls to find will do that for all input data sets A and B.

As an example,

octave:2> A = [5 4 3 3 1 2 1];
octave:3> B = [4 3];
octave:4> setdiff (A, B)
ans =

   1   2   5

For the specific case you posted, you would probably be better off skipping find altogether. This code will calculate the index and it’s complement, and uses 4X less memory because logical values require only 1 byte of storage rather than double precision indices which use 8 bytes.

    index = tmp < 0.5;
    compl_index = ! index;

3~7 times faster than double call to find() !
Thank you Rik, much appreciated

is there a Matlab equivalent to this syntax, specifically “inversion” of the index with “!”,
and another question - where in Octave documentation is this behaviour described I mean both lines for index and complimentry index?

Thank you
hmm … MATLAB outputs index as the same size vector 0 / 1 with 1 for met conditions, and ~ index in the same format just reversed …
index = tmp < 0.5;
compl_index = ~ index;

gotta be careful using Octave code in Matlab :frowning:

Neah other than ~ vs ! implementation of the indices does not seem to be a reson to be concerned about syntax.

In Matlab this is "~".

~ and ! are boolean ‘not’ operators. e.g., ~= and != are ‘not equal’ operators.


1 Like