Find the equation for these given points

I have a set of n-dimensional data points that I believe lie on a set of parallel hyperplanes. I would like to deduce the equations for those hyperplanes given the points.

For example, here is a set of 30 points in 4-D space:

coords = [
     0.068725      0.89165      0.47873       0.6745
     0.086289      0.77872      0.58564      0.48276
      0.08723      0.91586       0.2942      0.51796
      0.09489      0.77028      0.30046       0.6921
      0.12148      0.15381      0.19016      0.93594
      0.14719       0.8033      0.49138       0.2615
      0.15486       0.7326      0.74647      0.36555
      0.19593     0.060394      0.94216      0.54249
      0.23846      0.51832      0.59953      0.75279
       0.2849      0.10243      0.47861      0.97743
      0.29188       0.4175     0.025716      0.52277
      0.29509      0.39232      0.81321      0.42047
      0.31451      0.79695      0.13754      0.85448
      0.33484      0.44694      0.59357      0.52262
       0.3771      0.55992      0.98326       0.1467
      0.50646      0.73878      0.46019      0.96566
      0.51967        0.885     0.030646      0.63008
      0.52099      0.47966      0.25772      0.13639
       0.5579      0.93632      0.59255      0.59876
      0.62916      0.27282      0.90561      0.50887
       0.6599     0.081988      0.51417      0.48062
      0.68893      0.30211      0.16113      0.87588
      0.71285      0.49865      0.15539      0.86974
      0.75203       0.7089     0.036614      0.40885
      0.75912      0.09958      0.58503      0.87416
      0.81015      0.34089      0.86382      0.87492
      0.81982      0.89656      0.91465      0.18355
      0.90986      0.98749      0.82763     0.070998
      0.91422    0.0015857      0.97149      0.51201
      0.96927      0.85749      0.80153      0.72124
];

corrcoef, cov, etc. do not indicate any correlation between the columns. Even PCA does not reveal any interdependence in the data set, nor does convhulln. But searching exhaustively for hyperplanes like this:

nd = columns(coords);
N = 20;  # some maximum coefficient value

# make a list of possible coefficients: all integer vectors in -N..N per dimension
v = (0:(2*N+1)^nd-1)';
for i = nd:-1:1
  lst(:,i) = mod(v, 2*N+1);
  v = (v - lst(:,i)) / (2*N + 1);
end
lst -= N;
clear i v ans

lst = lst(sum(abs(lst),2) <= N, :);   # keep sum of |coefficients| <= N
lst = lst(lst(:,1) >= 0, :);          # drop sign-flipped duplicates
lst = lst(sum(abs(lst),2) > 0, :);    # drop the all-zero row

# count the distinct values of the projection coords * lst(i,:)' for each candidate
for i = rows(lst):-1:1
  r = round(coords * lst(i,:)' * 1e9) / 1e9;  # round away tiny numerical noise
  count(i) = numel (unique (r));
end
lst = lst(count == min(count),:)      # keep the candidates needing the fewest planes

gives a set of coefficients lst = [7 -1 -4 -1], so multiplying coords with those coefficients gives:

>> unique (coords * lst')'
ans =
  -4  -3  -2  -1   0   1   1   1   3   4   5

Disregarding the repeats due to minor rounding errors, this indicates that the points lie on a set of parallel planes: coords * [7 -1 -4 -1]' takes one of the above integer RHS values for every point.

In other words, the 30 points do not fully span 4-dimensional space, but fall in a set of some 9 or 10 parallel planes.

If the points were random numbers, then they would fail the spectral test, but in my case I already know that the numbers are not random and are very likely falling in planes, except I don’t know their equations.

Question: How to get this family of hyperplanes from given points without an exhaustive search over all coefficient values?

Sounds like an ill-posed problem. Imagine a scenario where the points are distributed perfectly on a grid. Should the algorithm fit planes parallel to the X plane, the Y plane, the Z plane, or the W plane? (dog-wearing-pants meme)

If there are two planes, but the distance between them is 1e-3, should the algorithm fit one plane or two? For some applications 1e-3 would be rounding error; for other applications 1e-3 might be clearly distinct.

Have a look at the Hough transform. But it is subject to the curse of dimensionality, IIRC. (I didn't go through your code, but it may already be mimicking the Hough transform; I'm not sure.)

Another idea that comes to mind is k-means clustering. There are other clustering algorithms as well.

We’re tracking on similar paths. I was already using the kmeans function from the statistics package for some of my trials. It feels like kmeans in a way, but instead of minimizing “distance to the centroid of the nearest cluster”, I want to minimize “perpendicular distance to the nearest hyperplane”. I couldn’t do that with the kmeans function itself but maybe I can supply my own objective function for that distance metric and then use function minimization…
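
To make that concrete, here is a rough sketch of the kind of thing I have in mind, written from scratch rather than through the kmeans function (assumptions: the number of planes k is known in advance, all planes share one normal, the initial assignment is random, and it only converges to a local optimum):

function [w, b, idx] = k_parallel_planes (coords, k)
  # alternate between assigning points to the nearest of k parallel hyperplanes
  # and refitting the shared normal w and the per-plane offsets b
  n = rows (coords);
  idx = randi (k, n, 1);                # random initial plane assignment
  for iter = 1:100
    S = zeros (columns (coords));       # pooled within-cluster scatter matrix
    for j = 1:k
      P = coords(idx == j, :);
      if (! isempty (P))                # empty clusters are simply skipped here
        Pc = P - mean (P, 1);
        S += Pc' * Pc;
      end
    end
    [V, D] = eig (S);
    [~, m] = min (diag (D));
    w = V(:,m);                         # normal = direction of least pooled scatter
    b = accumarray (idx, coords * w, [k 1], @mean);   # per-plane offsets
    [~, newidx] = min (abs (coords * w - b'), [], 2); # reassign to nearest plane
    if (isequal (newidx, idx))
      break;
    end
    idx = newidx;
  end
end

Something like [w, b, idx] = k_parallel_planes (coords, 10) would then return the shared normal w and the offsets b; since k isn't actually known, it would have to be swept over a range and the total perpendicular distance compared.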

My hope was that there would be a statistics technique (like PCA) that would give the dimensionality and alignment of the data. Haven't found that yet. It turns out that spectral tests only give a yes/no answer to whether the data is full-dimensional, not the hyperplanes themselves. E.g. look at the Wikipedia article on RANDU: you can see the spectral test indicating 15 planes, but the planes themselves have to be calculated using math theory.

I have to look at the Hough transform – it looks promising and I had not known that technique.

For some of the smaller instances I tried, I was indeed getting 3 different solutions, meaning three different families of hyperplanes in three directions. In fact, that was useful for the problem because it indicated that there was symmetry to the system that was not being captured by the existing model. They’re not on a grid, but there’s indeed some regularity that needs to be examined.

Rounding is not an issue for me, but I was using round (foo * 1e9) / 1e9 for that reason. I can always tune the rounding precision, or multiply by some number so that I only have integers.

A few more questions

  1. How many 4D points in the intended application? 100? 1000? 1e7?
  2. Related to the grid of points in my above post: Given a point in your data, is it a member of exactly one plane?
    a. Or can a point be part of two (intersecting) planes? I.e. is there only one set of parallel planes, or can there be multiple sets with each set inclined to the others? (this will make the problem really hard, IMO)
    b. Can a point not lie on any of the parallel planes? I.e. does the data contain wild points corrupted by noise?
  3. Is there an upper limit on the number of parallel planes in the data? Is that limit known?

The issue is that principal component analysis primarily seeks a projection of the data onto one dimension which maximizes the variance. If your data is multivariate normally distributed, PCA will find the “longest” factor explaining the data. But this does not hold when the data one is working with is, for example, multimodal.

In this case it might make sense to use Independent Component Analysis (ICA) instead. It is a strategy for finding latent factors which aims to make the latent factors as statistically independent as possible. It makes no assumptions about the distribution of the data.

If you are expecting clusters in your data, PCA is probably not the right tool.
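
As far as I know there is no ready-made ICA function in core Octave or the statistics package, so here is only a rough FastICA-style sketch of what applying ICA to coords could look like (assumptions on my part: tanh nonlinearity, symmetric decorrelation, a fixed iteration count):

X = coords';                           # ICA convention: one observation per column
X = X - mean (X, 2);                   # center
[E, D] = eig (cov (X'));               # eigendecomposition of the covariance
Z = diag (diag (D) .^ -0.5) * E' * X;  # whitened data
W = orth (randn (rows (Z)));           # random orthogonal starting point
for it = 1:200
  G = tanh (W * Z);
  W = (G * Z') / columns (Z) - diag (mean (1 - G.^2, 2)) * W;
  [U, ~, V] = svd (W);                 # symmetric re-orthogonalization
  W = U * V';
end
S = (W * Z)';                          # estimated independent components

Whether one of the recovered directions (the rows of W mapped back through the whitening) actually lines up with a hyperplane normal would still have to be checked, e.g. with the projection counting from the first post.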

Thanks for the replies, @octavecontrib. I agree that PCA is not the best technique for the features I’m extracting. I was using it because it served me well in the past to give the effective dimensionality of data and its silhouette in lower dimensions. It doesn’t work for this for the reasons you describe.

I’ll have between 50 and 400 data points per set. Each point will belong to at least one plane. If there’s no symmetry, then exactly one plane each; otherwise each point will belong to two or maybe three planes. That number is the same for all points in each set. The symmetry gives the directions of the intersecting planes. No noise, no wild points. The upper limit on the number of parallel planes in any one direction is the square root of the number of points, and preferably lower. That is, the fewer planes covering all the points, the more powerful the result.

My exhaustive code above in the first post made use of a result adapted from Marsaglia’s spectral tests: the number of parallel planes will not exceed the sum of the absolute values of the coefficients. That allowed me to do an exhaustive search, but the price is the curse of dimensionality.

I think calling kmeans with a custom distance metric (distance to nearest plane) is the most robust of the lot. Exhaustive search works for the smaller sets.

Can you briefly explain the working of your current code? There might be some ways to avoid “exhaustive” searching.

  1. If having to search “all” directions in N-D space is what you are worried about, you could try searching “one dimension” at a time, as is done in the Gram-Schmidt procedure. This would probably change the complexity of your algorithm from k^N to k*N.
  2. You could “seed” or “bias” your lst with directions constructed from (random) D-tuples of your coords. That way you only search those directions formed by hyperplanes actually passing through points in coords.
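
A rough sketch of what idea 2 could look like, reusing your N and the counting loop from the first post (the number of random trials and the snap-to-integers step are arbitrary choices on my part):

nd = columns (coords);
ntrials = 500;                                  # arbitrary number of random picks
cand = zeros (ntrials, nd);
for i = 1:ntrials
  p = coords(randperm (rows (coords), nd), :);  # nd randomly chosen data points
  nrm = null (p(2:end,:) - p(1,:));             # normal of the hyperplane through them
  if (columns (nrm) == 1)                       # skip degenerate (rank-deficient) picks
    nrm /= max (abs (nrm));
    cand(i,:) = round (N * nrm');               # snap to integer coefficients up to N
  end
end
cand = unique (cand(any (cand, 2), :), "rows");

The counting loop from your first post would then run over cand instead of the full cross-product lst, which grows with the number of trials rather than with (2*N+1)^nd.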

I have not fully understood your code, but this is what I figure the code is doing:
Up to the line lst = lst(sum(abs(lst),2) > 0, :);, the code is generating a huge number of 4D vectors possibly pointing in “all” possible directions. (If elements of lst are “directions”, the code doesn’t seem to be normalizing them to unit length).

Then in the loop, coords * lst(i,:)' takes the dot product of each vector in lst with all the coordinates, i.e. projects the coordinates onto that line. If one of the vectors in lst is normal to the hyperplanes, then it should give the least number of “steps”, with each step corresponding to one of the hyperplanes. The number of “steps” is detected using unique and numel. Finally, the vector in lst which gave the fewest “steps” is selected.

I would assume that, if one didn’t know about the physical process which generated coords, it would require some trial and error to come up with good numbers for N, 1e9 etc.

The equations of the planes are taken to be A * coords(:,1) + B * coords(:,2) + C * coords(:,3) + D * coords(:,4) == some set of RHS values K.

If it is possible to describe all the data points using only some N parallel planes, then there will be N different values of K for the full data set, but much more usefully, abs(A) + abs(B) + abs(C) + abs(D) <= N when that hyperplane is written in standard form with K written as an integer. This last inequality is because of Marsaglia, and comes out of his original work on spectral tests with linear congruential RNGs. (In other words, if a RNG’s outputs fall on such a set of hyperplanes for small coefficient values, then those “random” numbers are grossly non-uniform in that space. Large coefficient values are necessary, though not sufficient, for an RNG to be well-distributed.)

Applying that result backwards, if we are explicitly searching for a set of no more than N parallel hyperplanes to fit all the data points, then we only need to search for coefficient values that are in the range -N to +N, inclusive. Also the sum of absolute values of all coefficients cannot exceed N.
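
As a concrete check against the first example: with N = 20, the winning coefficients [7 -1 -4 -1] satisfy |7| + |-1| + |-4| + |-1| = 13 <= 20, so at most 13 parallel planes in that direction can pass through the unit hypercube, and the RHS values actually observed show that only 9 or 10 of them are occupied.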

The variable lst is first generated as the full cross product of -N to +N in each dimension, then filtered with sum(abs(lst), 2) <= N. The size of lst before filtering is (2*N + 1) ^ (number of dimensions of the data points). To eliminate the case where all the coefficients are negated (which would still give the same equation multiplied by -1 throughout), the first coefficient is required to be non-negative. Then the all-zeros coefficient set is kicked out.

(The above generation can be sped up by e.g. kicking out violations as they occur instead of generating the full cross product and only then filtering. Also we need to consider only those coefficient vectors whose entries have an overall gcd of 1, so a call to gcd would help, but for the small problem size it was OK not to do that.)
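
For reference, that gcd reduction could look something like this (untested sketch; it would drop e.g. [2 -8 -2 0] in favour of the equivalent [1 -4 -1 0]):

g = abs (lst(:,1));
for k = 2:columns (lst)
  g = gcd (g, abs (lst(:,k)));   # running column-wise gcd of each coefficient row
end
lst = lst(g == 1, :);            # keep only rows whose coefficients have gcd 1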

After that, it calculates r = A * coords(:,1) + B * coords(:,2) + C * coords(:,3) + D * coords(:,4), where ABCD is each row of lst. Then it counts how many distinct values of r there are for that row of lst. The rounding granularity of 1e-9 is used here and needs to be tuned to each instance, but any value <= 1e-3 is OK. Because of the Marsaglia logic above, the RHS values will become integers for the optimal coefficients, so being off by 1e-3 won’t hurt.

Finally it takes the lowest count among all rows of lst, and declares the corresponding row to be the coefficients which cover all the data points with the fewest parallel hyperplanes. (It’s possible for the result to have more than one row – those would all be equally good or bad at describing the data.)

If the best of lst still needs large coefficient values, or if the lowest value of count is still high, then I currently report it as “There isn’t a small set of hyperplanes that cover all data points: any such set would need more than N planes”.

Here is an artificial example derived from a 7x7 latin square, carefully constructed to show symmetry.

Take these 49 points in 3-D space.

coords = [
  1 1 1;   1 2 4;   1 3 7;   1 4 3;   1 5 6;   1 6 2;   1 7 5;
  2 1 6;   2 2 2;   2 3 5;   2 4 1;   2 5 4;   2 6 7;   2 7 3;
  3 1 4;   3 2 7;   3 3 3;   3 4 6;   3 5 2;   3 6 5;   3 7 1;
  4 1 2;   4 2 5;   4 3 1;   4 4 4;   4 5 7;   4 6 3;   4 7 6;
  5 1 7;   5 2 3;   5 3 6;   5 4 2;   5 5 5;   5 6 1;   5 7 4;
  6 1 5;   6 2 1;   6 3 4;   6 4 7;   6 5 3;   6 6 6;   6 7 2;
  7 1 3;   7 2 6;   7 3 2;   7 4 5;   7 5 1;   7 6 4;   7 7 7  ];

You can visualize them with plot3 to see that they all lie in some sort of repeating structure. You can see that the projections in the xy, xz and yz planes are all just a 7x7 grid, because it’s derived from a latin square. But it turns out that it needs only 5 parallel planes to cover all 49 points, not 7:

x + 2y - 3z ∈ { -14, -7, 0, 7, 14 }

Those 5 planes have two symmetric rotations:

y + 2z - 3x ∈ { -14, -7, 0, 7, 14 }

and

z + 2x - 3y ∈ { -14, -7, 0, 7, 14 }

The Marsaglia constraint is obeyed: the sum of the absolute value of the coefficients on the LHS is 1+2+3 == 6, so there are no more than 6 parallel planes with those coefficients, and there are only 5 different values on the RHS.
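
A quick check in Octave, with the coefficients listed in x, y, z order:

>> unique (coords * [ 1  2 -3]')'
ans =
  -14   -7    0    7   14
>> unique (coords * [-3  1  2]')'
ans =
  -14   -7    0    7   14
>> unique (coords * [ 2 -3  1]')'
ans =
  -14   -7    0    7   14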

It seems difficult to generalize this kind of geometric analysis beyond 3-D though, at least without explicit numerical minimization somewhere (either in the coefficient space or in kmeans). I haven’t understood your Gram-Schmidt analogue yet, but I’ll keep at it.