Great question! Your intuition was good. There is indeed a strong relationship between methods of linear regression and metrics for measuring similarity between a line and a set of points.
The correspondence between linear regression and similarity metrics. Here is the relationship. Let $d(\ell,S)$ be a similarity metric that measures how far a line $\ell$ is from a set $S$ of points (smaller means a closer fit). Then linear regression is basically the following problem: given a set $S$ of points, find the line $\ell$ that minimizes $d(\ell,S)$. Consequently, each similarity metric gives rise to a different flavor of linear regression. And, dually, from any particular flavor of linear regression, you can extract the corresponding similarity metric that the regression method is minimizing.
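To make this correspondence concrete, here is a minimal Python sketch; the `fit_line` helper and the $(m,b)$ slope-intercept parameterization are my own illustration, not a standard API:

```python
import numpy as np
from scipy.optimize import minimize

def fit_line(d, xs, ys, init=(1.0, 0.0)):
    """Generic regression: find the (slope, intercept) minimizing d(l, S)."""
    result = minimize(lambda params: d(params, xs, ys), x0=np.asarray(init))
    return result.x

# Any metric d(params, xs, ys) -> float plugs in here; each of the examples
# below (vertical, horizontal, orthogonal distance) is one such choice.
```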
Notation: if $\ell$ is a line, I will write $\ell(x)$ for the $y$-coordinate on the line corresponding to $x$. In other words, I'll treat $\ell$ as a function of $x$.
Example 1. Ordinary least squares regression corresponds to the distance measure you mention: namely, $d(\ell,S) = \sum_{(x,y) \in S} (\ell(x) - y)^2$. Thus, for each point in $S$, we consider the vertical distance from the point to the line $\ell$, square it, and take the sum of these squared residuals. Pictorially, we're summing the squares of the lengths of the orange lines in this picture:

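In code, this metric could look like the following (a minimal sketch: the name `d_vertical` and the toy data are my own):

```python
import numpy as np

def d_vertical(params, xs, ys):
    # Example 1's metric: sum over (x, y) in S of (l(x) - y)^2.
    m, b = params
    return np.sum((m * xs + b - ys) ** 2)

# Sanity check on toy data: np.polyfit solves ordinary least squares in
# closed form, so its answer should minimize d_vertical.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.2, 2.8])
m, b = np.polyfit(xs, ys, 1)
print(m, b, d_vertical((m, b), xs, ys))
```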
Example 2. If you did ordinary least squares regression, but with the roles of $x$ and $y$ swapped, then that would correspond to a slightly different similarity metric: namely, $d(\ell,S) = \sum_{(x,y) \in S} (\ell^{-1}(y) - x)^2$. This corresponds to taking the horizontal distance from each point to the line $\ell$, squaring it, and summing these squared distances. We're summing the squared lengths of the orange lines in this picture:

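Here is the corresponding sketch (my own illustration; inverting the line assumes it is not horizontal, so that $\ell^{-1}$ exists):

```python
import numpy as np

def d_horizontal(params, xs, ys):
    # Example 2's metric: sum of (l^{-1}(y) - x)^2, where the inverse of
    # y = m*x + b is l^{-1}(y) = (y - b) / m (requires m != 0).
    m, b = params
    return np.sum(((ys - b) / m - xs) ** 2)

# Equivalently: regress x on y with ordinary least squares, then invert
# the fitted line back to y = m*x + b form.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.2, 2.8])
m_inv, b_inv = np.polyfit(ys, xs, 1)   # fits x = m_inv * y + b_inv
m, b = 1.0 / m_inv, -b_inv / m_inv     # invert to y = m*x + b
print(m, b, d_horizontal((m, b), xs, ys))
```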
Example 3. If you did principal components analysis (PCA), also known as orthogonal regression or total least squares (TLS), that would correspond to yet another distance measure: the orthogonal distance from each point to the line $\ell$, squared and summed. Now we're summing the squared lengths of the orange lines in this picture:

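Here is a sketch of both the metric and the PCA/SVD construction (my own illustration; it assumes the best-fit line is not vertical):

```python
import numpy as np

def d_orthogonal(params, xs, ys):
    # Example 3's metric: the squared perpendicular distance from a point
    # (x, y) to the line y = m*x + b is (m*x + b - y)^2 / (m^2 + 1).
    m, b = params
    return np.sum((m * xs + b - ys) ** 2) / (m ** 2 + 1)

# Total least squares via PCA: the line passes through the centroid of the
# data, in the direction of the top principal component.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.2, 2.8])
X = np.column_stack([xs, ys])
_, _, vt = np.linalg.svd(X - X.mean(axis=0))  # principal directions
dx, dy = vt[0]                    # top principal component
m = dy / dx                       # slope (assumes a non-vertical line)
b = ys.mean() - m * xs.mean()     # force the line through the centroid
print(m, b, d_orthogonal((m, b), xs, ys))
```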
Summary so far. There are many possible similarity metrics. Which one you choose will depend upon the nature of your application and your model for the data.
How to choose a similarity metric. OK, so there are many possible similarity metrics. How should you choose which one to use? As already suggested, to answer this, you will need to have some model or hypothesis for how your data were generated.
Ordinary least squares starts from the following model: it assumes that, for each point $(x_i,y_i)$, $y_i$ was generated as a probabilistic (stochastic) function of $x_i$. It assumes the process went something like this: there's some line $\ell$ (not known to us a priori); given the input $x_i$, we first calculate $\ell(x_i)$ (the $y$-coordinate on the line corresponding to $x_i$), then we randomly generate a noise/error value $e_i$ from a normal distribution with mean 0 and standard deviation $\sigma$ (where $\sigma$ is not known a priori), and finally we set $y_i = \ell(x_i) + e_i$. Since the $e_i$ are assumed to be i.i.d. Gaussian, a reasonable measure of how closely a hypothesized line $\ell'$ fits this model is the sum of squares of the inferred $e_i$ values. For the correct line $\ell$, we have $d(\ell, S) = \sum_i e_i^2$, which has expectation $n \sigma^2$. For an incorrect line $\ell'$, we have $d(\ell', S) = \sum_i (e_i + \ell(x_i) - \ell'(x_i))^2$, which has expectation $n \sigma^2 + \sum_i (\ell'(x_i)-\ell(x_i))^2$ (the cross terms vanish because $\mathbb{E}[e_i] = 0$), which you might notice is larger than the expectation for $d(\ell, S)$. In other words, starting from a particular stochastic model for how the data were generated, OLS uses a particular distance measure that is well-suited to testing how closely the observed data fit the presumed model.
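Here is a quick simulation of that argument (my own illustration; the true line, $\sigma$, and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.5
m_true, b_true = 2.0, 1.0                                   # the unknown line l
xs = rng.uniform(0.0, 1.0, size=n)
ys = m_true * xs + b_true + rng.normal(0.0, sigma, size=n)  # y_i = l(x_i) + e_i

def d(m, b):
    return np.sum((m * xs + b - ys) ** 2)

# d at the true line concentrates near n * sigma^2, while a wrong line
# l'(x) = (m_true + 0.5) * x + b_true pays the extra sum_i (0.5 * x_i)^2.
print(d(m_true, b_true), n * sigma ** 2)
print(d(m_true + 0.5, b_true), n * sigma ** 2 + np.sum((0.5 * xs) ** 2))
```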
Of course, this is not the only plausible model for how your points might have been randomly generated (as a function of the line $\ell$). Given a different process, we might end up with a different similarity metric.
Hypothesis tests. Finally, let me mention one more topic that you didn't bring up, but could have. Say you have a stochastic model for how your points were randomly generated, with the line $\ell$ as the parameter of this random process. Then another interesting question we could ask is: given the observed set of points $S$, test the hypothesis that $S$ came from the aforementioned random process with parameter $\ell$. One could work out a $p$-value for the hypothesis that $S$ was generated according to that distribution. If the $p$-value is very small, you might reject the hypothesis that the points were generated from this particular random process with parameter $\ell$ (roughly speaking, this would amount to saying that the points $S$ don't seem to correspond to the line $\ell$). If the $p$-value is not too small, you might proceed as though the hypothesis is potentially true (roughly speaking, this would amount to saying that the points $S$ are consistent with the hypothesis that they came from your random model with line $\ell$). So, hypothesis tests might be another way to, very roughly speaking, measure whether the points seem to be consistent with any particular line.
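For concreteness, here is one possible such test under the Gaussian model above (a sketch of my own, not a standard recipe; it assumes $\sigma$ is known, in which case the scaled residual sum of squares is $\chi^2_n$ under the null hypothesis):

```python
import numpy as np
from scipy import stats

def p_value_for_line(m, b, xs, ys, sigma):
    # H0: each y_i = m*x_i + b + e_i with e_i ~ N(0, sigma^2). Then
    # sum_i e_i^2 / sigma^2 is chi-squared with n degrees of freedom,
    # so a one-sided tail probability serves as the p-value.
    residuals = ys - (m * xs + b)
    statistic = np.sum(residuals ** 2) / sigma ** 2
    return stats.chi2.sf(statistic, df=len(xs))

rng = np.random.default_rng(1)
xs = rng.uniform(0.0, 1.0, size=100)
ys = 2.0 * xs + 1.0 + rng.normal(0.0, 0.5, size=100)
print(p_value_for_line(2.0, 1.0, xs, ys, sigma=0.5))  # consistent: not small
print(p_value_for_line(5.0, 0.0, xs, ys, sigma=0.5))  # wrong line: tiny p
```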
Credits. I gratefully acknowledge @JD Long for these excellent pictures. See also his answer explaining different forms of regression for more insight.