I've got a set of points where 5-20% of them lie almost perfectly on a line (say within ±0.1% of their value), but the other 80-95% of the points are random noise. How can I find a line (slope and y-intercept) that best models the 5-20% signal underneath the noise?
Here's a picture that shows the gist of what I'm talking about:
I'm guaranteed that the true signal is going to have more points lined up in a linear fashion than the random noise might accidentally have.
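(For context, a toy dataset with roughly this shape can be generated as below; the slope, intercept, ranges, and proportions are placeholders, not my actual data.)

    import numpy as np

    def make_toy_data(n=1_000_000, signal_frac=0.1, slope=2.0, intercept=5.0, seed=0):
        # signal_frac of the points sit within +-0.1% of a line; the rest are uniform noise.
        rng = np.random.default_rng(seed)
        n_signal = int(n * signal_frac)
        x_sig = rng.uniform(0.0, 100.0, n_signal)
        y_sig = (slope * x_sig + intercept) * rng.uniform(0.999, 1.001, n_signal)
        x_noise = rng.uniform(0.0, 100.0, n - n_signal)
        y_noise = rng.uniform(y_sig.min(), y_sig.max(), n - n_signal)
        perm = rng.permutation(n)
        return np.concatenate([x_sig, x_noise])[perm], np.concatenate([y_sig, y_noise])[perm]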
My initial strategy was to perform a linear regression on the full set of points, remove the 20% of points farthest from the resulting line, and repeat that two-step process (fit, trim) until the standard deviation of the distances from the line dropped below a threshold. This worked about 70% of the time, but in the remaining cases the initial regression was dominated by the noise, so most of the signal got removed and the process failed.
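Sketched out, that loop looks something like the following (the trim fraction and stopping threshold are placeholders, and distances are measured vertically rather than perpendicular to the line):

    import numpy as np

    def iterative_trim_fit(x, y, trim_frac=0.2, std_threshold=0.5):
        # Repeatedly fit a least-squares line, drop the points farthest from it,
        # and stop once the residual standard deviation falls below a threshold.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        while True:
            slope, intercept = np.polyfit(x, y, 1)            # ordinary least squares
            residuals = np.abs(y - (slope * x + intercept))   # vertical distances
            if residuals.std() < std_threshold or len(x) < 3:
                return slope, intercept
            keep = residuals.argsort()[: int(len(x) * (1 - trim_frac))]  # keep the closest points
            x, y = x[keep], y[keep]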
I'm working in Python, but I'm really looking for a strategy more than complete code. That said, if you have some specific code in mind, here's the setup:
    def find_best_line(x_values, y_values):
        assert len(x_values) == len(y_values)
        # Do something.
        return slope, y_intercept
I need this solution to scale decently (runtime < 30 min) with sets as large as 50 million points.
That's why I discarded my first idea, which was to fit a line through every possible pair of points and compute a metric essentially proportional to how many points were very close to that line.
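Roughly, that discarded idea would have looked like the sketch below (tol is a placeholder for the closeness threshold); with n points it means O(n^2) candidate lines and an O(n) inlier count for each, which is hopeless at 50 million points:

    import numpy as np
    from itertools import combinations

    def brute_force_pairs(x, y, tol=0.001):
        # Try the line through every pair of points and keep the one that has
        # the most points within a relative tolerance. O(n^3) overall -- far too slow.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        best = (0, 0.0, 0.0)  # (inlier count, slope, intercept)
        for i, j in combinations(range(len(x)), 2):
            if x[i] == x[j]:
                continue  # skip vertical candidate lines
            slope = (y[j] - y[i]) / (x[j] - x[i])
            intercept = y[i] - slope * x[i]
            inliers = np.count_nonzero(np.abs(y - (slope * x + intercept)) <= tol * np.abs(y))
            if inliers > best[0]:
                best = (inliers, slope, intercept)
        return best[1], best[2]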
Here is an example dataset the algorithm would need to work on: download dataset