
I have a set of data points $(x, y)$ where $y = f(x)$. My goal is to fit the function $f$ using OLS. The choice of a quadratic $f$ is based on domain knowledge. The independent variable $x$ exhibits clustering, i.e., there are many observations in one particular region and relatively few elsewhere (a plot of $y$ versus $x$ is shown below).

I am looking to address this clustering because I think it is biasing the results. One way to solve this problem might be robust regression, since I think it is a weighting problem: in other words, the clustered data points are weighted more than they should be. I am curious whether there are best practices for addressing this problem and what I should be aware of.

[Scatter plot of $y$ versus $x$, showing a dense cluster of observations in one region of $x$ and sparse observations elsewhere.]

gbh.

2 Answers


The distribution of $X$ is not an assumption in OLS regression. Thus the 'clustering' in $X$ is not necessarily a problem.

In addition, data tend to have less leverage* over the fitted regression line the closer they are to the mean of $X$. I note that the high density cluster is roughly where the mean would have been if $X$ had been uniformly distributed. Therefore, if you are worried that the cluster will have too much influence, I suspect those worries are misplaced and no weighting need be done.
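To check this concern directly, you can compute the hat values (leverages) under the quadratic design matrix and compare the clustered region with the sparse tails. Here is a minimal sketch in Python with numpy; the data-generating numbers are purely hypothetical stand-ins for your data.

```python
import numpy as np

# Hypothetical data: a dense cluster near the middle of the x-range plus sparse tails.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(5.0, 0.3, 400),    # dense cluster
                    rng.uniform(0.0, 10.0, 50)])  # sparse spread

X = np.column_stack([np.ones_like(x), x, x**2])   # quadratic design matrix

# Hat values: diagonal of X (X'X)^{-1} X'
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

in_cluster = np.abs(x - 5.0) < 1.0
print("mean leverage inside cluster: ", h[in_cluster].mean())
print("mean leverage outside cluster:", h[~in_cluster].mean())
```

In a configuration like the one plotted, the clustered points near the centre of the $X$ distribution typically have much smaller hat values than the few points out in the tails.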

On the other hand, if you are worried the cluster won't have enough influence (I don't interpret this as your concern), then you could use WLS (weighted least squares). You would develop a weighting scheme from some prior theoretical understanding that would make the clustered points more influential than the more sparse points to the sides.
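As a minimal sketch of what WLS looks like in practice (using statsmodels; the weighting scheme below is purely hypothetical and stands in for whatever your prior theory suggests):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: dense cluster in the middle of the x-range, noisy quadratic response.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(5.0, 0.3, 400), rng.uniform(0.0, 10.0, 50)])
y = 2.0 - 0.5 * x + 0.1 * x**2 + rng.normal(0.0, 1.0, x.size)

X = sm.add_constant(np.column_stack([x, x**2]))

# Hypothetical weights: make the clustered points more influential than the sparse tails.
w = np.where(np.abs(x - 5.0) < 1.0, 2.0, 1.0)

print("OLS:", sm.OLS(y, X).fit().params)
print("WLS:", sm.WLS(y, X, weights=w).fit().params)
```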

For example, if you repeatedly sample the same x-value ($x_j$), you should get an ever better approximation of the vertical position of $f(x_j)$. As a general rule of study design, you should try to oversample locations in $X$ where you are most interested in knowing $f(X)$. However, note that if you oversample in an area you don't care about and undersample where you are primarily interested, and your functional form is misspecified (e.g., you include only a linear term, when a quadratic is required), then you could end up with biased estimates of $f(X)$ in your locations of interest.
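A small simulation makes the last point concrete: the true function below is quadratic, the fitted model is (wrongly) linear only, and most of the sample sits far from the region of interest. All numbers are hypothetical and chosen only to show the mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return 1.0 + 0.5 * x + 0.3 * x**2   # hypothetical quadratic truth

# Oversample near x = 0 (an area we don't care about), undersample near x = 9 (the area of interest).
x = np.concatenate([rng.normal(0.0, 0.5, 500), rng.uniform(8.0, 10.0, 5)])
y = true_f(x) + rng.normal(0.0, 1.0, x.size)

# Misspecified model: linear term only.
slope, intercept = np.polyfit(x, y, deg=1)

print("true f(9):      ", true_f(9.0))
print("linear fit at 9:", intercept + slope * 9.0)
```

Because the straight line has to compromise between the dense mass near zero and the handful of far points, it systematically underestimates $f$ in the sparse region of interest; with a correctly specified quadratic, the same sampling scheme would be far less of a problem.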


If you are worried that the nature of the function $f(X)$ might be different inside the cluster than outside it, @user777's suggestion to use splines will take care of that.

* Also see my answer here: Interpreting plot.lm().

gung - Reinstate Monica
  • Gung comes to the rescue! This is the answer I was looking for, many thanks. I mean, what threw me off is this: imagine that I repeatedly sample the same x for the same y and add that to my sample data; isn't that tantamount to increasing the weight on that particular x, and isn't something similar happening here? I guess what you mean is that the weight may be high, but the error term would be low, as we are close to the mean? – gbh. Nov 30 '15 at 17:49
  • You're welcome, @gbh. (See edit for response to comment.) – gung - Reinstate Monica Nov 30 '15 at 17:58
  • Thanks gung. As a follow-up, if I sample an x over and over again, that is the same as weighting that observation more. Whether or not the sampling changes the regression depends on the leverage of x, right? If the leverage of x = 0, I can sample it infinitely more often and it won't change anything, right? – gbh. Nov 30 '15 at 19:26
  • Not exactly. If you sample the mean of X a million times, and other x-values 5 times each, that won't have any effect on the slope but will affect the intercept; w/ a quadratic term things will get more complicated. – gung - Reinstate Monica Nov 30 '15 at 19:42
  • OK. How do you prove this, "For example, if you repeatedly sample the same x-value (xj), you should get an ever better approximation of the vertical position of f(xj)"? – gbh. Nov 30 '15 at 20:54
  • @gbh., there are some assumptions at play there, but the basic idea follows directly from the [law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers). If you need a complete proof, you should ask a new question; that doesn't belong in comments or in an otherwise unrelated thread. – gung - Reinstate Monica Nov 30 '15 at 21:04
  • Ok, posting it now, thanks a lot for bearing with me gung. – gbh. Nov 30 '15 at 21:08

Do you have reason to believe the relationship is quadratic, or is the shape more complicated? Restricted cubic splines fit flexible functions to different sections of the data, and cubic spline design matrices can be estimated with OLS. MARS accomplishes a similar effect (different models for different subsets of points), but discovers the subset locations on its own.

Splines build up from this idea: a data set can be partitioned into different intervals, and each of those intervals can be modeled, with some degree of accuracy, as a constant. Alternatively, you can model each interval as a linear function. Still more elaborately, you can make the linear functions continuous at the interval boundaries; these boundaries are the "knots." This solution is a piecewise linear model. Restricted cubic splines further require that the function have continuous first and second derivatives at the knots, and that it be linear on the intervals $(-\infty, k_1]$ and $[k_n, \infty)$, i.e., outside the boundary knots. This linearity requirement is motivated by a desire not to over-extrapolate in very data-sparse regions. Restricted cubic splines allow for non-monotonic functions; e.g., the function can increase, then decrease, then increase again.

How to select knot locations is a sticky wicket, since it will influence what the ultimate model looks like. Frank Harrell's book Regression Modeling Strategies has some recommendations based on quantiles.
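To make the "design matrices can be estimated with OLS" point concrete, here is a minimal sketch that builds a restricted cubic spline basis by hand (Harrell's truncated-power parameterization, with knots placed at quantiles roughly along the lines of his recommendations) and fits it with ordinary least squares. The data are hypothetical; substitute your own `x` and `y`:

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted (natural) cubic spline basis: a linear term plus k-2 nonlinear
    terms constructed so the fitted function is linear beyond the boundary knots."""
    t = np.asarray(knots, dtype=float)
    k = len(t)
    norm = (t[-1] - t[0]) ** 2
    pos3 = lambda u: np.clip(u, 0.0, None) ** 3
    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / (t[-1] - t[-2])
                + pos3(x - t[-1]) * (t[-2] - t[j]) / (t[-1] - t[-2]))
        cols.append(term / norm)
    return np.column_stack(cols)

# Hypothetical clustered data with a non-monotonic response.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(5.0, 0.3, 400), rng.uniform(0.0, 10.0, 50)])
y = np.sin(x) + rng.normal(0.0, 0.3, x.size)

# Five knots at quantiles of x (quantile-based placement as recommended by Harrell).
knots = np.quantile(x, [0.05, 0.275, 0.50, 0.725, 0.95])

X = np.column_stack([np.ones_like(x), rcs_basis(x, knots)])  # spline design matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)                 # plain OLS fit
print(beta)
```

Note that with a dense cluster in $x$, quantile-based knots land mostly inside the cluster, so the fitted function is allowed to flex there and stays linear in the sparse tails.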

Sycorax
  • I am fitting a quadratic function and yes, they do have that physical and visual relationship. – gbh. Nov 30 '15 at 16:43
  • How do splines get over the problem of clustering in general? – gbh. Nov 30 '15 at 16:45
  • Just added a pic. – gbh. Nov 30 '15 at 16:48
  • While the idea of MARS is interesting, I think the goal here is to fit one model but somehow address the clustering of the data. – gbh. Nov 30 '15 at 16:53
  • Would you please elaborate more: how would a spline address that? As in, the clustering in the data. Thanks a lot. – gbh. Nov 30 '15 at 16:56
  • I guess I am hesitant because 1) we know that the relationship is quadratic and that is the choice of the model 2) the problem becomes how to account for the increased weight of points in the clustered region and shift the weight to more sparse regions 3) there is no reason to believe that the relationship in the cluster is not quadratic as per 1 – gbh. Nov 30 '15 at 17:13
  • IMHO, there's nothing quadratic about the plot you've shared. It looks like if you're near the middle, you're bi-variate normal with near-0 off-diagonal correlation. If you're not near the middle, you're near a constant value. I don't know what you mean in comment (2). – Sycorax Nov 30 '15 at 17:16
  • Well, the data are noisy, but a quadratic fit yields an R-squared of 0.1 (to be expected) at a significance of 99%. – gbh. Nov 30 '15 at 17:18
  • Those aren't really meaningful metrics. Check what the difference in out-of-sample RMSE is for constant, linear, and quadratic models. I think you'll be surprised by what you find. – Sycorax Nov 30 '15 at 17:20
  • No, I understand what you are saying, don't get me wrong. The data are generated by certain physical processes which are known to be super noisy but are physically modeled and understood to be quadratic. If I do an out-of-sample comparison, I won't get any meaningful difference between the constant, linear, and quadratic models, precisely due to the noise in the data. – gbh. Nov 30 '15 at 17:25
  • I don't know what to say. If you're convinced the data **must be** quadratic, then there's no model selection problem at all: fit a quadratic model and knock off for the day. On the other hand, your data don't look quadratic at all (to me!), which provides some evidence that it might not be quadratic (or, alternatively, it may be quadratic and also noisy). I'm not sure I can help here. – Sycorax Nov 30 '15 at 17:33