I'm currently working on a predictive modeling project. I have to predict $Y$ given variables $X_1, X_2, X_3$ and $X_4$, which are not necessarily independent. Our first idea was to propose a linear regression model defined as $$Y = \beta_0+\beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4.$$
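For concreteness, this is roughly how I fit the model (the file name and column names below are placeholders for my actual data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder: a DataFrame with columns X1, X2, X3, X4 and Y
df = pd.read_csv("data.csv")

# Ordinary least squares fit of Y on X1..X4 with an intercept
model = smf.ols("Y ~ X1 + X2 + X3 + X4", data=df).fit()
print(model.summary())
```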
In my dataset ($10^5$ observations), I have noticed that a lot of the data is 'grouped'. To clarify what I mean by 'grouped': I have pairs of observations $(x_{1i}, x_{2i},x_{3i},x_{4i},y_{i})$ and $(x_{1j},x_{2j},x_{3j},x_{4j},y_{j})$ with $$x_{1i} = x_{1j}, \quad x_{2i} = x_{2j}, \quad x_{3i} = x_{3j}, \quad x_{4i} \neq x_{4j}, \quad y_i \neq y_j,$$
where $1 \leq i,j \leq 10^5$ and $x_{kl}$ denotes the $l$th observation of variable $X_k$, $k \in \{1,2,3,4\}$.
In other words, there are many observations where $X_1$, $X_2$ and $X_3$ coincide while the $X_4$'s and the $Y$'s differ substantially. After fitting the model, the predictive performance was very poor. I believe this 'grouped' data strongly affects the goodness of fit, since the model tries to accommodate all of these conflicting observations at once, which leads to overfitting.
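To illustrate, this is roughly how I checked for these groups (again with placeholder column names): within many combinations of $(X_1, X_2, X_3)$ the responses $Y$ still vary considerably.

```python
# Group observations that share the same (X1, X2, X3) triple
groups = df.groupby(["X1", "X2", "X3"])["Y"]

# Count observations and measure the spread of Y within each group
summary = groups.agg(n="size", y_std="std")

# Groups with more than one observation but non-trivial variation in Y
print(summary[(summary["n"] > 1) & (summary["y_std"] > 0)].head())
```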
Is there a standard way to deal with this kind of data?
Thanks in advance!