Regression Model Replicates: Precision VS Significant Figures

Question

I am trying to figure out how many significant figures I should report after doing a linear regression in Excel.

I have a dataset of 740 entries.

I use 75% of them as a training set for the regression and 25% as the testing set. Each set is determined randomly.

I ran a first regression and obtained some values for my coefficients. When I use them, I get a pretty good match with my test data: the average residual is 0.14.

However, if I re-do the regression with another random training set, I get slightly different coefficients.

I therefore round my coefficients to the closest common digit between the two replicates, and use those rounded coefficients to predict the output of my test set. I now get a mean residual of 0.5, which is almost 4 times bigger than when I used the non-rounded coefficients from my first replicate.
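One way to see why rounding can hurt this much: the rounding error in a coefficient is multiplied by the predictor value, so a small change in the coefficient can become a large change in the prediction. Below is a minimal Python sketch of that arithmetic. The slope `0.12345`, the predictor values, and the helper names `round_sig` and `mean_prediction_shift` are all made up for illustration; they are not taken from the dataset in the question.

```python
from math import floor, log10


def round_sig(value, digits):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    return round(value, digits - 1 - floor(log10(abs(value))))


def mean_prediction_shift(slope, rounded_slope, x_values):
    """Mean absolute change in slope * x caused by rounding the slope."""
    shifts = [abs(slope - rounded_slope) * x for x in x_values]
    return sum(shifts) / len(shifts)


# Hypothetical fitted slope and predictor values (assumptions, not real data).
slope = 0.12345
x_values = [100.0, 500.0, 1000.0]

for digits in (2, 3, 4):
    rounded = round_sig(slope, digits)
    shift = mean_prediction_shift(slope, rounded, x_values)
    print(f"{digits} significant figures: slope={rounded}, mean shift={shift:.4f}")
```

With predictors in the hundreds, keeping only two significant figures moves the predictions by more than a unit on average, while each extra digit shrinks the shift by roughly a factor of ten.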

As you can see from the graph, when I use the unrounded coefficients from each replicate I get much better results than when I use the coefficients rounded to the closest matching digit between the replicates.

I feel like I should report the rounded values, as the results are likely to be the same no matter what portion of the dataset is used as the training set. Yet I find it confusing that someone using those rounded values would get a worse match...

Any recommendations?

Tim · Accepted Answer · 2017-10-18T12:48:50.070

There are a number of things wrong with your approach:

If you split your data into a train and a test set, estimate your model, and then split your data again into a different train and test set, you are potentially leaking information from the test set, since some observations from the first test set will potentially fall into the second training set and vice versa. This is bad and destroys the whole idea of keeping a separate test set.
Rounding the values obviously makes them less precise, so why would you round them?
The mean of the residuals is not a reasonable error measure, so why aren't you using a commonly used measure such as the mean of squared residuals? For example, a model that made predictions like the ones shown below would have a mean of residuals equal to zero. Would you say that it made perfect predictions?
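The point about the mean of residuals can be checked with a few numbers. Here is a minimal Python sketch with invented observed and predicted values (they are not from the question): every prediction is off by 5, yet the positive and negative residuals cancel.

```python
# Invented values for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [15.0, 15.0, 35.0, 35.0]

residuals = [o - p for o, p in zip(observed, predicted)]
n = len(residuals)

mean_residual = sum(residuals) / n              # signed errors cancel out
mean_abs_error = sum(abs(r) for r in residuals) / n
mean_sq_error = sum(r ** 2 for r in residuals) / n

print("mean residual:      ", mean_residual)
print("mean absolute error:", mean_abs_error)
print("mean squared error: ", mean_sq_error)
```

The mean residual is zero even though no prediction is correct, while the mean absolute error and the mean squared error both report the misfit.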

There are also a number of better regression diagnostic plots than predicted vs observed.

edited Oct 18 '17 at 12:48

answered Oct 18 '17 at 12:35

Thank you for the reply. Would you say I can stick to one train and test set and just report the values to some "acceptable" accuracy, say 2 or 3 digits? – Sorade Oct 18 '17 at 13:36

@Sorade yes, this is a pretty standard approach. Of course there are scenarios where you use multiple train/test sets, like k-fold cross-validation, but you don't use them to average the parameters over the sets; you use them to verify the model's performance. – Tim Oct 18 '17 at 14:07
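As a rough illustration of the k-fold idea mentioned in the comment above, here is a minimal Python sketch that uses only the standard library. It assumes a single predictor and synthetic data standing in for the 740 rows; `fit_line` and `k_fold_mse` are hypothetical helper names, not functions from any library. The folds are used to score the model on held-out rows, and the fitted coefficients from each fold are deliberately not averaged.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k=5, seed=0):
    """Return the held-out mean squared error for each of k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint index sets
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores


# Synthetic data standing in for the 740-row dataset (an assumption).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(740)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

fold_scores = k_fold_mse(xs, ys, k=5)
print("per-fold MSE:", [round(s, 3) for s in fold_scores])
print("mean MSE:", round(sum(fold_scores) / len(fold_scores), 3))
```

If the per-fold errors are similar, that suggests the model's performance does not depend much on which rows were held out. The coefficients to report would still come from a single fit.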
Just to make sure I understood you correctly: in my case, because the model already performs well, I can just report the train and test set I used and the corresponding regression statistics? – Sorade Oct 18 '17 at 14:47
@Sorade Yes. Splitting it multiple times does not help, and it invalidates the procedure of using a separate test set. – Tim Oct 18 '17 at 14:50
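To make the leakage concern concrete, the following Python sketch draws two independent random 75/25 splits of 740 row indices and counts how many rows held out in the first split end up in the training set of the second. The seeds and the helper name `random_split` are arbitrary choices for illustration.

```python
import random


def random_split(n_rows, test_fraction, seed):
    """Return (train, test) sets of row indices for one random split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return set(indices[n_test:]), set(indices[:n_test])


train_1, test_1 = random_split(740, 0.25, seed=1)
train_2, test_2 = random_split(740, 0.25, seed=2)

# Rows that were held out in split 1 but used for fitting in split 2.
leaked = test_1 & train_2

print("rows in first test set:", len(test_1))
print("of those, used for training in second split:", len(leaked))
```

On average about three quarters of the first test set lands in the second training set, so any decision that combines both fits (such as rounding the coefficients to the digits they share) has already seen most of the data that was meant to stay held out.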
