Why do tiny changes in data cause big changes in accuracy?

Question

I have a test set of shape (1100, 273). I impute just 3 values that are more than 20 standard deviations away from the mean with their column means, and my accuracy ($R^2$) goes from 0.286 to -6.59983 and my log MSE from 0.144 to 1.479. The predictions also go from all positive to a mix of positive and negative. Can you please explain how such a tiny change can produce a totally different result? I would expect linear regression to get only slightly worse accuracy at most. Why is the accuracy suddenly so bad? Is there something intrinsically 'broken' in the data that makes the predictions so bad?

(Additionally I don't get how variance/variance could ever be negative.)

Here is how I replace the values:

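import numpy as np
import pandas as pd

for col in data.columns:
    non_null = data[col].dropna()
    # Skip empty columns and boolean columns (checked via the first non-null value).
    if non_null.empty or isinstance(non_null.iloc[0], (bool, np.bool_)):
        continue
    col_mean = data[col].mean()
    # Flag values more than 20 standard deviations from the column mean.
    is_outlier = (data[col] - col_mean).abs() > 20 * data[col].std()
    data.loc[is_outlier, col] = col_mean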

You can get a negative $R^2$ (albeit it is a relatively uncommon phenomenon that occurs when a fit is rather suboptimal); see [here](http://stats.stackexchange.com/questions/12900) for some additional details. Having said that, removing very strong outliers ($>20\times$ SD away from the mean) makes it very plausible that some pretty dramatic changes will take place in your final estimates. Remember, the linear regression estimates are effectively conditional means such that $E(Y|X) = X\beta$. — usεr11852, Nov 27 '16 at 01:09
If you are getting an $R^2$ of minus 6 as your post suggests then you have serious programming problems. — mdewey, Nov 27 '16 at 11:58
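
As a small illustration of the point in the comments about negative $R^2$, here is a minimal sketch. It assumes the score is computed as $1 - SS_{res}/SS_{tot}$ on held-out data, which may not be exactly how the asker computed it. Under that definition the score is not a ratio of two variances, so it drops below zero whenever the predictions are worse than always predicting the mean of the targets. The helper `r_squared` and all the numbers are made up for the example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination computed as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared errors of "predict the mean"
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up targets
good_pred = y_true + 0.1                          # close to the truth
mean_pred = np.full_like(y_true, y_true.mean())   # always predict the mean
bad_pred = y_true[::-1] * 3.0                     # far worse than predicting the mean

r2_good = r_squared(y_true, good_pred)
r2_mean = r_squared(y_true, mean_pred)
r2_bad = r_squared(y_true, bad_pred)
print(r2_good, r2_mean, r2_bad)
```

The first score is close to 1, the second is 0, and the third is negative because its squared errors exceed those of the constant mean prediction.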

score 1 · Answer 1 · answered Nov 27 '16 at 03:15

I am not sure about your exact problem setting, but outliers can have a big impact on a linear regression fit that uses least squares as the loss function. Because least squares penalizes squared differences, a single extreme point contributes disproportionately to the loss, so any outlier that significantly distorts the mean will affect the validity of your least-squares fit.
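
To make this concrete, here is a minimal sketch on synthetic data (not the asker's data). It fits a straight line by ordinary least squares twice, once on clean data and once after a single value has been pushed far away from the rest. The helper `fit_line`, the seed, and the size of the perturbation are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
# Synthetic data: a line with slope 2 and intercept 1 plus small noise.
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    X = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # array([slope, intercept])

slope_clean, intercept_clean = fit_line(x, y)

# Corrupt a single observation at the edge of the x range.
y_out = y.copy()
y_out[-1] += 500.0
slope_out, intercept_out = fit_line(x, y_out)

print("clean fit:   ", slope_clean, intercept_clean)
print("with outlier:", slope_out, intercept_out)
```

Only 1 of the 50 points differs between the two fits, yet the estimated slope changes substantially. The effect is strongest when the extreme value sits far from the center of the predictor range, where it has the most leverage.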

One reference that is traditionally mentioned when introducing outliers is Anscombe's quartet, a set of plots that illustrate a situation similar to yours.

https://en.wikipedia.org/wiki/Anscombe's_quartet
