I have a test set of shape (1100, 273). I replace just 3 values that are more than 20 standard deviations from their column mean with the column mean, and my accuracy (R²) drops from 0.286 to -6.59983, my log MSE goes from 0.144 to 1.479, and the predictions go from all positive to a mix of positive and negative. Can you please explain how such a tiny change can produce a totally different result? I would expect linear regression to get at most slightly worse. Why is the accuracy suddenly so bad? Is there something intrinsically 'broken' in the data that makes the predictions so bad?
(Additionally, I don't understand how R², which I thought was a ratio of variances, could ever be negative.)
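For reference, my understanding is that R² is defined as 1 - SS_res/SS_tot (this is also what sklearn's r2_score computes), so as a quick sanity check on made-up numbers it does go below zero whenever the predictions fit worse than just predicting the mean:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])  # fits worse than a constant mean prediction

# R^2 = 1 - SS_res / SS_tot; negative whenever SS_res > SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)         # 20.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 5.0
print(1 - ss_res / ss_tot)       # -3.0
print(r2_score(y_true, y_pred))  # -3.0, matches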
Here is how I replace the values:
import numpy as np
import pandas as pd

for col in list(data):
    # first non-NaN value, used to detect boolean columns
    unique_val = np.array(pd.Series(data[col].unique()).dropna())[0]
    # np.array() turns bools into np.bool_, so check both rather than `type(...) is bool`
    if not isinstance(unique_val, (bool, np.bool_)):
        # flag entries more than 20 standard deviations from the column mean
        bools = np.abs(data[col] - data[col].mean()) > (20 * data[col].std())
        data.loc[np.array(bools), col] = data[col].mean()
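And here is a toy run of the same replacement on synthetic data (column name and values hypothetical, just to show the mechanics). One thing it makes visible: the flagged value is replaced with the column mean computed while the outlier is still in the column, not the clean mean:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({'a': rng.normal(size=1100)})
data.loc[0, 'a'] = 1e9  # plant one extreme value, roughly 33 sigma out

bools = np.abs(data['a'] - data['a'].mean()) > (20 * data['a'].std())
print(bools.sum())  # 1: only the planted value is flagged

data.loc[np.array(bools), 'a'] = data['a'].mean()
print(data['a'].max())  # ~9.1e5, i.e. the mean computed WITH the outlier included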