Should you standardize your variables before or after removing outliers?

Question

Setting aside the question of how to operationalize outliers, or the utility of doing so, and assuming the dependent and independent variables are all scaled in the main regression specification (centered and divided by their standard deviations), should the scaling happen before or after outlier removal?

Specifically, I'm wondering whether the p-values of the coefficients will be affected at all by this decision.

Here is a simulation of what I'm talking about:

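meeting_count <- c(.01, .02, .01, .05, .03, .025)
# `revenue` is used below but its values are not shown in the post.
# Note: R drops index 0, so x[0:4] selects the same elements as x[1:4].

# Scale on the full sample, then keep the first four observations
revenue_scaled <- scale(revenue)
summary(lm(revenue_scaled[0:4] ~ meeting_count[0:4]))

# Keep the first four observations, then scale
revenue_post_scaled <- scale(revenue[0:4])
summary(lm(revenue_post_scaled ~ meeting_count[0:4]))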

Perhaps this is just dumb luck, but here are the summary outputs:

> summary(lm(revenue_post_scaled ~ meeting_count[0:4]))
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          -1.025      0.526   -1.95     0.19
meeting_count[0:4]   45.572     18.893    2.41     0.14

> summary(lm(revenue_scaled[0:4] ~ meeting_count[0:4]))
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.42611    0.00536  -79.57  0.00016 ***
meeting_count[0:4]  0.46401    0.19237    2.41  0.13734

It depends on the specifics of the model. In standard applications, where the model is linear in all the variables, scaling doesn't do anything to the solutions or the p-values, so it doesn't matter how you do it (or even whether you do it at all), except insofar as it might affect numerical precision. Could you therefore make your question more specific? — whuber, Jul 02 '19 at 16:04
@whuber, the actual model is a standard lm() regression that looks like this: `lm(y ~ x + x^2)` where all variables have been standardized. The question is whether to standardize y and x before or after the removal of outliers. It appears as if the p-values are unaffected. Are the beta estimates also supposed to be equivalent, but the decimal point just shifted over twice? — Parseltongue, Jul 03 '19 at 14:50
No, but the beta estimates have a predictable relationship to each other. I believe this is answered at https://stats.stackexchange.com/questions/68202. The p-value of the intercept, BTW, changes because the two tests are not the same: the original test compares the intercept to $0$ while the test based on transformed values of the regressors and response is tantamount to comparing the *original* intercept to some other number, usually nonzero. Again, that other number can be predicted and is easily computed in terms of the transformations that were applied. — whuber, Jul 03 '19 at 14:54
Other relevant posts include https://stats.stackexchange.com/questions/47178/ (addressing the quadratic term) and https://stats.stackexchange.com/questions/110171. — whuber, Jul 03 '19 at 14:58
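
The point made in the comments above can be checked with a small simulation. This is only a sketch on made-up data: `x`, `y`, the seed, and the 2-standard-deviation rule in `keep` are all assumptions chosen for illustration, not the poster's data or outlier rule. It compares standardizing before and after dropping the flagged rows in a simple linear model.

```r
# Hypothetical simulated data (not the revenue / meeting_count values from the question)
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

# Illustrative outlier rule (an assumption): keep rows within 2 SDs of the mean of y
keep <- abs(y - mean(y)) < 2 * sd(y)

# Option A: standardize on the full sample, then drop the flagged rows
y_a <- as.numeric(scale(y))[keep]
x_a <- as.numeric(scale(x))[keep]

# Option B: drop the flagged rows, then standardize what remains
y_b <- as.numeric(scale(y[keep]))
x_b <- as.numeric(scale(x[keep]))

fit_a <- summary(lm(y_a ~ x_a))$coefficients
fit_b <- summary(lm(y_b ~ x_b))$coefficients

# The slope's t statistic and p-value are the same under both options
same_t <- isTRUE(all.equal(fit_a[2, "t value"], fit_b[2, "t value"]))
same_p <- isTRUE(all.equal(fit_a[2, "Pr(>|t|)"], fit_b[2, "Pr(>|t|)"]))

# The slope estimates differ by a factor fixed by the standard deviations used
ratio <- fit_a[2, "Estimate"] / fit_b[2, "Estimate"]
expected_ratio <- (sd(x) / sd(y)) * (sd(y[keep]) / sd(x[keep]))
```

Comparing `ratio` with `expected_ratio` shows that the two slopes are related by the ratio of scaling factors rather than by a shifted decimal point. The intercept rows of `fit_a` and `fit_b` need not agree, because the two fits test the intercept against different reference values.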

score 2 · Accepted Answer · answered Jul 02 '19 at 15:09

It depends on exactly what you need for your use case, but if you remove outliers after standardizing, the resulting data won't be standardized anymore (if many outliers are removed, the standard deviation could become considerably smaller than 1).

So, if you are about to use a procedure where scaled data is needed, you should definitely remove your outliers first, then standardize. Otherwise you may end up with different variables having different standard deviations (which is an issue, for example, in PCA).
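
As a rough illustration of this point, the sketch below uses made-up data: the vector `v`, the seed, and the |z| < 3 cutoff are assumptions for the example only. It shows the standard deviation falling below 1 when outliers are removed after standardizing, and being exactly 1 when they are removed first.

```r
# Hypothetical data: 100 standard-normal draws plus two extreme values
set.seed(42)
v <- c(rnorm(100), 8, -9)

# Standardize first, then drop points by an illustrative |z| < 3 rule (an assumption)
z <- as.numeric(scale(v))
inlier <- abs(z) < 3
z_trimmed <- z[inlier]
sd_standardize_first <- sd(z_trimmed)    # no longer 1: the spread shrinks once extremes are gone

# Drop the same points first, then standardize what remains
z_after <- as.numeric(scale(v[inlier]))
sd_remove_first <- sd(z_after)           # 1 by construction
```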

David

2,422
1
4
15

I sense a problem here. The point is not to standardize but rather to "transform" variables to put them into a format that is more physically correct. Untransformed data is frequently misleading and outliers are frequently miscategorized such that both the question and answer appear to be incorrect. – Carl Jul 03 '19 at 00:36

Should you standardize your variables before or after removing outliers?

1 Answers1