
We've run a split test of a new product feature and want to measure if the uplift on revenue is significant. Our observations are definitely not normally distributed (most of our users don't spend, and within those that do, it is heavily skewed towards lots of small spenders and a few very big spenders).

We've decided on using bootstrapping to compare the means, to get round the issue of the data not being normally distributed (side-question: is this a legitimate use of bootstrapping?)

My question is, do I need to trim outliers from the data set (e.g. the few very big spenders) before I run the bootstrapping, or does that not matter?
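To make the setup concrete, here is a minimal sketch of a percentile bootstrap comparison of means in Python/NumPy. The data are simulated stand-ins for the real logs (the ~90% zero-spend rate and lognormal spend amounts are made-up assumptions, as is the helper name `bootstrap_mean_diff`); note that all observations, including the very big spenders, stay in the resampling pool.

```python
import numpy as np

def bootstrap_mean_diff(a, b, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a).

    Each group is resampled with replacement in full -- no
    observations (e.g. big spenders) are trimmed beforehand.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(b, size=b.size, replace=True).mean()
                    - rng.choice(a, size=a.size, replace=True).mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)

# Simulated revenue per user: mostly zeros, heavily skewed spend otherwise.
rng = np.random.default_rng(1)
control = np.where(rng.random(2000) < 0.90, 0.0, rng.lognormal(3.0, 1.0, 2000))
variant = np.where(rng.random(2000) < 0.88, 0.0, rng.lognormal(3.1, 1.0, 2000))

est, (lo, hi) = bootstrap_mean_diff(control, variant)
print(f"mean uplift ~ {est:.2f}, 95% bootstrap CI ({lo:.2f}, {hi:.2f})")
```

If the 95% interval excludes zero, the uplift is significant at the 5% level under this resampling scheme.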

gung - Reinstate Monica
user31228
  • Good question: I can probably argue both for and against the removal of outliers. Why not use medians if you are worried about outliers and what you are looking for is just a "central tendency"? Given that money-related variables often have highly skewed distributions (e.g. Pareto), that might not be unreasonable in the first place. – usεr11852 Oct 08 '13 at 15:14
  • @user11852 Medians tell you little about the mean, which is what is relevant to revenue. It would be interesting to see your argument in favor of removing the "outliers," especially when these are likely the major contributors to the total revenue. – whuber Oct 08 '13 at 15:33
  • Unfortunately the median would always be zero, as < 10% of users spend at all. – user31228 Oct 08 '13 at 15:35
  • Also, consider whether in practice you make more profit from the "lots of small spenders" or the "very few big spenders." If you make your profit from those "outliers" you probably don't want to remove them--maybe you want to analyze them separately. – EdM Oct 08 '13 at 15:38
  • @whuber: Let me stress, it was a comment, not an answer. I am definitely not an expert on bootstrapping; I would generally argue that outliers are *legitimate* data, if they are not obviously corrupted observations. Nevertheless, if one bootstraps a somewhat small sample that has obvious outliers, I would worry that they could end up "amplifying" their influence or overlooking sample heterogeneity. – usεr11852 Oct 08 '13 at 15:46
  • @user11852 Your general argument that outliers are legitimate is helpful. But, concerning the possibility of amplification, it seems to me that the contrary is true: bootstrapping has a chance of working only if the full sample is used. Otherwise it presents a fairy tale, telling us how things would be if outliers didn't exist--but obviously they do. The larger problem is that bootstrapping has little theoretical justification when applied to small samples: the theory is an *asymptotic* one. – whuber Oct 08 '13 at 15:53
  • @whuber: I agree with what you say. Regarding your median comment: I guess an issue is that I treated revenue as roughly equivalent to income. A debate between using the mean or the median usually does arise there, e.g. in household income cases. (In retrospect, with less than 10% of the users generating revenue, that is definitely not a good assumption.) Also, I didn't mean to imply that a median "is the mean" or something like that. I specifically mentioned it *as a "central tendency"* value. – usεr11852 Oct 08 '13 at 16:17
  • @user11852 And that's the crux of the matter: central tendency is not terribly meaningful when tracking revenue; only the sum (or equivalently, the mean) is. As far as the less than 10% goes, that depends on the business. Plenty rely on just covering costs with routine transactions and making profits on a small number of very large or high-profit sales. *E.g.*, one model for how an airline could make a profit is from the outsized margins reaped from the very small number of first-class passengers. – whuber Oct 08 '13 at 16:24
  • @whuber: Cool, thank you for the insight on the matter! – usεr11852 Oct 08 '13 at 16:59
  • This is an important question (+1). Can you add a small sample of your dataset, or a simulated sample resembling it, to the question? I think providing an illustration will be more fruitful in this case. – user603 Oct 08 '13 at 17:31

2 Answers


Before addressing this, it's important to acknowledge that the statistical malpractice of "removing outliers" has been wrongly promulgated in much of applied statistical pedagogy. Traditionally, outliers are defined as high-leverage, high-influence observations. One can and should identify such observations in the analysis of data, but those conditions alone do not warrant removing them. A "true outlier" is a high-leverage, high-influence observation that is inconsistent with replications of the experimental design. Deeming an observation as such requires specialized knowledge of the population and of the science behind the "data generating mechanism". The most important point is that you should be able to identify potential outliers a priori.

As for the bootstrapping aspect, the bootstrap is meant to simulate independent, repeated draws from the sampling population. If you prespecify exclusion criteria in your analysis plan, you should still leave the excluded values in the referent bootstrap sampling distribution: that way you account for the loss of power due to applying exclusions after sampling your data. If, however, there are no prespecified exclusion criteria and outliers are removed by post hoc adjudication, which I'm obviously arguing against, removing these values will propagate the same errors in inference that removing outliers always causes.
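The resampling scheme described above can be sketched as follows: resample the full data set each iteration, then apply the prespecified exclusion rule *inside* the iteration, so the effective sample size varies across replicates. This is an illustrative sketch only; `bootstrap_with_exclusion` and the $100,000 cutoff are hypothetical names and numbers, not anything from the question.

```python
import numpy as np

def bootstrap_with_exclusion(data, stat, exclude, n_boot=5_000, seed=0):
    """Bootstrap a statistic while keeping excluded cases in the
    resampling pool: exclusions are applied after each resample,
    so the effective sample size varies across iterations.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    reps = []
    for _ in range(n_boot):
        sample = rng.choice(data, size=data.size, replace=True)
        kept = sample[~exclude(sample)]   # rule applied per iteration
        if kept.size:                     # an iteration may discard many cases
            reps.append(stat(kept))
    return np.asarray(reps)

# Hypothetical prespecified rule: exclude spends over $100,000.
spend = np.array([0, 0, 0, 0, 5, 12, 30, 250, 120_000.0])
reps = bootstrap_with_exclusion(spend, np.mean, lambda x: x > 100_000)
print(reps.mean(), reps.std())
```

Contrast this with first dropping the big spender and then resampling the remaining eight values, which would pretend the exclusion count is fixed rather than random.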

Consider a study on wealth and happiness in an unstratified simple random sample of 100 people. If we took the statement, "1% of the population holds 90% of the world's wealth" literally, then we would observe, on average, one very highly influential value. Suppose further that, beyond affording a basic quality of life, there was no excess happiness attributable to larger income (nonconstant linear trend). So this individual is also high leverage.

The least squares regression coefficient fit on the unadulterated data estimates a population-averaged first-order trend in these data. It is heavily attenuated by the one individual in the sample whose happiness is consistent with those near median income levels. If we remove this individual, the least squares regression slope is much larger, but the variance of the regressor is reduced, so inference about the association is approximately the same. The difficulty with doing this is that I did not prespecify conditions under which individuals would be excluded. If another researcher replicated this study design, they would sample an average of one high-income, moderately happy individual, and obtain results inconsistent with my "trimmed" results.
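A toy simulation of this scenario (all numbers are made up purely for illustration: 99 ordinary incomes plus one extreme earner whose happiness plateaus) shows the attenuation directly:

```python
import numpy as np

rng = np.random.default_rng(42)

# 99 "ordinary" incomes (thousands of $) plus one extreme earner
# whose happiness has plateaued -- high leverage and high influence.
income = np.append(rng.normal(50, 10, 99), 5000.0)
happiness = np.where(income < 100, 0.1 * income, 10.0)
happiness += rng.normal(0, 0.5, 100)          # measurement noise

slope_full = np.polyfit(income, happiness, 1)[0]   # attenuated slope
keep = income < 100                                # post hoc trimming
slope_trim = np.polyfit(income[keep], happiness[keep], 1)[0]

print(slope_full, slope_trim)   # trimmed slope is far larger
```

The trimmed slope recovers the within-moderate-income trend, but it answers a different, unprespecified question, which is exactly the problem described above.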

If we were a priori interested in the moderate-income/happiness association, then we should have prespecified that we would, e.g., "compare individuals earning less than $100,000 annual household income". Removing the outlier instead causes us to estimate an association we cannot describe, and hence the p-values are meaningless.

On the other hand, readings from miscalibrated medical equipment and facetious lies in self-reported surveys can be removed. The more accurately the exclusion criteria can be described before the actual analysis takes place, the more valid and consistent the results such an analysis will produce.

AdamO
  • I'm not sure I understand why "*if you prespecify exclusion criteria in your analysis plan, you should still leave excluded values in the referent bootstrap sampling distribution.*" You mention that this is "*because you will account for the loss of power due to applying exclusions after sampling your data.*" I don't see why it is assumed that applying exclusion criteria after sampling leads to loss of power, nor how/why leaving the excluded cases in the bootstrap sample "accounts for" (?) this, nor further why this is something that clearly must be "accounted for." Maybe I'm being dense here. – Jake Westfall Oct 11 '13 at 00:46
  • Well it depends on your sampling rule. If you collect data on 100 individuals and 5 of them are ineligible & excluded, you *could* bootstrap resample 95 observations from the 95 eligible participants, but that wouldn't reflect the fact that if you resampled 100 individuals at random from the population, potentially 10 or 8 or 4 or 0 of them would be ineligible according to your study specifications. This kind of uncertainty affects the distribution and interpretation of the $p$-value under the null hypothesis. Remember, the bootstrap is meant to simulate this kind of sampling. – AdamO Oct 11 '13 at 01:02
  • Hmm, my thinking was that if one did specify the exclusion criteria in advance -- so that we are explicitly not interested in certain types of cases, and presumably future study replications would use these same exclusion criteria -- then it would make sense to leave those cases out of the bootstrap sample, as they are a segment of the population that we do not wish to make any inferences about. I do see how future replications might end up excluding a different proportion of cases, but I can't quite make the connection to why this matters for the cases that we explicitly *are* interested in.. – Jake Westfall Oct 11 '13 at 01:16
  • "then it would make sense to leave those cases out of the bootstrap sample, as they are a segment of the population that we do not wish to make any inferences about." I'm saying allow the bootstrap to sample these cases, then remove them from the model fit upon the bootstrap-sampled population. Doing this allows the effective sample size of each BS iteration to vary. This way, the $p$-value's distribution under $\mathcal{H}_0$ depends on sample size uncertainty (i.e. not knowing how many cases in a fixed sample from an imperfect population will need to be discarded). – AdamO Oct 11 '13 at 20:21

Looking at this as an outlier problem seems wrong to me. If "< 10% of users spend at all", you need to model that aspect. Tobit or Heckman regression would be two possibilities.
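For illustration, here is a minimal hand-rolled Tobit fit (left-censored at zero) on simulated data. This is a sketch, not a vetted implementation: the coefficients, the censoring setup, and the function name `tobit_nll` are all assumptions for the example, and in practice one would prefer a dedicated package (e.g. an R `censReg`/`survreg` fit).

```python
import numpy as np
from scipy import optimize, stats

def tobit_nll(params, X, y):
    """Negative log-likelihood of a Tobit model, left-censored at 0.

    Uncensored obs contribute a normal density term; censored obs
    contribute the probability mass below the censoring point.
    """
    *beta, log_sigma = params
    sigma = np.exp(log_sigma)          # parameterize sigma > 0 via its log
    mu = X @ np.asarray(beta)
    z = (y - mu) / sigma
    ll = np.where(y > 0,
                  stats.norm.logpdf(z) - np.log(sigma),
                  stats.norm.logcdf(-mu / sigma))
    return -ll.sum()

# Simulated data: most users' latent spend propensity is below zero,
# so most observed spends are exactly zero.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
latent = X @ np.array([-1.0, 2.0]) + rng.normal(size=n)
y = np.maximum(latent, 0.0)

res = optimize.minimize(tobit_nll, x0=np.zeros(3), args=(X, y))
print(res.x[:2])   # estimates of the latent-scale coefficients
```

The point of a model like this (or a two-part/Heckman model) is that the mass of zeros is modeled explicitly rather than treated as an outlier or distribution-shape nuisance.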

JKP