
I am trying to find the estimator that best identifies which of several unknown distributions will maximize the expected $\Sigma X_i$ of a future sample. To clarify and simplify: there are two data sets, $Y_i$ and $Z_i$, drawn from similar but likely different pdfs $y$ and $z$. I would like an estimator, computed on $Y_i$ and $Z_i$, that predicts which pdf would produce a larger sum over a future sampling, i.e. a larger $\Sigma X_i$.

The number of elements in each set, $N_y$ and $N_z$, is chosen by our implementation, so we have to normalize the number of elements out. Put another way, we are really interested in the mean $\mu = \Sigma X_i / N_x$, because our future sampling has a fixed $N_x$ independent of the choice between $y$ and $z$. This leads to the standard method of making the choice based on each scenario's mean, $\mu$, and the error on the mean, $\sigma_\mu$. What I am interested in is whether the mean really is the best estimator for choosing the scenario with the highest expected $\Sigma X_i$. In a simple scenario I am sure the answer is yes, but there are some complications.
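To make the standard method concrete, here is a sketch of the usual mean-plus-standard-error decision rule. The spend simulation, buy rates, and scales below are all hypothetical, invented only to illustrate the rule:

```python
import random
import statistics

random.seed(0)

# Hypothetical zero-inflated spend data: most customers spend $0, buyers
# spend an exponentially distributed amount (all parameters invented).
def simulate_spend(n, p_buy, scale):
    return [round(random.expovariate(1 / scale), 2) if random.random() < p_buy
            else 0.0 for _ in range(n)]

Y = simulate_spend(100_000, 0.04, 25.0)
Z = simulate_spend(100_000, 0.05, 25.0)

def mean_and_se(x):
    """Sample mean and its CLT-based standard error sigma_mu."""
    mu = statistics.fmean(x)
    se = statistics.stdev(x) / len(x) ** 0.5
    return mu, se

mu_y, se_y = mean_and_se(Y)
mu_z, se_z = mean_and_se(Z)

# Standard rule: pick the variant with the larger sample mean; the z-score
# says how well the data separate the two means.
z_score = (mu_z - mu_y) / (se_y ** 2 + se_z ** 2) ** 0.5
choice = "Z" if mu_z > mu_y else "Y"
print(choice, round(z_score, 1))
```

With $N$ in the hundreds of thousands the standard error comes straight from the CLT, which is why no bootstrap is needed here.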

There is no good model for the underlying distributions that produce $Y_i$ and $Z_i$, but a histogram of either decreases roughly exponentially. Hence the occurrence of a few large-value outliers, which complicate things. The main reason I think the mean may not be the best estimator is its sensitivity to these outliers. Removing outliers and then bootstrapping is discussed here: Bootstrapping - do I need to remove outliers first? From that discussion I think it is clear that removing outliers is ill advised. Additionally, since the number of elements, $N_y$ and $N_z$, is in the millions, the CLT asserts we do not need to bootstrap at all to obtain $\sigma_\mu$.

Another complication is that the distributions are quantized/discrete in two ways. To illustrate, let's take the example of optimizing for future revenue in a split test. Two scenarios lead to two sets, $Y_i$ and $Z_i$, representing the amount spent by each customer. To see that the underlying pdfs are not continuous, first note that the smallest unit of money is a penny, so the support of each pdf is the integers (in cents). Second, and more importantly, the items for sale do not span the space. Say, for example, there is one item that costs \$20 but no other item in the price range [\$15, \$25]. Since there are a lot of single-purchase customers, there will be a peak at \$20.

The simplest choice of an alternative estimator would be the median, $\tilde{x}$, since it is less sensitive to outliers and invariant under reparameterization. However, since more than 95% of customers spend \$0, the median will always be \$0. I am also not sure the median is a good estimator of the expected $\Sigma X_i$ in general.
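A quick illustration of why the median fails here. The price list (one \$20 item, a gap elsewhere in [\$15, \$25]) and the 3% buy rate are made-up numbers matching the description above:

```python
import random
import statistics

random.seed(1)

# Hypothetical discrete price list; >95% of customers spend $0.
prices = [5.00, 10.00, 20.00, 30.00]
spend = [random.choice(prices) if random.random() < 0.03 else 0.0
         for _ in range(100_000)]

med = statistics.median(spend)  # 0.0 -- the median ignores the buyers entirely
mu = statistics.fmean(spend)    # small but positive, and sensitive to the tail
print(med, round(mu, 3))
```

The median carries no information at all in this regime, while the mean at least responds to the buyers.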

So the question is: is there a better estimator than the mean of $Y_i$ and $Z_i$ for maximizing the expected $\Sigma X_i$? The word "better" in this context refers to the sense of gambling where the best bet is the best.

Keith
  • You lost me in the first sentence with "maximal expected." If the $X_i$ are a sample then its mean has an *expectation* which is a *number*: what is "maximal" about that? In the following sentences you jump right in with references to "revenue" and "customers" without providing the context to understand how they might be related to the $X_i$. I'm afraid I stopped reading right there and suspect many others who otherwise might be able to help you would be similarly disinterested. Do you think you could edit this post to help people understand what you are doing? – whuber Jul 11 '14 at 15:25
  • Sorry, there is a large jargon gap here between differing fields. I still think the revenue example adds clarity but I have added more to help clarify my question. – Keith Jul 11 '14 at 23:22
  • If I'm understanding you correctly, this is analogous to having several choices to sample from. At each round you select a choice to sample from. The choice you selected will yield a reward. You want to maximize this reward. Each choice has an unknown distribution and parameters. This is the multi-armed bandit problem, which is essentially solved. If you have side information about each choice, this is the contextual bandit problem. – Jessica Collins Jul 11 '14 at 23:37
  • Yes, I think this is the answer except there is only one round and the reward is $\Sigma X_i$ where the elements are drawn from your choice. The information about each choice is given above. They are exponentially decreasing, most elements are 0 and pdfs are discrete in a complex way. If you post your comment as an answer and include some mention of the specifics in this particular context I will choose it as the answer. – Keith Jul 12 '14 at 08:57
  • There is another modification to the multi-armed bandit problem which makes the customer revenue example more fitting. A customer can purchase more than one item but a slot machine has one payoff per pull. Also, the fact that you know the previous million results and have to choose and stick with one machine based on that information makes it seem to be a different problem to me since in the multi-armed bandit problem you can change your choice. – Keith Jul 12 '14 at 10:51
  • Your vocabulary here is a bit foreign to me. Could you edit your original question to include data dictionary (what do the rows and columns in your design matrix represent), along with what you want a model to do? – Jessica Collins Jul 12 '14 at 13:25
  • @JacobMick I am not in BI but after some research I think I can translate for the example I give in the question. A split test is an A/B test and in this case the two variants are called $Y$ and $Z$. They could be represented as two tables with one column each where the rows represent each customer and the values are the dollars spent. The question is which estimator (eg $Y.mean()$) is best to decide which group ($Y$ or $Z$) will produce highest future profit. This would be represented as $X.sum()$ where the data $X$ is the spend for each customer collected after the decision is made. Clear? – Keith Jul 12 '14 at 17:44
  • So you have historical purchase data of N customers. You want to select which customers are the most profitable? To what end? Are you trying to direct marketing efforts? This doesn't appear to be a multi-armed bandit problem. $Y$ and $Z$ are not stationary distributions you're sampling from. In fact $Y$ and $Z$ are time-series of purchasing events. They could be summed over weekly periods then traditional time-series forecasting methods could project into the future. You could attempt to measure the impact of marketing efforts. Is this close? – Jessica Collins Jul 12 '14 at 20:49
  • Yep pretty much. The historical purchase data is for two sets of customers exposed to different A/B test variants $Y$ and $Z$ (eg changes in marketing). I want to know which variant is better based on the data sets and the prior information I gave in the question. One of the variants will be chosen as a standard for the future and clearly I want to maximize future revenue. I am open to time series analysis but I don't see how it would add anything. It is close enough to a multi-armed bandit problem that I am making some progress with a variation of Thompson sampling. – Keith Jul 13 '14 at 00:31
  • @Keith, feel free to answer your own question. I'd be interested in what you've learned. – Jessica Collins Jul 17 '14 at 03:02
  • So far all I have proven is that it is the same as in the case where you sample from a discrete set and that I can take that set as Bayesian prior information. I have then asked the simplified question here to keep the ball rolling. http://stats.stackexchange.com/questions/107848/drawing-numbered-balls-from-an-urn – Keith Jul 17 '14 at 07:26
  • "where the best bet is the best" doesn't define anything, since you don't define what makes a "best bet" *best*. It just shifts the "what does 'best' mean" problem around (different bets will be better as you change your utility function/loss function) – Glen_b Feb 20 '16 at 07:22
  • Ugg I swear you know what I mean... Over many trials you are most likely to be correct. – Keith Feb 21 '16 at 07:56
  • You mention X in the first paragraph and in the second one you come up with Y and Z, how are the latter two related to the sum of the X? –  Feb 22 '17 at 06:50
  • The best way to think of it is in terms of AB testing. The A and B for the test would be Y and Z. Given Y and Z you want to choose a data source or underlying pdf to sample from. That new sample would be X. Once you have made your decision based on Y and Z you get the X data set sampled from the pdf of your choice. You want to maximize the sum of X. – Keith Feb 22 '17 at 21:28

1 Answer


There are two ways to approach this problem.

First: The canonical approach is to break the problem down into several subproblems. For example, suppose the underlying distribution you are sampling from is similar to the convolution of a binomial and a Gaussian. For something like car purchases, most people don't buy a car (buy Y/N is binomial), but those who do tend to buy at a price which is approximately Gaussian. You can then use Fisher's exact test for the binomial part and Welch's t-test for the Gaussian part. If both imply the same sample is better, you are done. If not, you have to find a way to combine them. Another issue is that both underlying distributions are assumptions. This means they are wrong to some degree and may be missing something important.
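A sketch of this two-part test using scipy; the simulated buy rates and prices below are hypothetical stand-ins for real split-test data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical two-part data: a binomial buy/no-buy part and, for buyers,
# an approximately Gaussian purchase price.
def simulate(n, p_buy, mean_price, sd_price):
    buys = rng.random(n) < p_buy
    prices = rng.normal(mean_price, sd_price, n)[buys]
    return buys, prices

buys_y, prices_y = simulate(50_000, 0.040, 100.0, 15.0)
buys_z, prices_z = simulate(50_000, 0.045, 103.0, 15.0)

# Part 1: do the buy rates differ?  Fisher's exact test on the 2x2 table.
table = [[buys_y.sum(), (~buys_y).sum()],
         [buys_z.sum(), (~buys_z).sum()]]
odds, p_rate = stats.fisher_exact(table)

# Part 2: given a purchase, do the prices differ?  Welch's t-test
# (equal_var=False allows unequal variances).
t, p_price = stats.ttest_ind(prices_y, prices_z, equal_var=False)

print(round(p_rate, 4), round(p_price, 4))
```

If both tests favor the same variant the decision is easy; if they disagree you need to weigh buy rate against price, which is where the combination problem mentioned above comes in.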

Second: There is another method which is insensitive to outliers and needs no knowledge of the underlying distribution: the Mann–Whitney U test. It is competitive with the standard t-tests in power and makes far fewer assumptions.
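A minimal example with `scipy.stats.mannwhitneyu`, again on made-up zero-inflated spend data; scipy's tie correction handles the large mass at \$0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical zero-inflated spend samples for the two variants.
def spends(n, p_buy, scale):
    return np.where(rng.random(n) < p_buy, rng.exponential(scale, n), 0.0)

y = spends(20_000, 0.04, 25.0)
z = spends(20_000, 0.05, 25.0)

# Two-sided test of H0: the two distributions are identical.  No parametric
# assumptions; ties (all the zeros) are handled by the tie-corrected
# normal approximation.
u_stat, p_value = stats.mannwhitneyu(y, z, alternative="two-sided")
print(round(p_value, 4))
```

One caveat: with >95% of the mass at \$0, the U test is driven largely by the difference in buy rates rather than by the spend amounts, so it answers "are the distributions different?" rather than "which has the larger mean?".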

My final plan is to use both of these methods and see whether they tell a consistent story. If they are inconsistent, I will investigate further and try to find out why.

TL;DR: If your underlying distribution is not fully known, use the Mann–Whitney U test.

Keith