Given the following situation, can an accurate total be calculated?

A document is downloaded from my website.

  • 82% of people use Microsoft Office, which I can't track.
  • 18% of people use OpenOffice or LibreOffice, which I can track when the document is read.
  • There are 5000 total downloads (so 4100 use MSOffice and 900 use OpenOffice).
  • About 90% of people who use OpenOffice actually open and read the file.

Can I accurately calculate how many TOTAL people read the document if there's an 82% gap in the data? What if 95% of people used MSOffice?

Can a confidence interval be calculated, assuming MSOffice users behave the same way as OpenOffice users?

(This is not homework; this is a real business situation for me.)

  • It depends on what you mean by "accurate." Certainly you can narrow the total to a value between $0.90 \times 900 + 0=810$ and $0.90\times 900 + 4100=4910$, but to get it any narrower you need to make some (strong, difficult to justify) assumptions about how MS Office users compare to other users. – whuber Jan 30 '17 at 16:34
  • Under the assumption that all users behave the same, you can compute a confidence interval for P(reading|download) using the 90% out of 900. It's called a binomial confidence interval. This interval will be quite narrow around 0.9 (because 900 is a big number). Then you can predict the number of people reading out of 5000: it will be 5000*0.9=4500. You can compute a prediction interval around that too. See http://stats.stackexchange.com/questions/255570. – amoeba Feb 01 '17 at 17:11

2 Answers

As I understand it, there is a business decision that depends on the total number of file openings which in turn depends on the unknown fraction of Word users who open the downloaded file. The optimal decision should take this uncertainty into account.

As others have noted, sampling theory does not apply here and a confidence interval cannot be computed. Nevertheless, a decision must be made, even if that decision is simply to acquire more information for the time being.

Let me state the problem a bit more formally. You are uncertain about the fraction $\alpha$ of Word users who open the file. Your "sample" of 5000 tells you nothing directly about this fraction. As noted in the comments, the fraction could be anywhere between 0 and 1, producing $$ T = 810 + 4100\,\alpha $$ total file openings. Since this is a real business situation, you must bring whatever additional information you can to the problem. Rather than compute a "confidence interval" based on a sampling distribution (which, as noted by others, is not possible), you can compute a probability distribution for $\alpha$ based on whatever you may know from any source. Let $f(\alpha)$ denote that distribution.
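The relation between $\alpha$ and $T$ can be sketched in a few lines of Python. This is only an illustration: the counts 810 and 4100 come from the question, while the function name `total_openings` is a hypothetical helper introduced here.

```python
# Counts taken from the question; the helper name is illustrative only.
OBSERVED_OPENS = 810        # 90% of the 900 tracked (OpenOffice) downloads
UNTRACKED_DOWNLOADS = 4100  # MS Office downloads that cannot be tracked


def total_openings(alpha):
    """Total file openings T = 810 + 4100 * alpha for a given alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a fraction between 0 and 1")
    return OBSERVED_OPENS + UNTRACKED_DOWNLOADS * alpha


# The two extremes bound T; intermediate values depend entirely on alpha.
for alpha in (0.0, 0.5, 0.9, 1.0):
    print(f"alpha = {alpha:.1f} -> T = {total_openings(alpha):.0f}")
```

Nothing in the data picks out one of these values of $\alpha$ over another; that choice has to come from outside information.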

Once you have that distribution (more on this later), you can compute the distribution for $T$ by the change of variables formula: $$ p(T) = \frac{f\big((T-810)/4100\big)}{4100} . $$ You can use this distribution in your decision problem: It characterizes your uncertainty about $T$. You can compute the mean for example. Or you can compute an interval that contains 90 percent of the probability, either by putting 5 percent in each tail or by finding the shortest interval that contains 90 percent (the highest density interval). This is not a "confidence interval." Instead, it characterizes where $T$ is likely to be.
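As a rough sketch of how such summaries might be obtained, the following Python code simulates $T$ by drawing $\alpha$ from an assumed distribution. The choice of $\textsf{Beta}(18, 2)$ is purely a hypothetical stand-in for $f(\alpha)$, and the function names are illustrative; Monte Carlo sampling is used here instead of the closed-form density only because it needs nothing beyond the standard library.

```python
import random

OBSERVED_OPENS = 810        # tracked openings, from the question
UNTRACKED_DOWNLOADS = 4100  # MS Office downloads, from the question


def sample_totals(a, b, n=50_000, seed=0):
    """Draw alpha ~ Beta(a, b) and return the implied totals T, sorted."""
    rng = random.Random(seed)
    return sorted(
        OBSERVED_OPENS + UNTRACKED_DOWNLOADS * rng.betavariate(a, b)
        for _ in range(n)
    )


def equal_tail_interval(sorted_draws, mass=0.90):
    """Interval that leaves (1 - mass) / 2 of the draws in each tail."""
    n = len(sorted_draws)
    tail = (1.0 - mass) / 2.0
    return sorted_draws[int(tail * n)], sorted_draws[int((1.0 - tail) * n) - 1]


def shortest_interval(sorted_draws, mass=0.90):
    """Narrowest window of consecutive sorted draws holding `mass` of them."""
    n = len(sorted_draws)
    k = int(mass * n)  # number of draws the window must contain
    start = min(
        range(n - k + 1),
        key=lambda i: sorted_draws[i + k - 1] - sorted_draws[i],
    )
    return sorted_draws[start], sorted_draws[start + k - 1]


# Beta(18, 2) is an assumed, illustrative prior with mean 0.9.
totals = sample_totals(a=18, b=2)
print("mean of T:", sum(totals) / len(totals))
print("equal-tail 90% interval:", equal_tail_interval(totals))
print("shortest 90% interval:", shortest_interval(totals))
```

Every number this prints is driven by the assumed prior; a different $f(\alpha)$ gives a different interval.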

The question remains as to how to come up with $f(\alpha)$. One of the comments suggests that "strong assumptions that are difficult to justify" are required. One interpretation of this comment is that whatever you put into $f(\alpha)$ will directly affect $p(T)$. This is true, but that's the situation you face and you must use whatever information you have (or decide to acquire). If you are the decision maker, then you are the only one to whom the assumptions must be justified.

Notice that increasing the sample size above 5000 adds no information about $\alpha$. Also note that changing the fraction of Word users from 82 percent to 95 percent does not affect what is known about $\alpha$. It does, however, increase the effect of the uncertainty about $\alpha$ on the uncertainty about $T$.
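To make the last point concrete, here is a small Python sketch that compares the range of possible totals when 82 percent versus 95 percent of downloads are untracked. The 90 percent open rate among tracked users is taken from the question; the function name `total_range` is a hypothetical helper.

```python
def total_range(downloads, untracked_share, tracked_open_rate=0.90):
    """Return (lowest, highest) possible total openings when alpha is unknown.

    The lowest value assumes no untracked user opens the file (alpha = 0),
    the highest assumes all of them do (alpha = 1).
    """
    untracked = round(downloads * untracked_share)
    tracked = downloads - untracked
    observed = tracked_open_rate * tracked
    return observed, observed + untracked


for share in (0.82, 0.95):
    low, high = total_range(5000, share)
    print(f"untracked share {share:.0%}: T between {low:.0f} and {high:.0f} "
          f"(width {high - low:.0f})")
```

The width of the range equals the number of untracked downloads, so it grows with the untracked share even though nothing new has been learned or lost about $\alpha$ itself.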

Consider the following way to begin assessing the distribution for $\alpha$. Since 90 percent of OpenOffice users open the file, it may be reasonable to assume that Word users are similar in their behavior. This assumption suggests that $f(\alpha)$ has most of its probability in the "neighborhood" of 90 percent. For example, a beta distribution might provide a useful characterization of your knowledge. In particular, let $$ f(\alpha) = \textsf{Beta}(\alpha|a,b). $$ Then the mean and standard deviation of $\alpha$ are given by $$ \frac{a}{a+b} \qquad\text{and}\qquad \frac{\sqrt{a\,b}}{(a+b)\,\sqrt{a+b+1}} . $$ You could solve $a/(a+b) = 9/10$ for $b = a/9$ and use $a$ to control the uncertainty around the mean of 9/10.
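A minimal Python sketch of this parameterization follows. The particular values of $a$ tried below are arbitrary illustrations, not recommendations, and the function name `beta_mean_sd` is hypothetical.

```python
import math

UNTRACKED_DOWNLOADS = 4100  # MS Office downloads, from the question


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = math.sqrt(a * b) / ((a + b) * math.sqrt(a + b + 1))
    return mean, sd


# Fix the mean at 9/10 by setting b = a / 9; larger a means less uncertainty.
for a in (9, 18, 90):
    b = a / 9
    mean, sd = beta_mean_sd(a, b)
    # T = 810 + 4100 * alpha is linear in alpha, so sd(T) = 4100 * sd(alpha).
    print(f"a = {a:>2}, b = {b:>4.1f}: mean = {mean:.3f}, "
          f"sd(alpha) = {sd:.4f}, sd(T) = {UNTRACKED_DOWNLOADS * sd:.0f}")
```

Choosing $a$ then amounts to deciding how far from 90 percent you believe the Word users' open rate could plausibly be.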

I leave off here. My answer is merely intended to suggest a way of thinking about the problem that may not have occurred to you.

mef

In short: no, an accurate calculation is not possible, and no, a confidence interval cannot be calculated either.

"Accurate" implies that you know the numbers, which you state is impossible. Without numbers, there is no calculation.

An estimate, however, needs less. You know two things for sure:

  • $N_d$, the number of downloads
  • $N_o$, the number of (unique) documents opened with OpenOffice

By the way, how do you know that 90% of all OpenOffice users open your file?

You can be sure that at least the fraction $N_o/N_d$ of downloaders has opened your file (with the figures in the question, $810/5000 = 16.2\%$). The true value has to be equal or larger; that is simple logic. So you can say with a confidence level of 100% that at least this fraction has opened the file.

If you don't know more, you shouldn't assume more. There is no way to prove or, more importantly, disprove your assumption(s).

And another word of caution: opening the file doesn't mean reading it. Some people download and open a file inadvertently.

cherub
  • I would not be so strict about not assuming things that can't be proven. In this case I can totally imagine a context where it is very safe to assume that the editor used has no impact on the willingness to open the downloaded file. Actually, I find it harder to imagine a context where it does. – psarka Feb 01 '17 at 17:10
  • This is not a helpful answer. Under OP's very reasonable assumption one can compute everything that OP wants to compute. – amoeba Feb 01 '17 at 17:14
  • The OP asked if there is an accurate calculation. It may be strict, but accurate means "no bias". And ultimately a confidence interval will be completely dependent on the underlying assumption, which cannot even be verified. – cherub Feb 02 '17 at 09:22