Can I use a Bayes factor to compare cells in a contingency table?

Question

I'm trying to compare which features of a website are used by more active and by less active users. I've divided the users into "active" and "inactive", and there are several page types they can visit. So now I have a contingency table like this:

user type | feature1 | feature2 | feature3 | feature4 | feature5
----------|----------|----------|----------|----------|---------
active    |     1000 |     2000 |     3000 |     4000 |     5000
inactive  |    50000 |    40000 |    30000 |    20000 |    10000

So now I want to figure out which features are over-represented in the usage patterns of active users compared to inactive users.

Is comparing the conditional probability of cells in each column a reasonable way to do this, e.g.

P(feature5 | active) / P(feature5 | inactive) = (1/3) / (1/15) = 5

So in this case active users seem to be 5 times more likely to make use of feature5.
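The arithmetic above can be reproduced for every column at once. The following is a minimal Python sketch, assuming the counts in the table are page views per feature; the names `counts`, `conditional_probs`, and `ratios` are illustrative only.

```python
from fractions import Fraction

# Counts copied from the table above (feature1..feature5).
counts = {
    "active":   [1000, 2000, 3000, 4000, 5000],
    "inactive": [50000, 40000, 30000, 20000, 10000],
}

def conditional_probs(row):
    """Estimate P(feature | user type) as count / row total."""
    total = sum(row)
    return [Fraction(c, total) for c in row]

p_active = conditional_probs(counts["active"])
p_inactive = conditional_probs(counts["inactive"])

# Ratio P(feature | active) / P(feature | inactive) for each column.
ratios = [a / i for a, i in zip(p_active, p_inactive)]

for k, r in enumerate(ratios, start=1):
    print(f"feature{k}: ratio = {float(r):.2f}")
```

Exact fractions are used so that the ratio for feature5 comes out as exactly 5 rather than a rounded float.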

Is that a fair interpretation of the odds, and if not, what are the problems with that interpretation?

A. Donda · Accepted Answer · 2015-07-29T05:32:38.743

You can certainly look at ratios between probabilities, estimated by ratios between counts.

However, I wouldn't call those ratios odds, because this term is usually reserved for the ratio between the probability that a particular event will take place vs the probability that it won't, given some circumstances. The way you write it, you are looking at the ratio between the probabilities for the same thing ("feature5") given different circumstances ("active user" vs "inactive user").
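To make the distinction concrete, here is a small Python sketch that computes both quantities for "feature5" from the question's counts. The helper `odds` and the variable names are assumptions introduced for illustration.

```python
from fractions import Fraction

def odds(p):
    """Odds of an event with probability p: P(event) / P(not event)."""
    return p / (1 - p)

# Estimated from the question's table: 5000 of 15000 and 10000 of 150000.
p_f5_active = Fraction(5000, 15000)      # P(feature5 | active)
p_f5_inactive = Fraction(10000, 150000)  # P(feature5 | inactive)

# The quantity from the question: a ratio of two probabilities.
probability_ratio = p_f5_active / p_f5_inactive

# Odds compare "feature5" with "not feature5" within one user group.
odds_active = odds(p_f5_active)
odds_inactive = odds(p_f5_inactive)

# The odds ratio compares those odds across the two groups.
odds_ratio = odds_active / odds_inactive

print("probability ratio:", probability_ratio)
print("odds (active):", odds_active, " odds (inactive):", odds_inactive)
print("odds ratio:", odds_ratio)
```

The two summaries differ numerically (5 versus 7 here), which is one practical reason to keep the terms apart.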

I wouldn't talk about a Bayes factor either, because that term refers to the ratio between the probabilities of different models given some data. Aside from the fact that you don't formulate models, again the conditioning is in the wrong direction: you vary the "data" where you should hold it fixed, and you keep the "model" constant where you should vary it.

Update: I assume you want to test the null hypothesis that the probability for "feature5" is the same for "active" and for "inactive" users. This question can be reformulated: Is there a significant association between "feature5"/"not feature5" and "active"/"inactive"? This can be tested using Fisher's exact test. In this case, the relevant contingency table is

          feature5  not feature5
active      5000       10000
inactive   10000      140000

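Using Matlab's implementation of the test, fishertest, I get an output of $p = 1.97 \cdot 10^{-323}$. That value sits at the lower limit of what double-precision arithmetic can represent, so it is best read as "effectively zero" rather than as an exact probability.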
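For readers without Matlab, the test can be sketched in plain Python. This is not Matlab's `fishertest`; it is a minimal log-space version that assumes the common two-sided convention (sum the hypergeometric probabilities of all tables that are no more probable than the observed one) and takes the inactive/feature5 count of 10000 from the question's table. The function names `log_choose` and `fisher_log10_p` are hypothetical.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_log10_p(a, b, c, d):
    """log10 of the two-sided Fisher exact p-value for [[a, b], [c, d]].

    Sums hypergeometric probabilities over all tables with the same
    margins that are no more probable than the observed table.
    """
    r1, c1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, r1 - (n - c1)), min(r1, c1)
    log_denom = log_choose(n, r1)

    def logp(x):
        # Log-probability that the top-left cell equals x.
        return log_choose(c1, x) + log_choose(n - c1, r1 - x) - log_denom

    log_obs = logp(a)
    # Small tolerance so that ties with the observed table are included.
    terms = [lp for lp in map(logp, range(lo, hi + 1)) if lp <= log_obs + 1e-7]
    m = max(terms)
    # Log-sum-exp keeps the sum stable even when every term underflows.
    log_p = m + math.log(sum(math.exp(t - m) for t in terms))
    return log_p / math.log(10)

log10_p = fisher_log10_p(5000, 10000, 10000, 140000)
print(f"log10(p) = {log10_p:.1f}")
```

Working with the logarithm avoids the underflow that any double-precision p-value would hit for counts this large. If SciPy is available, `scipy.stats.fisher_exact` offers a ready-made implementation that returns the p-value as an ordinary float.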

edited Jul 29 '15 at 05:32

answered Jul 29 '15 at 03:39


Thanks! That makes sense concerning the terminology. One question I still have, though, is about the significance of the difference of these ratios. Is the ratio of 5 I use in my example significant? Or perhaps a better question is: what model could I use for the count data to make the kind of comparisons I want to make? – 2daaa Jul 29 '15 at 04:20
@Ranjit, see my update. – A. Donda Jul 29 '15 at 05:32
Actually, I have a question about the use of the Fisher exact test here. According to that Wikipedia article, the test assumes the data are drawn from a hypergeometric distribution. Doesn't that imply the assumption that there is a fixed number of page views being allocated by each category of users? It clearly makes sense for the example in the article, where there is a fixed number of subjects in the experiment, but I wonder how that assumption might bias the results in this case. Thanks for your help! – 2daaa Jul 29 '15 at 17:16
@Ranjit, later on the Wikipedia article states, "Another early discussion revolved around the necessity to condition on the marginals. Fisher's test gives exact p-values both for fixed and for random marginals. Other tests, most prominently Barnard's, require random marginals. Some authors (including, later, Barnard himself) have criticized Barnard's test based on this property. They argue that the marginal totals are an (almost) ancillary statistic, containing (almost) no information about the tested property." – A. Donda Jul 29 '15 at 18:50
I'm not familiar with the details of this discussion, but in my understanding the formal constraint of fixed marginals is not a reason not to use Fisher's exact test. The concern is not about the validity, but about the power of the test. But if you're worried about this, consider using [Barnard's test](https://en.wikipedia.org/wiki/Barnard%27s_test) instead, or – A. Donda Jul 29 '15 at 18:51
Or look at Pearson's $\chi^2$-test and other alternatives discussed [here](http://stats.stackexchange.com/a/14230/17023). They also become relevant if you are interested in testing whether "active"/"inactive" has an effect on the use of all features in general (your original 2x5 contingency table). – A. Donda Jul 29 '15 at 18:59
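
As a rough illustration of the $\chi^2$ route on the full 2x5 table, the statistic can be computed directly from the question's counts. This is a plain-Python sketch with hypothetical names (`observed`, `pearson_chi2`); the closed-form tail used for the p-value applies only to 4 degrees of freedom, which is what a 2x5 table gives.

```python
import math

# Counts from the question's 2x5 table.
observed = [
    [1000, 2000, 3000, 4000, 5000],       # active
    [50000, 40000, 30000, 20000, 10000],  # inactive
]

def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for Pearson's chi-square test."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

stat, dof = pearson_chi2(observed)

# For dof = 4 the chi-square tail is exp(-x/2) * (1 + x/2); take its log10
# so that a very small p-value does not underflow to zero.
assert dof == 4
log10_p = (-stat / 2 + math.log1p(stat / 2)) / math.log(10)

print(f"chi2 = {stat:.1f}, dof = {dof}, log10(p) = {log10_p:.1f}")
```

With SciPy installed, `scipy.stats.chi2_contingency` performs the same computation for tables of any shape.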
