Ratio that accounts for different sample sizes

Question

I'm trying to get a measure for a set of data that indicates someone's "spam score". Essentially, the higher the spam score, the more likely they are to be spammers.

Right now, I'm measuring a person's spam score as a ratio: bad posts/total posts. If this ratio is high, then they're likely to be spammers.

The difficulty I'm having is comparing different people. For example, a person with 6/8 bad posts is not as likely to be a spammer as someone with 600/800 bad posts (the person who's made 600 bad posts is clearly a spammer, but the one who's made 6 has not proven themselves to be a spammer to that extent).

However, right now they are being assigned the same spam score of 6/8 = 0.75. Is there a way I can account for the size of the sample in my spam score?

This is a great situation for some type of Bayesian analysis. You can start everybody with a prior probability of being a spammer that is equal to the estimated proportion of spammers in the population. Then update the probability that they are a spammer w/ each new post. This will account for the uncertainty associated w/ small numbers of posts. — gung - Reinstate Monica, Feb 06 '14 at 21:25
@gung I agree with you, but I'm guessing the OP is looking for a basic solution and doesn't have much experience with Bayesian statistics. Of course, you can use a Beta prior so the results can be updated without Monte Carlo, so maybe that wouldn't be any harder to accommodate than the $z$-score solution I provided below. — Sam Dickson, Feb 06 '14 at 21:57

Sam Dickson · Accepted Answer · 2014-02-06T21:58:33.327

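A simple way to do this is to use a test that the proportion is greater than some threshold--for example, that your proportion, $p$, is greater than 0.5. The most basic one-sample proportion test is $$z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$ where $\hat{p}$ is the observed proportion of bad posts, $p_0$ is the threshold you are comparing against, and $n$ is the total number of posts.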

The $z$ score can then be tested in the usual way to get a $p$-value.

For instance, in your example if you had 600 out of 800 posts and you wanted to compare to 0.5, then you would calculate $$\frac{0.75-0.5}{\sqrt{\frac{0.5^2}{800}}}=14.14$$

If we want to compare 600 out of 800 to 6 out of 8, note that the only difference is the $n$ in the denominator. Since $n$ is 100 times smaller, $z$ shrinks by a factor of $\sqrt{100}=10$, so we'll just cheat and calculate 14.14/10 = 1.414.

If we plug these two numbers into our handy normal probability calculators, we'll get one-sided $p$-values of basically 0 for the first one and 0.0787 for the second.

This test is not perfect, but it will probably work just fine for your problem. All that's left would be calibrating it with the values of $p_0$ and $\alpha$ (what you'll compare your $p$-value to) that will give you satisfactory results.
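
For anyone who wants to reproduce the numbers above, here is a minimal Python sketch of the same calculation. The function names `z_score` and `upper_tail_p_value` are made up for illustration, and the default threshold `p0=0.5` is just the value used in the example; the upper-tail normal probability is computed with the standard library's `math.erfc`.

```python
import math

def z_score(bad, total, p0=0.5):
    """One-sample proportion z statistic against the threshold p0."""
    p_hat = bad / total
    standard_error = math.sqrt(p0 * (1 - p0) / total)
    return (p_hat - p0) / standard_error

def upper_tail_p_value(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Same bad-post ratio, very different sample sizes.
for bad, total in [(600, 800), (6, 8)]:
    z = z_score(bad, total)
    p = upper_tail_p_value(z)
    print(f"{bad}/{total}: z = {z:.3f}, one-sided p = {p:.4f}")
```

Either the $z$ score or the $p$-value could serve as the spam score; both pull small samples toward "not yet proven" while keeping the same ratio for large samples clearly significant.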

+1, This is a perfectly viable approach. I was actually thinking about using a conjugate prior above. Another hack would be to start everyone off with a fixed number of good & bad posts, to initially bias them towards non-spammer. This isn't exactly valid / ideal, but quite similar to the Bayesian approach & easy for anyone to implement. — gung - Reinstate Monica, Feb 06 '14 at 22:04
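
The "fixed number of good & bad posts" hack from the comment above could be sketched like this. The starting counts `prior_bad=1` and `prior_good=9` are purely hypothetical values chosen for illustration (they would need tuning to the real proportion of spammers), and `smoothed_spam_score` is a made-up name. With a Beta prior, this same formula is the posterior mean.

```python
def smoothed_spam_score(bad, total, prior_bad=1, prior_good=9):
    """Bad-post ratio after adding hypothetical starting counts to everyone.

    prior_bad and prior_good are assumed pseudo-counts, not values from the thread.
    """
    return (bad + prior_bad) / (total + prior_bad + prior_good)

# The starting counts dominate small samples and barely affect large ones.
for bad, total in [(6, 8), (600, 800)]:
    print(f"{bad}/{total}: smoothed score = {smoothed_spam_score(bad, total):.3f}")
```

A user with few posts stays close to the starting ratio, while a user with hundreds of posts ends up near their raw ratio.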
