
My data consists of occurrences of words in time windows. E.g.:

Day; Word; Frequency
1; "dog"; 45
1; "cat"; 2
...
2; "dog"; 90
2; "cat"; 4
...

I would like to estimate the ratios of all day-to-day differences (i.e., for dog, day 1->2: (90-45)/45 = 100%). For cat the increase is also 100%, but due to the small sample size I would like to somehow quantify that it is "less trustworthy".
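For concreteness, here is a minimal sketch of the day-to-day ratio I mean, using the sample counts from the table above (the dictionary layout is just for illustration):

```python
# Day-1 and day-2 counts from the example table.
counts = {"dog": [45, 90], "cat": [2, 4]}

def pct_change(day1, day2):
    """Relative day-to-day change: (day2 - day1) / day1."""
    return (day2 - day1) / day1

for word, (d1, d2) in counts.items():
    print(f"{word}: {pct_change(d1, d2):.0%}")
# dog: 100%
# cat: 100%
```

Both words come out at 100%, which is exactly why I want some confidence-aware weighting.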

Something similar (for binomial data) is proposed here:

http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

But with count data it's not quite the same...

Any ideas are most welcome.

Stephan Kolassa
  • Could you not just switch from the binomial confidence interval to the Poisson? Something like $\frac{1}{2} \chi^2(\alpha; 2k)$ for your $\alpha$ lower bound and then compare whether $\lambda$ has changed between yesterday and today? – Corvus Sep 27 '13 at 10:29
  • Also what are you going to do when a word has a zero count for a day? Infinite % increase? – Corvus Sep 27 '13 at 10:33
  • Thanks for your answer. So, should I try some kind of Poisson regression on the word counts without bothering about the magnitude of the number of occurrences, and then compare the (inferred) consecutive lambdas? – dimitris fekas Sep 27 '13 at 11:45
  • That sort of depends on what you actually want to do with this data - what are you wanting to show/find/infer? – Corvus Sep 27 '13 at 12:12
  • This is for a web application, I only need some kind of "weighted" measure of the daily increase/decrease. So some way to take into account the confidence for each increase (as in dog/cat example in the main post). Ideally, the infinities you mentioned in your previous comment should be dealt with as well. Thanks. – dimitris fekas Sep 27 '13 at 12:26
  • The final goal is to rank the words daily, according to their percentages of increase (and not ranking cat & dog equally) – dimitris fekas Sep 27 '13 at 12:34

1 Answer


To keep things really simple, you could consider using a ratio inspired by the mean and standard deviation, a bit like a z-score.

If you assume that the counts for two days, $X_1$ and $X_2$, are Poisson random variables with means $\lambda_1$ and $\lambda_2$ respectively, then the change in word count follows a Skellam distribution, with mean $\lambda_2-\lambda_1$ and variance $\lambda_2+\lambda_1$.

Taking simple point estimates, I think it would therefore be reasonable to construct:

$\mathrm{Score} = \frac{X_2 - X_1}{\sqrt{X_2+X_1}}$

So in your example,

$\mathrm{Score_{dog}} = \frac{45}{\sqrt{135}} = 3.87$

$\mathrm{Score_{cat}} = \frac{2}{\sqrt{6}} = 0.816$

You could consider more sophisticated inference if you have a strong idea of what you really want to detect, but based on your description I think the above will be nice and simple and capture roughly the behaviour you want.
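A minimal sketch of this score in Python, reproducing the dog/cat numbers above (the function name is just for illustration):

```python
from math import sqrt

def skellam_score(x1, x2):
    """(x2 - x1) scaled by the point estimate of its standard deviation:
    the Skellam variance is lambda1 + lambda2, estimated here by x1 + x2."""
    return (x2 - x1) / sqrt(x1 + x2)

print(round(skellam_score(45, 90), 2))  # dog: 3.87
print(round(skellam_score(2, 4), 3))    # cat: 0.816
```

Note that you would still need to decide what to do when both counts are zero (the denominator vanishes), e.g. by defining the score as 0 in that case.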

Corvus
  • I thought about something like this as well; I am just wondering what would be the theoretically best way to combine "Score" with the actual percentage. So, again in the cat/dog example: suppose the data for dog, day 2, is 88 instead of 90. Then the ratio will actually be smaller than cat's 100%, but I would still like to rank it higher. – dimitris fekas Sep 27 '13 at 12:51
  • @dimitrisfekas That's exactly what this would do? If it was 88 instead then the score for dog would be $\frac{88-45}{\sqrt{88+45}} = 3.73$ which is still a lot higher than 0.816 for cat? Cat would only outrank dog if dog's second day was 53 or lower ($\frac{53-45}{\sqrt{53+45}} = 0.808$) – Corvus Sep 27 '13 at 12:58