
I'm trying to test whether a dataset follows Benford's Law (https://en.wikipedia.org/wiki/Benford%27s_law), which predicts what proportion of the values in a data set should have each first significant digit (FSD), i.e. start with 1, 2, ..., 9.

Here's some actual data.

1       2       3       4       5       6       7       8       9       FSD
0.301   0.176   0.125   0.097   0.079   0.067   0.058   0.051   0.046   Benford
0.305   0.179   0.126   0.098   0.077   0.064   0.057   0.049   0.046   Observed
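For reference, the Benford row above can be reproduced from the usual statement of the law, P(d) = log10(1 + 1/d). This is only a minimal sketch; the variable name `benford` is illustrative and not part of the original post.

```python
import math

# Benford's Law: P(first significant digit = d) = log10(1 + 1/d) for d = 1..9
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Print each digit with its expected proportion, rounded as in the table
for d, p in enumerate(benford, start=1):
    print(d, round(p, 3))
```

Rounded to three decimals, these values agree with the Benford row of the table.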

As you can see, the observed data is SO close to what Benford expects. I'm trying to argue that Benford is a good model for expectations, but the standard Chi-squared says it is not a good match, since this particular observed data is over 25,000 points. Essentially, the large size of my data set makes the frequency difference look huge. Yet obviously, Benford's Law is a perfect model for this data.

My question: is it statistically correct to do chi-squared with the proportions instead of the frequencies? I know it can be done (I read "Can chi square be used to compare proportions?"), but I'm more concerned that reviewers of my paper will say that's incorrect.
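A minimal sketch of what changes when proportions are used instead of frequencies: the Pearson statistic computed on counts is exactly n times the same sum computed on proportions, so the proportion version ignores the sample size and cannot be referred to a chi-squared distribution as is. The observed proportions are the ones from the table above; the sample sizes in the loop and the helper names `pearson_stat` and `stat_for_n` are assumptions made for illustration only.

```python
import math

# Expected proportions under Benford's Law
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Observed proportions copied from the table above (rounded, so they sum to ~1.001)
observed_props = [0.305, 0.179, 0.126, 0.098, 0.077, 0.064, 0.057, 0.049, 0.046]


def pearson_stat(observed, expected):
    """Pearson sum of (O - E)^2 / E over the nine digits."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# The same formula applied to proportions: this number does not depend on n
stat_props = pearson_stat(observed_props, benford)


def stat_for_n(n):
    """Pearson statistic on counts for a hypothetical sample size n."""
    counts = [n * p for p in observed_props]
    expected = [n * e for e in benford]
    return pearson_stat(counts, expected)


# Illustrative sample sizes (assumptions, not values from the post)
for n in (1_000, 19_500, 196_000):
    print(n, stat_for_n(n), n * stat_props)
```

The two printed columns agree for every n, which shows that the count-based statistic grows linearly with the sample size while the proportions stay fixed.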

Jay
  • 3
    Your last sentence is puzzling. Chi-square tests of the kind discussed here may easily be recast or presented as testing a hypothesis of some specified set of proportions, but the principle remains at root that they are for counts. If there is debate on this, it's misinformed. Incidentally, testing Benford's Law depends on the data spanning various orders of magnitude. If you have data on adult female height in inches, your first digits are likely to be just 5, 6, 7, and Benford's Law won't hold, but that does not mean that the data are faked. – Nick Cox Feb 09 '15 at 15:57
  • What's the P-value? Perhaps you are just getting a signal that Benford's Law is not the only law in town. – Nick Cox Feb 09 '15 at 15:59
  • My data is indeed over several orders of magnitude. And as you can see in the sample data I posted, it fits Benford's Law pretty perfectly. For 8 degrees of freedom and with frequencies from my sample size on this data I shared (19,500), the p value is 0.33. – Jay Feb 09 '15 at 16:04
  • I know there are many laws, but I do not accept that Benford is an inappropriate model here given my data. If the stats say it's not a good fit, I am not using the correct stats to make my argument. – Jay Feb 09 '15 at 16:05
  • Also, when I run this over my larger datasets, with 64,000 people, the CV is 22368, which gives a ridiculous P value. On samples that large, if observed data deviated from Benford by 0.001% it would look significantly different. But it's still a good model for what's observed in a general sense, and I need a way to argue that. – Jay Feb 09 '15 at 16:07
  • But that means **your data are consistent with Benford's Law**. If the Law fitted almost perfectly, chi-square testing would yield a very small chi-square statistic and a large P-value. I thought you were complaining about a rejection that clashed with the perception that there's a good fit. – Nick Cox Feb 09 '15 at 16:08
  • No idea what you are doing with "CV"; that's not explained and may or may not be sound. – Nick Cox Feb 09 '15 at 16:09
  • Please post the actual observed frequencies so that your calculation can be checked. – Nick Cox Feb 09 '15 at 16:11
  • 1
    The attitude that it's the job of statistically-minded people to provide arguments for the interpretation you prefer is at best dubious and at worst... much worse. The stance should always be: This is my interpretation. Is it statistically sound? I don't want to over-react to your wording, but there is a key principle at stake. – Nick Cox Feb 09 '15 at 16:14
  • Here is some actual data where I think I should be getting a statistical match and I'm not (p value is very tiny): 59693 35077 24655 19209 15136 12488 11184 9509 9048 (I can't get the formatting right, but those are frequencies for 1-9 respectively). Sample size is 196,000. – Jay Feb 09 '15 at 16:17
  • Nick - I totally understand your point. But my issue is basically that I'm not sure chi-squared is an appropriate test for what I'm trying to show. Looking at this data, it is clear that Benford is a good general model. Correlations are 0.999 or better. If chi-squared rejects it, so be it, but there must be some other way to show a good fit that is clearly there. – Jay Feb 09 '15 at 16:23
  • As an applied example, when Benford is used in forensic accounting, they don't flag your books as potentially fraudulent because they have 32% FSDs of 1 instead of 30.1%. If you generally follow what Benford expects, everyone would agree that it's reasonable. – Jay Feb 09 '15 at 16:25
  • 3
    Benford provides a near-perfect - but not quite perfect - fit. However, because you have *so much data*, any decent goodness of fit test can detect that there are tiny deviations. Indeed, you *should expect to see consistent deviations in almost any real data set*, since the arguments by which Benford's law should hold are not *exact* for data over finite ranges. Hypothesis tests are not the right tool for the questions of real interest here. – Glen_b Feb 09 '15 at 16:34
  • 25,000; 64,000; 196,000. This is getting complicated. Perhaps you should go back and edit your question and tell us more in your question. Comment threads can get so long that people just won't read them. – Nick Cox Feb 09 '15 at 16:36
  • Thanks @Glen_b - that's sort of the point I've reached in thinking about this. It's too bad, though, because when I submit this paper, I know the first instinct of the reviewers will be to ask for the statistical test. It's going to be hard to say "No no, that's not what I need here, but if you look, it's obviously a good fit." Not that you can solve this problem for me - but thanks :) – Jay Feb 09 '15 at 16:36
  • I'd address the problem head on by discussing it right in the paper. – Glen_b Feb 09 '15 at 16:38
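
The points raised in the comments can be sketched with the frequencies Jay posted: a goodness-of-fit test on the counts, followed by measures of how large the departure from Benford actually is. This is a rough sketch under stated assumptions, not the method used by anyone in the thread. Cohen's w (the square root of chi-squared over n) is a swapped-in effect-size measure that nobody above proposed, and the closed-form p-value relies on the degrees of freedom being even (8 here). All names are illustrative.

```python
import math

# Frequencies for first digits 1..9, as posted in the comments
observed = [59693, 35077, 24655, 19209, 15136, 12488, 11184, 9509, 9048]
n = sum(observed)

# Expected counts under Benford's Law
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
expected = [n * p for p in benford]

# Pearson chi-squared goodness-of-fit statistic on counts, df = 9 - 1 = 8
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail probability for 8 degrees of freedom. For even df = 2m the
# survival function is exp(-x/2) * sum_{i<m} (x/2)^i / i!, with m = 4 here.
half = chi2 / 2
p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))

# Effect size (assumption: Cohen's w as a sample-size-free measure of misfit;
# by convention about 0.1 is "small", 0.3 "medium", 0.5 "large")
w = math.sqrt(chi2 / n)

# Largest absolute gap between observed and Benford proportions
max_gap = max(abs(o / n - p) for o, p in zip(observed, benford))

print("n =", n)
print("chi-squared =", round(chi2, 2), "p-value =", p_value)
print("Cohen's w =", round(w, 4), "max |observed - Benford| =", round(max_gap, 4))
```

With a sample this large the test rejects, yet the effect size is far below the conventional "small" threshold. That is one way to report both facts side by side in a paper, in the spirit of addressing the problem head on.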
