Benford's law for categorical variables?

Question

I have a dataset on avoided maritime accidents (near-miss) that looks like this:

[Image: dataset of avoided maritime accidents]

All variables are categorical (type = 1-3, position = 1-5, area = 1-5, risk = 1-7). There are 4 columns and 525 rows, each row describing one near miss, and I encoded the variables in R.

From my own and others' experience, I know that accident-avoidance data are often fabricated to satisfy bureaucratic forms ahead of future inspections. Even if there are no avoided accidents, almost every company expects you to invent and report them anyway.

Before I do any analysis of this set, I would like to test whether a significant share of the data is fabricated.

I am familiar with Benford's law (for economics and finance frauds), but I am interested in the following:

How to use Benford's law when it comes to categorical variables?
Are there other (statistical or ML) ways to detect fabricated data in raw data?
If not, which methods do you recommend to analyze this dataset and find any structure in it?

Interesting setting, but I don't see how Benford's law could possibly apply. Instead, I would think carefully about what you might expect various statistics to look like in the absence of fraud. For example, might the number of accidents on a ship per year follow the Poisson distribution if accidents arrive randomly at some constant rate? If your intuition is indeed correct, there may be unusually few ships or voyages with 0 near misses compared to the Poisson distribution. (The arrival rate and length of time almost certainly vary between ships which could make this analysis kind of tricky.) — Matthew Gunn, Jan 02 '21 at 17:20
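A minimal sketch of the zero-count check suggested in this comment, in Python using only the standard library. The per-ship counts below are hypothetical, made up purely for illustration, and the sketch assumes every ship shares one constant rate, which the comment itself warns is unlikely to hold. The name `binom_cdf` is a helper introduced here, not a library function.

```python
import math

# HYPOTHETICAL near-miss counts per ship per year (illustrative, not real data).
counts = [1, 2, 1, 3, 1, 2, 2, 1, 4, 1, 2, 3, 1, 1, 2, 1, 2, 3, 2, 1]

n = len(counts)
lam_hat = sum(counts) / n        # Poisson rate estimate, assuming one common rate
p_zero = math.exp(-lam_hat)      # P(count = 0) under Poisson(lam_hat)

observed_zeros = sum(1 for c in counts if c == 0)
expected_zeros = n * p_zero


def binom_cdf(k, n, p):
    """P(Z <= k) for Z ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Chance of seeing this few zero-count ships if the Poisson model were right.
# Treats p_zero as known, so it ignores the uncertainty in lam_hat.
p_value = binom_cdf(observed_zeros, n, p_zero)

print(f"rate estimate: {lam_hat:.2f}")
print(f"zero-count ships: observed {observed_zeros}, expected {expected_zeros:.2f}")
print(f"P(at most {observed_zeros} zeros | Poisson): {p_value:.4f}")
```

A small probability here would only say that the counts look unlike a single-rate Poisson process. It would not by itself show fabrication, since differing rates and exposure times between ships can also distort the number of zeros.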
The general approach I outlined above is to argue that some pattern would be extremely unusual with accurate data. That may not be possible though. If you had observations clearly known to be fraudulent and not fraudulent, there's more you could do (e.g., train a logistic regression on the labelled data). — Matthew Gunn, Jan 02 '21 at 17:32
May I use Benford's law if I normalize the categorical variables, i.e., scale them to the range 0-1? I read that Benford's law also applies to continuous variables, e.g., physical constants. I will test the Poisson distribution, and I intend to fit a logistic regression with what I have (the y variable will be the cause of the near miss: human factor, equipment, ...), but I don't know what to do if I don't find any structure and can't prove that the data is half fake. — Mario Mandušić, Jan 02 '21 at 18:55
What do you mean by normalizing a categorical variable? That doesn't make sense to me at all. Benford's law isn't some magical incantation applicable anytime you see a number. Speaking loosely, the leading digit of a random variable $X$ may follow Benford's law when the distribution of $\log X$ is uniform over an appropriately wide interval (or in other related situations). This can arise when there's exponential growth (e.g., revenues grow by some random percentage each month). How is that in any way related to categorical variables (e.g., equipment, personal injury, ...) where you don't even have numbers? — Matthew Gunn, Jan 02 '21 at 19:31
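The log-uniform condition in this comment can be checked with a short simulation. The sketch below is in Python with the standard library only. The range $[0, 3)$ for $\log_{10} X$, the sample size, and the seed are arbitrary choices made for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000

# Assumption for illustration: log10(X) is uniform on [0, 3), so X spans 1 to 1000.
digit_counts = [0] * 10
for _ in range(n):
    u = random.uniform(0.0, 3.0)      # u = log10(X)
    # The leading digit of X = 10**u depends only on the fractional part of u.
    d = min(int(10 ** (u % 1.0)), 9)  # min() guards a floating-point edge case
    digit_counts[d] += 1

for d in range(1, 10):
    simulated = digit_counts[d] / n
    benford = math.log10(1 + 1 / d)
    print(f"{d}: simulated {simulated:.3f}  Benford {benford:.3f}")
```

The simulated frequencies should track $\log_{10}(1 + 1/d)$ closely. The match comes entirely from the values spreading evenly across orders of magnitude, which a category code has no way of doing.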
If a number falls in $[10, 20)$, it will have a leading digit of 1. If it falls in $[80, 90)$, it will have a leading digit of 8. While the intervals $[10, 20)$ and $[80, 90)$ have the same length, in log space the interval $[\log 10, \log 20)$ is much wider than the interval $[\log 80, \log 90)$. If the probability that $X$ falls in $[10, 20)$ is proportional to $\log 20 - \log 10$, and likewise for the other intervals, then you get Benford's law for the leading digit of $X$. — Matthew Gunn, Jan 02 '21 at 19:45
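The interval argument in this comment can be turned into numbers directly. This is a small sketch in Python. It uses base-10 logs so that the nine widths within one decade sum to 1, and the function name `benford_prob` is introduced here for illustration.

```python
import math


def benford_prob(d):
    """Width of [log10(d), log10(d + 1)); a full decade has width 1."""
    return math.log10(d + 1) - math.log10(d)


probs = {d: benford_prob(d) for d in range(1, 10)}
for d, p in probs.items():
    print(f"leading digit {d}: {p:.4f}")

# Same linear length, very different log-space width:
width_10_20 = math.log10(20) - math.log10(10)  # equals benford_prob(1)
width_80_90 = math.log10(90) - math.log10(80)  # equals benford_prob(8)
print(f"[10, 20): {width_10_20:.4f}   [80, 90): {width_80_90:.4f}")
```

The two widths come out at roughly 0.301 and 0.051, which are the Benford probabilities for leading digits 1 and 8. A variable coded 1-5, by contrast, can never produce the digits 6-9, so its "leading digits" cannot match this distribution however the codes are rescaled.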
I mean this normalization (0-1) before logistic regression: https://stats.stackexchange.com/questions/48360/is-standardization-needed-before-fitting-logistic-regression. Wolfram says "Benford's law applies not only to scale-invariant data, but also to numbers chosen from a variety of different sources. Explaining this fact requires a more rigorous investigation of central limit-like theorems for the mantissas of random variables under multiplication." If the mantissa can be analyzed with Benford's law, can't this be as well? — Mario Mandušić, Jan 02 '21 at 20:58


0 Answers