I have a dataset of avoided maritime accidents (near misses), structured as follows:
All variables are categorical: type (1-3), position (1-5), area (1-5), and risk (1-7). There are 4 columns and 525 rows, and each row describes one near miss. I encoded the variables in R.
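For illustration only, the layout can be sketched as below. This is a minimal sketch, assuming the category codes are plain integers. The values are random placeholders and not the real records, and the helper names (`make_placeholder_rows`, `level_counts`) are hypothetical.

```python
import random
from collections import Counter

# Number of levels per variable, taken from the description above.
LEVELS = {"type": 3, "position": 5, "area": 5, "risk": 7}
N_ROWS = 525  # one row per reported near miss


def make_placeholder_rows(n_rows=N_ROWS, seed=0):
    """Return n_rows dicts of random category codes (placeholders, NOT real data)."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(1, k) for name, k in LEVELS.items()}
        for _ in range(n_rows)
    ]


def level_counts(rows, column):
    """Count how often each category code occurs in one column."""
    return Counter(row[column] for row in rows)


rows = make_placeholder_rows()
for name in LEVELS:
    # Per-variable frequency table, the usual first look at categorical data.
    print(name, sorted(level_counts(rows, name).items()))
```

The per-variable frequency tables printed at the end are the kind of summary any check on this dataset would have to start from, since there are no numeric magnitudes to work with.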
From my own experience and that of others, I know that near-miss data are often fabricated to satisfy bureaucratic reporting requirements ahead of future inspections. Even if no accidents were avoided, almost every company expects you to invent some and report them anyway.
Before I do any analysis of this dataset, I would like to test whether a significant share of it is fabricated.
I am familiar with Benford's law (as used to detect fraud in economics and finance), but I am interested in the following:
- How can Benford's law be applied to categorical variables?
- Are there other (statistical or ML) ways to detect fabricated records in raw data?
- If not, which methods would you recommend for analyzing this dataset and finding any structure in it?