This is a theoretical problem that I'm trying to calculate whilst at work.

Essentially, what is the probability of a car accident in a state given that these variables are provided to you:

  1. Counties within the US States
  2. US States
  3. Number of car-crashes within the Counties
  4. Year
  5. Miles traveled; average and total
  6. Vehicle speed

My initial approach was to use a Poisson distribution with the following parameters:
$\lambda = rt$
$r$ is the accident rate, calculated as the number of car crashes within a county multiplied by the average miles traveled, all divided by the total miles traveled by all vehicles in the county.
$t$ is the time; I'm working with data over 8 years, so $t = 1, 2, 3, 4, 5, 6, 7, 8$
$k$ is the number of accidents in a county

The problem I have with using the Poisson distribution is that counties with many accidents get a far lower probability than counties with few accidents. That suggests an accident is more likely in places with fewer accidents than in places with more, which I disagree with.

Some sample data:

#merged_counties
   States     counties average_miles accidents total_miles  TRAV_SP
1 Alabama  AUTAUGA (1)      723.8889        16        6515 54.00000
2 Alabama  BALDWIN (3)      192.7250        65        7709 44.23529
3 Alabama  BARBOUR (5)      569.2857         9        3985 50.42857
4 Alabama     BIBB (7)      599.8000         8        2999 54.28571
5 Alabama   BLOUNT (9)      349.9231        18        4549 56.05882
6 Alabama BULLOCK (11)      705.0000         3        1410  5.00000

What alternative model would better capture the probability of a car accident? I understand that more variables should be considered, such as weather, types of road, the person's psychology at the time, etc., but I'm aiming for the simplest framework.

Assuming that the total number of accidents in Alabama is 1000, shouldn't Baldwin have a higher probability of an accident occurring? The calculation says otherwise.

accident_rate <- (merged_counties$accidents*merged_counties$average_miles)/merged_counties$total_miles


# Poisson probability of observing exactly k = accidents[i] events, with t = 1
x <- numeric(6)
for (i in 1:6) {
  k <- merged_counties$accidents[i]
  x[i] <- (accident_rate[i] * 1)^k * exp(-accident_rate[i] * 1) / factorial(k)
}
x
[1] 8.041571e-11 1.211741e-78 7.314061e-06 2.150642e-04 1.368526e-14

Note: I misinterpreted the Poisson distribution: $k$ should be 1, as I'm looking for the probability of a single accident. With that change the probabilities turn out far more reasonable than the values in the R code above.
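
A minimal sketch of the difference the note describes, assuming the six sample rows above and the question's own rate definition with $t = 1$. The data frame is rebuilt by hand so the snippet is self-contained; `p_exactly_one` and `p_at_least_one` are illustrative names, not part of the original code.

```r
# Rebuild the sample rows shown above (values copied from the question).
merged_counties <- data.frame(
  counties      = c("AUTAUGA", "BALDWIN", "BARBOUR", "BIBB", "BLOUNT", "BULLOCK"),
  average_miles = c(723.8889, 192.7250, 569.2857, 599.8000, 349.9231, 705.0000),
  accidents     = c(16, 65, 9, 8, 18, 3),
  total_miles   = c(6515, 7709, 3985, 2999, 4549, 1410)
)

# Rate as defined in the question, with t = 1.
accident_rate <- with(merged_counties, accidents * average_miles / total_miles)

# P(K = 1): exactly one accident in the interval.
p_exactly_one <- dpois(1, lambda = accident_rate)

# P(K >= 1) = 1 - P(K = 0): at least one accident in the interval.
p_at_least_one <- ppois(0, lambda = accident_rate, lower.tail = FALSE)

data.frame(counties = merged_counties$counties,
           accident_rate, p_exactly_one, p_at_least_one)
```

Which of the two quantities is wanted depends on whether "an accident occurring" means exactly one accident or one or more.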

Stackcans
  • Why do you use $\lambda = rt$? The Poisson distribution is parametrized by a rate, understood as the "number of events in a specified interval"; I'm not sure $rt$ is what you want it to be. – Tim Sep 07 '21 at 15:06
  • @Tim I've had the alternative idea of taking the mean of the accident rates, given that $\lambda$ is the expected value of a Poisson distribution. Though I still agree with you, and it may be that $rt$ is not the parameterization I'm after. Do you have an alternative suggestion? – Stackcans Sep 07 '21 at 15:18
  • accidents / miles ? – Tim Sep 07 '21 at 15:47
  • @Tim This seems to worsen the probabilities. The aim was for accidents * average miles / total miles to indicate the accident rate per average mile traveled. I'm unsure about the Poisson model for this idea at this stage, though, since counties with more accidents should score higher, yet they get a far lower probability – Stackcans Sep 07 '21 at 16:13
  • Why do you need to divide average by total? What do “average” and “total” mean here? – Tim Sep 07 '21 at 16:34
  • @Tim Essentially the idea of accident rates was taken from various articles and information online. The average is the mean of miles traveled over all car incidents recorded in the county. The total miles traveled is the sum of miles traveled over these recorded incidents in the county. – Stackcans Sep 07 '21 at 16:37
  • You should drive more (at least in different places) so that you have an understanding that the safety of a road relates to local weather events (like flash floods which can leave debris) and the ability of the county to keep the road safe. Also, are the police active or even able to control people who regularly drive in impaired conditions, at excess speeds, etc. Hence, the significance of the geographic factor. – AJKOER Sep 07 '21 at 16:39
  • @AJKOER I made reference to this in the question and know that these factors do impact accidents; however, I'm aiming for the simplest model with the data that I have. – Stackcans Sep 07 '21 at 16:42
  • @Tim I believe my mistake is that I'm trying to calculate the number of accidents that have occurred as opposed to a single accident occurring given the number that have occurred ... Thanks for the help Tim! It seems that 'Accidents / miles' works more efficiently for those larger values. I've also had a second thought on the exposure to accidents, and whether I could do (accident risk * exposure) - Though what are your thoughts on the calculation for exposure? I was assuming the number of vehicles in a county * average miles traveled? – Stackcans Sep 07 '21 at 16:49
  • The variable "Year" is an interesting proxy variable. At times, there are just fewer cars on the road, even at peak hours, due to economic recessions, pandemics.... So the variable "Year" is a proxy dummy variable for average traffic density. Relatedly, think of a kinetic collision model. – AJKOER Sep 07 '21 at 16:52
  • Also, the variable "Year" is a potential trend variable for climate change. Think of a kinetic model for gases (in effect, gas molecule collisions) that are subject to an increasing heat source. – AJKOER Sep 07 '21 at 16:59
  • @AJKOER I too think year is a great variable to use. What are your thoughts on calculating accident risk by exposure, given that the risk of an accident may increase with exposure to something, such as the variables you mention, like weather? I do have weather data but have not included it in the data provided. Perhaps this could be used in the calculation? Furthermore, I hope to multiply by year as $t$, though accidents vary over the years, as do miles. How do I account for this when multiplying by year? – Stackcans Sep 07 '21 at 17:03
  • Isn’t it rather accidents / total miles / number of cars = average miles / cars? – Tim Sep 07 '21 at 17:35
  • I am, perhaps erroneously, suggesting the application of a different model. See https://www.chem.tamu.edu/rgroup/hughbanks/courses/102/slides/slides17_2.pdf a chemical reaction rate function employing concentration (here traffic density times exposure) per the Arrhenius equation which includes a frequency of collision factor (the actual variable of interest) and the temperature (here year proxy for activity level) of the system. Note: Taking log induces a more interesting relationship to explain collision frequency as a function of variables. – AJKOER Sep 07 '21 at 18:12
  • @Tim I'll have a further look into this. Though I have a question on how to include $t$ as a variable when $r$ varies for each year. I was thinking along the lines of: $\frac{(r_1t_1)^k \cdot e^{-(r_1t_1)}}{k!}\frac{(r_2t_2)^k \cdot e^{-(r_2t_2)}}{k!}\frac{(r_3t_3)^k \cdot e^{-(r_3t_3)}}{k!} ...$ and so forth? – Stackcans Sep 08 '21 at 02:10
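
One way to read the product in the last comment, sketched under the assumption that the yearly counts $K_i$ are independent Poisson variables with rates $r_i t_i$ (the symbols $K_i$, $k_i$, and $n$ are introduced here and are not used in the thread): the product is the joint probability of observing the counts $k_1, \dots, k_n$, while the total over all years is itself Poisson with the summed rate.

$$
\Pr(K_1 = k_1, \dots, K_n = k_n) = \prod_{i=1}^{n} \frac{(r_i t_i)^{k_i} \cdot e^{-r_i t_i}}{k_i!},
\qquad
\sum_{i=1}^{n} K_i \sim \text{Poisson}\left(\sum_{i=1}^{n} r_i t_i\right)
$$

If each year is treated as one unit of time, then $t_i = 1$ and the yearly rates simply add.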

1 Answer

You asked about the simplest approach. The Poisson distribution is parametrized by the rate $\lambda$, which can be understood as the "number of events in a specified interval". The simplest approach would be to set it to something like accidents / miles, so that you disregard all the details and just consider a random mile driven by a random car per state.
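
As a rough illustration of the accidents / miles idea, here is a minimal sketch using the six sample rows from the question. The `exposure_miles` value is a hypothetical trip length chosen for illustration, and the data frame and column names are made up for this snippet.

```r
# Sample rows from the question (values copied from the table there).
counties <- data.frame(
  county      = c("AUTAUGA", "BALDWIN", "BARBOUR", "BIBB", "BLOUNT", "BULLOCK"),
  accidents   = c(16, 65, 9, 8, 18, 3),
  total_miles = c(6515, 7709, 3985, 2999, 4549, 1410)
)

# Simplest rate: accidents per mile.
counties$rate_per_mile <- counties$accidents / counties$total_miles

# Hypothetical exposure in miles (an assumption, not from the thread).
exposure_miles <- 100

# Under a Poisson model the expected count over the exposure is rate * miles,
# and P(at least one accident) = 1 - P(zero accidents).
lambda <- counties$rate_per_mile * exposure_miles
counties$p_at_least_one <- 1 - exp(-lambda)

# Counties ordered from highest to lowest probability.
counties[order(-counties$p_at_least_one), ]
```

With this rate, counties with more accidents per mile get a higher probability. Note that `total_miles` in the question covers only the recorded incidents, so the absolute values are illustrative rather than real risk estimates.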

In the comments you started considering more complicated approaches. If you want to take multiple variables into account, there is no point in overthinking this: just use Poisson regression. You can use the variables you have, and their interactions (when meaningful), as independent variables in the model. It will result in a simple and interpretable model.

Tim
  • I agree with you and have opted for ```accident / miles / total cars``` as you mentioned in the comments. This allows me to take into account the number of accidents relative to total miles and registered vehicles. I have opted against Poisson regression purely on the basis that I required a probability, whereas the coefficients of a Poisson regression would provide me with values too complex to explain to the non-initiated statistician. – Stackcans Sep 08 '21 at 10:06
  • @Stackcans Poisson regression predicts the mean, $\lambda$, of Poisson distribution, so it can be used the same way as the simple approach for calculating the probabilities. – Tim Sep 08 '21 at 10:19
  • Could you provide a practical example of its application, as I fail to see this? I have used: ```summary(glm(VE_TOTAL ~ STATENAME + MILEPT + WEATHER1, accident, family=poisson(link=log)))``` (from raw data here: [accidents](https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/2019/National/)). Are you suggesting that I use the column $Pr(>|z|)$ and treat significant values as having a higher probability? That seems more complex to interpret than a scale of 0 to 1, with 1 being the highest. Unless I have misunderstood you? – Stackcans Sep 08 '21 at 10:50
  • Given the predicted mean from the Poisson regression, it is possible to calculate the probability that the observed number of accidents will be greater than 0. – Jonny Lomond Sep 08 '21 at 11:50
  • @Stackcans basically it would be something like `ppois(z, lambda=predict(glm(VE_TOTAL ~ STATENAME + MILEPT + WEATHER1, accident, family=poisson(link=log)), type="response"), lower.tail=FALSE)` for $\Pr(\hat y > z | X)$. – Tim Sep 08 '21 at 12:02
  • Thank you for this approach! This is new to me and I'll be looking forward to using methods like this more often. +1 – Stackcans Sep 08 '21 at 20:56
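
The idea in the last few comments can be sketched end to end as follows. This is a toy example on simulated data, not the FARS file: the data frame `toy`, its columns, and the simulated rates are all made up for illustration, and the `offset(log(miles))` term is one possible way to model accidents per mile that the thread does not itself use.

```r
set.seed(1)

# Simulated stand-in for the real data; column names and values are made up.
n <- 200
toy <- data.frame(
  state   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  weather = factor(sample(c("clear", "rain"), n, replace = TRUE)),
  miles   = runif(n, 1000, 8000)
)
true_rate <- 0.002 * ifelse(toy$weather == "rain", 1.5, 1)
toy$accidents <- rpois(n, lambda = true_rate * toy$miles)

# Poisson regression with log(miles) as an offset, so the coefficients
# describe accidents per mile rather than raw counts.
fit <- glm(accidents ~ state + weather + offset(log(miles)),
           data = toy, family = poisson(link = "log"))

# New cases to score: state A, both weather types, 100 miles of exposure.
new_rows <- data.frame(
  state   = factor(c("A", "A"), levels = levels(toy$state)),
  weather = factor(c("clear", "rain"), levels = levels(toy$weather)),
  miles   = c(100, 100)
)

# type = "response" returns the predicted mean (lambda), not log(lambda).
lambda_hat <- predict(fit, newdata = new_rows, type = "response")

# P(Y > 0 | X): probability of at least one accident for each new row.
p_any <- ppois(0, lambda = lambda_hat, lower.tail = FALSE)
cbind(new_rows, lambda_hat, p_any)
```

The result is a value between 0 and 1 for each row, which is the kind of probability scale asked for above; replacing `0` with another threshold `z` gives $\Pr(Y > z | X)$.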