What kind of analysis would be appropriate for my data?

Question

I'm working on a project looking at the geographic distribution of a type of physician in the United States. My data is as such:

I would like to identify zip-code-level characteristics that predict the presence of a doctor in that zip code. I was planning to run a binary logistic regression in SPSS.

However, my data is interesting in that there are only 376 zip codes in the United States that have at least one of these kinds of doctors (and 40,661 zip codes that do not). What kind of analysis would you recommend for this kind of data?

If you use a model that includes only the intercept and predicts that no zip code has a doctor, the model is correct >99% of the time. When I run the regression in SPSS with other variables included, it still does not predict a doctor in any zip code (but is still correct >99% of the time). My stats background is primarily self-taught, so I may be misunderstanding something... — user2930701, Oct 14 '19 at 18:28
@user2930701 That's good reasoning, but I think a different conclusion needs to be drawn from it: the question is, to what extent can one do any *better* than being correct 99% of the time? Another approach is this: how well can you do if you are given the total number of doctors and asked to guess which zip codes have at least one, based on the kind of Census data shown in the question? — whuber, Oct 14 '19 at 18:38
@whuber Appreciate the response. Could you explain the second approach again? Thanks! — user2930701, Oct 14 '19 at 18:46
You seem to be evaluating the probability forecasts given by a logistic regression by accuracy, which is not a *proper scoring rule*. Have a look at https://stats.stackexchange.com/questions/109851/using-proper-scoring-rule-to-determine-class-membership-from-logistic-regression and search this site! — kjetil b halvorsen, Mar 11 '21 at 02:52
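
The points raised in the comments can be illustrated with a small simulation. This is only a sketch on made-up data, not the poster's data or an SPSS workflow: the number of zip codes (`N`), the coefficients (`INTERCEPT`, `SLOPE`), and the single predictor `x` are all assumptions, and the true probabilities stand in for a fitted logistic regression. It compares an intercept-only forecast with an informative one using accuracy, two proper scoring rules (Brier score and log loss), and the "guess which zip codes have a doctor, given the total" check.

```python
import math
import random

random.seed(0)  # fixed seed so the simulation is repeatable

N = 40_000        # hypothetical number of zip codes (not the real count)
INTERCEPT = -5.0  # assumed; gives a prevalence of roughly 1%
SLOPE = 1.0       # assumed effect of one standardized, made-up predictor


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Simulate one predictor per zip code and a rare 0/1 outcome.
x = [random.gauss(0.0, 1.0) for _ in range(N)]
p_true = [sigmoid(INTERCEPT + SLOPE * xi) for xi in x]
y = [1 if random.random() < p else 0 for p in p_true]

base_rate = sum(y) / N
p_base = [base_rate] * N  # intercept-only forecast: same probability everywhere
p_model = p_true          # stand-in for a well-fitted logistic regression


def accuracy(y, p, threshold=0.5):
    """Share of correct 0/1 calls after thresholding the probabilities."""
    return sum((pi >= threshold) == bool(yi) for yi, pi in zip(y, p)) / len(y)


def brier(y, p):
    """Mean squared error of the probability forecasts (lower is better)."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def log_loss(y, p):
    """Mean negative log-likelihood of the outcomes (lower is better)."""
    return -sum(math.log(pi if yi else 1.0 - pi) for yi, pi in zip(y, p)) / len(y)


def top_k_hits(y, p, k):
    """Count true positives among the k zip codes with the highest forecast."""
    ranked = sorted(range(len(y)), key=lambda i: p[i], reverse=True)
    return sum(y[i] for i in ranked[:k])


k = sum(y)  # total number of zip codes with a doctor, treated as known
acc_base, acc_model = accuracy(y, p_base), accuracy(y, p_model)
bs_base, bs_model = brier(y, p_base), brier(y, p_model)
ll_base, ll_model = log_loss(y, p_base), log_loss(y, p_model)
hits = top_k_hits(y, p_model, k)
expected_by_chance = k * base_rate

print(f"accuracy : baseline {acc_base:.4f}  model {acc_model:.4f}")
print(f"Brier    : baseline {bs_base:.5f}  model {bs_model:.5f}")
print(f"log loss : baseline {ll_base:.5f}  model {ll_model:.5f}")
print(f"top-{k} hits: model {hits}, about {expected_by_chance:.1f} expected by chance")
```

In a setup like this, accuracy at a 0.5 cutoff barely separates the two forecasts because almost no predicted probability exceeds 0.5, whereas the log loss and the top-k count should favor the forecast that uses the predictor. The same comparison could be made on probabilities exported from SPSS.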

0 Answers