
I'm just looking for ideas here. Say we want to predict the presence of a certain "thing" in a country, postcode by postcode. Take the UK, where there could be around 1.7 million postcodes: you want to build a model that outputs either a 1 or a 0 for each postcode.

One constraint could be that the "thing" has uptake in 400k postcodes. You can imagine typical demographic data as predictor variables: population, income data, age range...

However, I feel that independence between the outputs is too strong an assumption, or at least I'd like to test this before applying models that assume it.
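To make concrete what "testing this" might look like, here is a minimal sketch of a Moran's I permutation test on a binary outcome. Everything in it is an assumption for illustration: the data are synthetic, the neighbourhood is defined by k nearest neighbours on made-up coordinates, and the helper names (`knn_weights`, `morans_i`, `permutation_pvalue`) are hypothetical. The dense distance matrix would not scale to 1.7 million postcodes, so this only shows the idea on a small sample.

```python
import numpy as np

def knn_weights(coords, k=5):
    """Binary k-nearest-neighbour weight matrix (dense; small samples only)."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]   # indices of the k closest points
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], idx] = 1.0
    return W

def morans_i(y, W):
    """Moran's I: (n / sum(W)) * (z' W z) / (z' z), with z the centred outcome."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_pvalue(y, W, n_perm=999, seed=0):
    """One-sided p-value: how often shuffled outcomes look at least as clustered."""
    rng = np.random.default_rng(seed)
    observed = morans_i(y, W)
    perms = np.array([morans_i(rng.permutation(y), W) for _ in range(n_perm)])
    p = (1 + np.sum(perms >= observed)) / (n_perm + 1)
    return observed, p

# Synthetic stand-in for postcode centroids and 0/1 outcomes (not real data).
rng = np.random.default_rng(42)
coords = rng.uniform(0, 1, size=(300, 2))
y_clustered = (coords[:, 0] > 0.5).astype(int)   # uptake only in the east half
y_random = rng.integers(0, 2, size=300)          # uptake unrelated to location

W = knn_weights(coords, k=5)
i_clustered, p_clustered = permutation_pvalue(y_clustered, W)
i_random, p_random = permutation_pvalue(y_random, W)
print(f"clustered: I={i_clustered:.3f}, p={p_clustered:.3f}")
print(f"random:    I={i_random:.3f}, p={p_random:.3f}")
```

A small p-value for the real outcome (or for the residuals of a plain logistic fit) would suggest that the independence assumption is doubtful.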

I'm not really sure where to start, so perhaps someone could give me some guidance. It wouldn't quite be plain logistic regression, since the output variable would have some spatial correlation. There's also possibly enough training data with this large a set of postcodes. The model could perhaps be built using postcode sectors first, but we'd still like some level of spatial correlation included...
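For illustration, one way spatial correlation could be included in an otherwise ordinary logistic model is an autocovariate term: the share of each postcode's neighbours that have the "thing". The sketch below is only that, under stated assumptions: the data are simulated, the neighbourhood is k nearest neighbours, the fit is a hand-rolled Newton (IRLS) loop with a tiny ridge term for numerical stability, and `neighbour_mean` and `fit_logistic` are hypothetical helper names. I'm not claiming this is the right model for the problem.

```python
import numpy as np

def neighbour_mean(coords, y, k=8):
    """Autocovariate: mean outcome among each point's k nearest neighbours."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # exclude the point itself
    idx = np.argsort(d, axis=1)[:, :k]
    return y[idx].mean(axis=1)

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Logistic regression by Newton's method (IRLS) with a tiny ridge penalty."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        grad = X.T @ (y - p) - ridge * beta
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated data: one demographic covariate plus a smooth spatial effect.
rng = np.random.default_rng(0)
n = 400
coords = rng.uniform(0, 1, size=(n, 2))
x = rng.normal(size=n)                                   # e.g. scaled income
logit = 0.8 * x + 2.0 * np.sin(2 * np.pi * coords[:, 0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

autocov = neighbour_mean(coords, y, k=8)
X = np.column_stack([np.ones(n), x, autocov])            # intercept, x, autocovariate
beta = fit_logistic(X, y)
print("intercept, covariate, autocovariate:", np.round(beta, 3))
```

A clearly positive autocovariate coefficient would indicate that neighbouring outcomes carry information beyond the demographic covariate. Note that the autocovariate is built from observed outcomes, so predicting for postcodes with unknown neighbours would need extra care.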

Any ideas or food for thought welcome, I'm kind of stumbling around in the dark so far. Am I right for instance in saying that multivariate regression would not work here, because the binary output variables are not independent?

  • I found that earlier question by looking at [questions tagged both "spatial" and "logistic"](https://stats.stackexchange.com/questions/tagged/spatial+logistic?tab=Newest). Perhaps some of the others there might also be useful. – Stephan Kolassa Jun 11 '21 at 07:23
  • I'll have a look at the others - I think using a binomial model is too strong of an assumption, at the start. – Christopher Turnbull Jun 11 '21 at 07:47
  • Or, is there a way to use a binomial model in this case, where you suspect there is not independence between the output variables? – Christopher Turnbull Jun 11 '21 at 07:49
