Background: We work with data from sports events, more precisely with data about the spectators of those events: how many people are violent, what kind of event it is, etc. We have quite a lot of data from the past few years, and we are trying to find the "right" number of security staff needed to minimize violence while staying within some kind of budget.
Aim: we want to forecast, for a given set of explanatory variables (weather, type of sports event, location, etc.), the expected level of violence (low, medium, high) as a function of the number of security guards present at the game.
Problem: The historical data is of course highly correlated: the number of security guards is roughly proportional to the observed violence, and it is also strongly related to the other variables (presumably safety experts were consulted to assess how dangerous each event would be). Relying on the conditional-independence assumption of naive Bayes therefore seems wrong.
Question: What is a correct way to forecast violence as a function of the number of guards present at the event?
My guess: I could discretize the number of guards into 3-4 bins (e.g. few, some, many, a lot) to remove some of the correlation, and then train a separate model on each bin's subset of the data to forecast violence from the remaining input variables. But I lose a lot of information by training each model on only a subset of my data.
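To make that idea concrete, here is a minimal sketch of the per-bin approach, assuming a pandas DataFrame with hypothetical columns n_guards, weather, event_type, location and violence_level, invented cut points for the bins, and a scikit-learn random forest used purely as a placeholder classifier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical layout: one row per event with a guard count, a few explanatory
# variables, and the observed violence level (low / medium / high).
df = pd.read_csv("events.csv")

# Discretize the number of guards into ordinal bins; the cut points are guesses.
df["guard_bin"] = pd.cut(
    df["n_guards"],
    bins=[0, 20, 50, 100, float("inf")],
    labels=["few", "some", "many", "a_lot"],
)

# One-hot encode the explanatory variables once, on the full data, so every
# per-bin model sees the same feature columns.
X_all = pd.get_dummies(df[["weather", "event_type", "location"]])

# Train one classifier per bin, each on the subset of events in that bin.
models = {}
for bin_label in df["guard_bin"].cat.categories:
    mask = df["guard_bin"] == bin_label
    if mask.sum() == 0:
        continue  # skip empty bins
    models[bin_label] = RandomForestClassifier(random_state=0).fit(
        X_all[mask], df.loc[mask, "violence_level"]
    )

# To forecast a planned event, predict with each bin's model and compare the
# results: this gives the expected violence level as a function of the guard bin.
```

This sketch also makes the drawback visible: each model only ever sees the events whose guard count falls in its own bin, so the comparison across bins rests on much less data than the full set.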