Background: We work with data from sports events, more precisely with data about the spectators of those events: how many people are violent, what kind of event it is, etc. We have quite a lot of data from the past few years, and we are trying to find the "right" number of security staff needed to minimize violence while staying within some kind of budget.
Aim: we want to forecast, for a given set of explanatory variables (weather, type of sports event, location, etc.), the expected level of violence (low, medium, high) as a function of the number of security guards present at the game.
Problem: The historical data is of course highly correlated: the number of security guards is roughly proportional to the observed violence, and it is also strongly related to the other variables (presumably safety experts were consulted to assess how dangerous each event would be). Relying on the conditional-independence assumption of naive Bayes therefore seems wrong.
Question: What is a correct way to forecast violence as a function of the number of guards present at the event?
My guess: I could discretize the number of guards into 3-4 bins (e.g. few, some, many, a lot) to remove some of the correlation, and then train a separate model on each bin's subset of the data to forecast violence from the remaining input variables. But I lose a lot of information by training each model on only a subset of my data.
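To make that idea concrete, here is a minimal sketch of the per-bin approach, assuming a pandas DataFrame with hypothetical columns n_guards, weather, event_type, location and violence_level, invented cut points for the bins, and a scikit-learn random forest used purely as a placeholder classifier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical layout: one row per event with a guard count, a few explanatory
# variables, and the observed violence level (low / medium / high).
df = pd.read_csv("events.csv")

# Discretize the number of guards into ordinal bins; the cut points are guesses.
df["guard_bin"] = pd.cut(
    df["n_guards"],
    bins=[0, 20, 50, 100, float("inf")],
    labels=["few", "some", "many", "a_lot"],
)

# One-hot encode the explanatory variables once, on the full data, so every
# per-bin model sees the same feature columns.
X_all = pd.get_dummies(df[["weather", "event_type", "location"]])

# Train one classifier per bin, each on the subset of events in that bin.
models = {}
for bin_label in df["guard_bin"].cat.categories:
    mask = df["guard_bin"] == bin_label
    if mask.sum() == 0:
        continue  # skip empty bins
    models[bin_label] = RandomForestClassifier(random_state=0).fit(
        X_all[mask], df.loc[mask, "violence_level"]
    )

# To forecast a planned event, predict with each bin's model and compare the
# results: this gives the expected violence level as a function of the guard bin.
```

This sketch also makes the drawback visible: each model only ever sees the events whose guard count falls in its own bin, so the comparison across bins rests on much less data than the full set.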