
I have a dataset of about 600 observations, each with around 100 attributes. One of these attributes is the target I want to predict. Since the target can only take non-negative integer values, I looked into methods for predicting count data and found that there are several options, such as Poisson regression or negative binomial regression.

For my first try I used negative binomial regression in R:

# glm.nb comes from the MASS package
library(MASS)

# First load the data into a dataset
dataset <- test_observations[, c(5:8, 54)]

# Create the model
fm_nbin <- glm.nb(NumberOfIncidents ~ ., data = dataset[10:600, ])

I then wanted to see what the predicted values look like:

# Create data to test prediction
newdata <- dataset[1:10, ]

# Do the prediction
predict(fm_nbin, newdata, type = "response")

The output, however, looks like this:

     1         2         3         4         5         6         7         8         9        10 
0.2247337 0.2642789 0.2205408 0.2161833 0.1794224 0.2081522 0.2412996 0.2074992 0.2213011 0.2100026 

The problem is that I expected the predicted values to be integers, since that is the whole point of using negative binomial regression. What am I missing here?

Furthermore, I would like to evaluate my predictions in terms of mean squared error, mean absolute error, and a correlation coefficient. However, I couldn't find a way to get these easily without doing all the calculations manually. Is there a built-in function for this?

whuber
user81675
  • Welcome to our site! Have you looked at the [many related questions on interpreting negative binomial regression output](http://stats.stackexchange.com/search?q=interpret+negative+binomial)? – whuber Jul 07 '15 at 15:16
  • Yes I did, but I think those questions are about how to interpret AIC, BIC, or the different coefficients. What I want to know is: if I were to use this model to actually predict my data, why is the output a floating-point number and not an integer? – user81675 Jul 07 '15 at 15:46
  • I find specific answers at http://stats.stackexchange.com/questions/48448, http://stats.stackexchange.com/questions/116007, http://stats.stackexchange.com/questions/60777, http://stats.stackexchange.com/questions/25440. Most explanations of logistic regression and GLMs generally will also apply, too. Your post actually asks an awful lot of questions implicitly, because it is based on several misconceptions--and these explanations address some of them. – whuber Jul 07 '15 at 16:18

1 Answer


Don't forget that GLMs model E[y|X]. The predict function for a glm.nb fit is therefore giving you E[y|X], the conditional expectation. The standard example of a non-integer expected value is rolling a die: the outcomes can only be integers, yet the expected value, 3.5, is not an integer.
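The die example in R, as a quick illustration:

```r
# A fair die produces only integer outcomes,
# yet its expected value is not an integer:
outcomes <- 1:6
mean(outcomes)  # 3.5
```

`predict(..., type = "response")` returns exactly these conditional means. If you need integer point predictions, rounding them (e.g. `round(pred)`) is one option, though it throws away information about the predicted mean.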

As for the mean squared error, check out Hans Roggeman's answer here. It helped me understand model comparison better.
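The metrics themselves are one-liners in base R, so no special package is strictly needed. A self-contained sketch with simulated data standing in for the question's dataset (the variable names here are made up for illustration):

```r
library(MASS)

# Simulate count data as a stand-in for the question's dataset
set.seed(1)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + x), size = 2)
d <- data.frame(x = x, y = y)

# Fit on most of the data, predict on a held-out slice,
# mirroring the setup in the question
fit    <- glm.nb(y ~ x, data = d[11:200, ])
pred   <- predict(fit, newdata = d[1:10, ], type = "response")
actual <- d$y[1:10]

mse <- mean((actual - pred)^2)   # mean squared error
mae <- mean(abs(actual - pred))  # mean absolute error
r   <- cor(actual, pred)         # Pearson correlation coefficient
```

If you prefer a packaged version, packages such as Metrics (`mse`, `mae`) or caret (`postResample`, which reports RMSE and R-squared) provide these, but the base-R expressions above are just as easy.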

user5292