I'm using the following code in R to predict votes (i.e. non-negative integer count data):
m1 <- glm(votes ~ ., data = trainset, family = poisson(link = "sqrt"))
pred1 <- predict(m1, newdata = testset, type = "response")
> pred1[1:10]
3 4 5 6 7 8 11 12 14 15
0.8000618 1.4012718 0.9924539 0.9260005 0.2820739 0.3333504 0.3238205 0.5786863 1.1216740 1.0114024
> str(pred1)
Named num [1:14999] 0.8 1.401 0.992 0.926 0.282 ...
- attr(*, "names")= chr [1:14999] "3" "4" "5" "6" ...
Q1: I don't understand the structure of the predicted output. Isn't pred1 simply a numeric vector? Why does each element of pred1 have a name? For example, the first element, 0.8000618, has the name 3, and so on. What is the purpose of that?
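To make Q1 concrete, here is a minimal sketch on made-up data. It assumes the names are simply the row names of the data frame passed as newdata, which would fit a testset whose rows kept their original indices after a train/test split. The objects toy, train_toy and test_toy are hypothetical stand-ins, not my real data.

```r
# Hypothetical toy data standing in for trainset/testset.
set.seed(1)
toy <- data.frame(x = runif(20), votes = rpois(20, lambda = 2))
train_toy <- toy[1:15, ]
test_toy  <- toy[c(16, 18, 20), ]   # keeps row names "16", "18", "20"

fit  <- glm(votes ~ x, data = train_toy, family = poisson(link = "sqrt"))
pred <- predict(fit, newdata = test_toy, type = "response")

names(pred)            # the row names of test_toy
is.numeric(pred)       # still an ordinary numeric vector, just with names

plain <- unname(pred)  # the same values with the names dropped
```

If that assumption holds, the names only record which row of testset each prediction belongs to, and unname() (or as.vector()) removes them when a bare vector is needed.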
Q2: I get better results using link="sqrt" than I do with link="log" when fitting model m1. I think this setting models the response through sqrt(prediction) or log(prediction) rather than through the prediction itself, so the better fit with the square root suggests the data are better described by that form. Is there anything deeper than that going on? Also, although R says link="identity" should be possible, it always fails with "Error: no valid set of coefficients has been found: please supply starting values". Is the problem with me or with R?
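For reference, a small sketch of what the link setting controls, again on simulated data rather than my real data. In a GLM the link g is applied to the mean, g(E[votes]) = linear predictor, so the sqrt link implies E[votes] = (linear predictor)^2 and the log link implies E[votes] = exp(linear predictor). The data frame sim and the starting values below are illustrative assumptions. With the identity link the fitted mean must stay positive for every row, which is presumably why glm() asks for starting values when its default initial fit produces a non-positive mean.

```r
# Simulated counts whose true mean is linear in x (illustration only).
set.seed(42)
sim <- data.frame(x = runif(200))
sim$votes <- rpois(200, lambda = 1 + 2 * sim$x)

fit_sqrt <- glm(votes ~ x, data = sim, family = poisson(link = "sqrt"))
fit_log  <- glm(votes ~ x, data = sim, family = poisson(link = "log"))

# Same response and same likelihood family, so AIC is directly comparable.
AIC(fit_sqrt, fit_log)

# Identity link: start from intercept = mean(votes), slope = 0, so that the
# initial fitted mean is positive for every observation.
fit_id <- glm(votes ~ x, data = sim,
              family = poisson(link = "identity"),
              start = c(mean(sim$votes), 0))
coef(fit_id)
```

The start vector needs one value per coefficient, in the order of the model matrix columns. For a model with many predictors, such as votes ~ ., a vector with the mean response first and zeros for the remaining coefficients would be the analogous choice, assuming the mean response is positive.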