Interpreting GLM regression analysis result

Question

I'm using the following code in R to predict votes (i.e. non-negative integer count data).

m1<-glm(votes~.,data=trainset,family=poisson(link="sqrt"))
pred1<-predict(m1,newdata=testset,type="response")
> pred1[1:10]
        3         4         5         6         7         8        11        12        14        15 
0.8000618 1.4012718 0.9924539 0.9260005 0.2820739 0.3333504 0.3238205 0.5786863 1.1216740 1.0114024 
> str(pred1)
 Named num [1:14999] 0.8 1.401 0.992 0.926 0.282 ...
 - attr(*, "names")= chr [1:14999] "3" "4" "5" "6" ...

Q1: I don't understand the structure of the predicted output. That is, isn't pred1 simply a numeric vector? Why does each element in pred1 have a name? For example, the first element 0.8000618 has a name of 3, and so on. What's the purpose of that?

Q2: I get better results using link="sqrt" than with link="log" when fitting model m1. My understanding is that the link setting models the response as either sqrt(prediction) or log(prediction) rather than simply prediction, so the better fit with the square root says the data are better described by that equation. Is there anything deeper than that going on? Also, although R says link="identity" should be possible, it always gives the error "Error: no valid set of coefficients has been found: please supply starting values". Is the problem with me or with R?

Q1 is probably better posted on stackoverflow. For Q2 there is definitely more going on; I found these answers very helpful: http://stats.stackexchange.com/a/30909/22199, http://stats.stackexchange.com/questions/40876/difference-between-link-function-and-canonical-link-function-for-glm. Note that link="log" is the canonical link for the poisson model. — Alex, Nov 16 '15 at 22:47
@Alex I think there are enough underlying statistical issues here to make the overall question sufficiently on-topic, but it might be better split into two questions. — Glen_b, Nov 17 '15 at 04:48

Glen_b · Accepted Answer · 2015-11-17T21:13:49.027

The labels on predict should be the row labels on testset that you supplied.
You describe the link as if it transforms the data, but it does not: the linear predictor models the transformed mean, and the data themselves are not transformed. One link function will certainly fit better than another, but I wouldn't normally choose between them by comparing how well they fit. If you do, you need to take account of that selection step because (like any model selection) it affects the distribution of estimates, standard errors, p-values, ...
The problem with the identity link is related to the fact that it can lead to predicted values (outside the range of the x's supplied) that are negative. When this happens the log-likelihood isn't nicely quadratic unless you're very close to the optimum. Links that result in "impossible" fits close to the data often seem to be associated with convergence issues.
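
The first two points can be checked on simulated data. The following is only a sketch: the data frame, the single predictor x, and the coefficients 0.5 and 1.5 are made up for illustration and are not the asker's data.

```r
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
# Assumed true model (illustrative only): sqrt(mean) is linear in x
dat$votes <- rpois(n, lambda = (0.5 + 1.5 * dat$x)^2)

# Hold out a non-contiguous set of rows so the row labels are not 1, 2, 3, ...
test_rows <- c(3, 4, 5, 6, 7, 8, 11, 12, 14, 15)
trainset <- dat[-test_rows, ]
testset <- dat[test_rows, ]

m1 <- glm(votes ~ x, data = trainset, family = poisson(link = "sqrt"))

pred_response <- predict(m1, newdata = testset, type = "response")
pred_link <- predict(m1, newdata = testset, type = "link")

# Point 1: the names on the predictions are the row labels of testset
print(names(pred_response))
print(identical(names(pred_response), rownames(testset)))

# unname() drops the labels if a bare numeric vector is wanted
print(head(unname(pred_response)))

# Point 2: the link acts on the mean, not on the data.
# With the sqrt link, fitted mean = (linear predictor)^2; votes is never transformed.
print(all.equal(unname(pred_response), unname(pred_link)^2))
```

The response passed to glm stays on its original count scale throughout; only the relationship between the mean and the linear predictor changes with the link.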

Often you can get it to converge by starting it closer to the optimum than the default, or by playing with the convergence criteria (and other aspects of the control of the fitting), but in some cases nothing seems to help much.
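
As a rough illustration of both options, the sketch below supplies starting values and adjusts the convergence controls for an identity-link Poisson fit. The simulated data and the choice of starting point (the intercept-only solution) are assumptions made for the example; this does not guarantee convergence on harder data.

```r
set.seed(2)
n <- 300
x <- runif(n)
# Assumed true model (illustrative only): the mean is linear in x and positive on [0, 1]
y <- rpois(n, lambda = 2 + 3 * x)

# Start from the intercept-only solution: intercept = mean(y), slope = 0.
# Every starting mean is then mean(y) > 0, which is valid for the Poisson family.
start_vals <- c(mean(y), 0)

m_id <- glm(y ~ x,
            family = poisson(link = "identity"),
            start = start_vals,
            # Tighter tolerance and more iterations than the defaults
            control = glm.control(epsilon = 1e-8, maxit = 100))

print(coef(m_id))
print(m_id$converged)
```

Setting trace = TRUE in glm.control prints the deviance at each iteration, which can help show where a difficult fit goes wrong.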

The default fitter used by R's glm has one or two quirks that make it less likely to converge on those difficult cases than it would be with a small tweak to its behavior. [Identity links with Poisson models are one common bugbear for this.]

Fitting the glm via Ian Marschner's glm2 may do substantially better. He explains the issue in Marschner (2011)[1], and the package itself is available on CRAN.
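
A minimal sketch of the call, assuming the glm2 package is installed. glm2() takes the same formula, family and data arguments as glm(). The simulated data here are illustrative only and are not a case where glm itself fails; the point is just the interface.

```r
set.seed(3)
n <- 300
dat <- data.frame(x = runif(n))
# Assumed true model (illustrative only): identity link with a positive mean
dat$y <- rpois(n, lambda = 2 + 3 * dat$x)

if (requireNamespace("glm2", quietly = TRUE)) {
  # Same arguments as glm(); only the fitting routine differs
  m_glm2 <- glm2::glm2(y ~ x, data = dat, family = poisson(link = "identity"))
  print(m_glm2$converged)
  print(coef(m_glm2))
} else {
  message("glm2 is not installed; install.packages(\"glm2\") fetches it from CRAN")
}
```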

[1]: Marschner, I. (2011),
"glm2: Fitting generalized linear models with convergence problems,"
The R Journal, Vol. 3, Issue 2, pp. 12-15.
https://journal.r-project.org/archive/2011-2/RJournal_2011-2_Marschner.pdf

Thanks @Glen_b for answering in detail. I tried `glm2` just now and it converges where `glm` previously did not. — user46688, Nov 17 '15 at 18:54
