I have some background in stochastic processes (especially analysis of nonstationary signals), but besides being a beginner in R, I have never worked with regression models before. I have some doubts about interpreting the output of the summary() function in R when it is applied to a fitted glm model. Suppose I used the following command to fit a generalized linear model to my data:
glm_model <- glm(Output ~ (Input1*Input2) + Input3 + Input4, data = mydata)
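As far as I understand, since I did not pass a family argument, glm() defaults to family = gaussian() with an identity link, so I believe the call above is equivalent to:

```r
# Assuming the documented default: glm() uses family = gaussian()
# (identity link) when no family argument is given
glm_model <- glm(Output ~ (Input1*Input2) + Input3 + Input4,
                 family = gaussian(), data = mydata)
```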
Then I use summary(glm_model) to obtain the following:
Call:
glm(formula = Output ~ (Input1*Input2) + Input3 + Input4, data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-7.4583  -0.8985   0.1628   1.0670   6.0673  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.522e+00  6.553e-02 130.041  < 2e-16 ***
Input1        -3.819e-04  3.021e-05 -12.642  < 2e-16 ***
Input2        -2.557e-04  2.518e-05 -10.156  < 2e-16 ***
Input3        -3.202e-02  1.102e-02  -2.906  0.00367 ** 
Input4        -1.268e-01  7.608e-02  -1.666  0.09570 .  
Input1:Input2  1.525e-08  2.521e-09   6.051 1.53e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 2.487504)

    Null deviance: 18544  on 5959  degrees of freedom
Residual deviance: 14811  on 5954  degrees of freedom
  (1708 observations deleted due to missingness)
AIC: 22353

Number of Fisher Scoring iterations: 2
From an estimation theory perspective, I understand that "Estimate" and "Std. Error" are the point estimates and the (estimated) standard deviations of the estimators of the unknown parameters (beta1, beta2, ...) of my model. However, there are some things I do not understand:
How can I assess how good my fit is from the output of summary()? The standard deviations of the parameter estimators alone cannot tell us the goodness of fit. I would expect to need access to the sampling distribution of a given parameter estimator, so that I could compute the percentage of estimates within +-1 std, +-0.5 std, or any +-x*std, for example. Another option would be knowing the theoretical distribution of the parameter estimator, so as to compute its Cramer-Rao lower bound and compare it with the reported standard error.

What does the t value (or Pr(>|t|)) have to do with goodness of fit? Since I am not familiar with regression models, I do not know the connection between the Student's t distribution and the estimation of the model parameters. What does it mean? Is the parameter estimator of the glm model distributed according to a Student's t pdf (like the sample mean for small samples from a population with unknown variance)? What conclusions should I draw from Pr(>|t|)?
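For instance, if I read the coefficient table correctly, the t value column seems to be simply the ratio of the first two columns; checking with the Input1 row of my output above:

```r
# Checking my understanding: t value appears to be Estimate / Std. Error.
# Numbers taken from the Input1 row of my summary() output:
estimate  <- -3.819e-04
std_error <-  3.021e-05
estimate / std_error   # roughly -12.64, matching the reported t value
```

Is this the right reading, and if so, what exactly is this ratio compared against?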
Do we have a more general way of assessing the goodness of fit, such as a measure of the variability in the data that my model can capture, perhaps with a table of critical values for such a measure at a given significance level?
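For example, would a deviance-based ratio like the one below count as such a measure? (I am using the null and residual deviance figures reported by summary(); I am not sure whether this is a standard statistic.)

```r
# Fraction of the null deviance "explained" by the model,
# using the deviances reported in my summary() output
null_dev  <- 18544
resid_dev <- 14811
1 - resid_dev / null_dev   # about 0.20
```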
When fitting a glm model, do we need to specify a significance level? If so, why is such information not provided by the summary() function?
The summary() function outputs some information-theoretic measures, such as AIC: 22353. Can we define an optimal reference value for AIC? What is a good AIC value? My intuition is that we cannot, just as for other information-theoretic measures (mutual information, entropy, ...).
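If my intuition is right, I suppose AIC could only be used to compare models fitted to the same data, along these lines (glm_simpler is just a hypothetical reduced model without the interaction term):

```r
# Hypothetical comparison: the same model without the interaction term
glm_simpler <- glm(Output ~ Input1 + Input2 + Input3 + Input4, data = mydata)
AIC(glm_model, glm_simpler)  # the model with the lower AIC would be preferred
```

Is that the only legitimate use of the reported AIC?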