I have some background in stochastic processes (especially analysis of nonstationary signals), but besides being a beginner in R, I have never worked with regression models before. I have some doubts about interpreting the output of the summary() function in R when it is applied to a fitted glm model. Suppose I used the following command to fit a generalized linear model to my data:
glm_model <- glm(Output ~ (Input1*Input2) + Input3 + Input4, data = mydata)
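As far as I understand, since I did not pass a family argument, glm() defaults to family = gaussian() with an identity link, so I believe the call above is equivalent to:

```r
# Assuming the documented default: glm() uses family = gaussian()
# (identity link) when no family argument is given
glm_model <- glm(Output ~ (Input1*Input2) + Input3 + Input4,
                 family = gaussian(), data = mydata)
```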
Then I use summary(glm_model) to obtain the following:
Call:
glm(formula = Output ~ (Input1*Input2) + Input3 + Input4, data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-7.4583  -0.8985   0.1628   1.0670   6.0673  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.522e+00  6.553e-02 130.041  < 2e-16 ***
Input1        -3.819e-04  3.021e-05 -12.642  < 2e-16 ***
Input2        -2.557e-04  2.518e-05 -10.156  < 2e-16 ***
Input3        -3.202e-02  1.102e-02  -2.906  0.00367 ** 
Input4        -1.268e-01  7.608e-02  -1.666  0.09570 .  
Input1:Input2  1.525e-08  2.521e-09   6.051 1.53e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 2.487504)

    Null deviance: 18544  on 5959  degrees of freedom
Residual deviance: 14811  on 5954  degrees of freedom
  (1708 observations deleted due to missingness)
AIC: 22353

Number of Fisher Scoring iterations: 2
From an estimation theory perspective, I understand that "Estimate" and "Std. Error" are the point estimates and the (estimated) standard deviations of the estimators of the unknown parameters (beta1, beta2, ...) of my model. However, there are some things I do not understand:
How can I assess how good my fit is from the output of summary()? The standard deviations of the parameter estimators alone cannot tell us the goodness of fit. I would expect to need access to the sampling distribution of a given parameter estimator, so that I could compute the percentage of estimates within +-1 std, +-0.5 std, or any +-x*std, for example. Another option would be knowing the theoretical distribution of the parameter estimator, so as to compute its Cramer-Rao lower bound and compare it with the reported standard error.

What does the t value (or Pr(>|t|)) have to do with goodness of fit? Since I am not familiar with regression models, I do not know the connection between the Student's t distribution and the estimation of the model parameters. What does it mean? Is the parameter estimator of the glm model distributed according to a Student's t pdf (like the sample mean for small samples from a population with unknown variance)? What conclusions should I draw from Pr(>|t|)?
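For instance, if I read the coefficient table correctly, the t value column seems to be simply the ratio of the first two columns; checking with the Input1 row of my output above:

```r
# Checking my understanding: t value appears to be Estimate / Std. Error.
# Numbers taken from the Input1 row of my summary() output:
estimate  <- -3.819e-04
std_error <-  3.021e-05
estimate / std_error   # roughly -12.64, matching the reported t value
```

Is this the right reading, and if so, what exactly is this ratio compared against?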
Do we have a more general way of assessing the goodness of fit, such as a measure of the variability in the data that my model can capture, perhaps with a table of critical values for such a measure at a given significance level?
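For example, would a deviance-based ratio like the one below count as such a measure? (I am using the null and residual deviance figures reported by summary(); I am not sure whether this is a standard statistic.)

```r
# Fraction of the null deviance "explained" by the model,
# using the deviances reported in my summary() output
null_dev  <- 18544
resid_dev <- 14811
1 - resid_dev / null_dev   # about 0.20
```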
When fitting a glm model, do we need to specify a significance level? If so, why is such information not provided by the summary() function?
The summary() function outputs some information-theoretic measures, such as AIC: 22353. Can we define an optimal reference value for AIC? What is a good AIC value? My intuition is that we cannot, just as for other information-theoretic measures (mutual information, entropy, ...).
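If my intuition is right, I suppose AIC could only be used to compare models fitted to the same data, along these lines (glm_simpler is just a hypothetical reduced model without the interaction term):

```r
# Hypothetical comparison: the same model without the interaction term
glm_simpler <- glm(Output ~ Input1 + Input2 + Input3 + Input4, data = mydata)
AIC(glm_model, glm_simpler)  # the model with the lower AIC would be preferred
```

Is that the only legitimate use of the reported AIC?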