
I'm working on a small project where I need to build a multivariate linear regression model to predict the frequency (freq) of some airline companies. I'm removing variables one at a time, and I'm unsure whether I should also remove the intercept: after dropping prbi and then dist, its Pr(>|t|) is the largest of all the remaining coefficients. Here is the output at each step:

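# Full model with all six predictors (no data argument is given,
# so the variables are looked up in the workspace)
flights_lm = lm(freq ~ dist + capa + nbrt + depf + lcco + prbi)
summary(flights_lm)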
##################################################################
# > summary(flights_lm)
# 
# Call:
#   lm(formula = freq ~ dist + capa + nbrt + depf + lcco + prbi)
# 
# Residuals:
#   Min      1Q  Median      3Q     Max 
# -204884  -12347    1145   12382  297908 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.857e+04  1.487e+04   1.248  0.21437    
# dist        -5.145e+00  6.729e+00  -0.765  0.44610    
# capa        -7.928e+01  6.540e+01  -1.212  0.22784    
# nbrt         7.665e+01  7.188e+00  10.663  < 2e-16 ***
# depf         3.408e-05  1.204e-05   2.832  0.00546 ** 
# lcco         3.531e+04  2.151e+04   1.642  0.10339    
# prbi         4.084e+00  2.017e+01   0.203  0.83988    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1                                      
#                                                                                                     
# Residual standard error: 60280 on 116 degrees of freedom                                            
# Multiple R-squared:  0.8719,  Adjusted R-squared:  0.8653                                           
# F-statistic: 131.6 on 6 and 116 DF,  p-value: < 2.2e-16                                             
#####################################################################

# Drop prbi, which has the largest p-value in the full model
flights_lm2 = update(flights_lm, . ~ . - prbi)
summary(flights_lm2)
####################################################################
# Call:
#   lm(formula = freq ~ dist + capa + nbrt + depf + lcco)
# 
# Residuals:
#   Min      1Q  Median      3Q     Max 
# -204913  -12471    1098   12201  297917 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.918e+04  1.450e+04   1.323  0.18854    
# dist        -5.132e+00  6.701e+00  -0.766  0.44528    
# capa        -7.813e+01  6.488e+01  -1.204  0.23093    
# nbrt         7.665e+01  7.158e+00  10.708  < 2e-16 ***
# depf         3.406e-05  1.199e-05   2.842  0.00529 ** 
# lcco         3.506e+04  2.138e+04   1.639  0.10382    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 60030 on 117 degrees of freedom
# Multiple R-squared:  0.8719,  Adjusted R-squared:  0.8664 
# F-statistic: 159.3 on 5 and 117 DF,  p-value: < 2.2e-16
#####################################################################

# Drop dist, which has the largest p-value in the second model
flights_lm3 = update(flights_lm2, . ~ . - dist)
summary(flights_lm3)
#####################################################################
# Call:
#   lm(formula = freq ~ capa + nbrt + depf + lcco)
# 
# Residuals:
#   Min      1Q  Median      3Q     Max 
# -206975  -12147    4077   11489  297630 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.526e+04  1.355e+04   1.127  0.26212    
# capa        -9.031e+01  6.279e+01  -1.438  0.15303    
# nbrt         7.705e+01  7.127e+00  10.811  < 2e-16 ***
# depf         3.302e-05  1.189e-05   2.778  0.00637 ** 
# lcco         3.329e+04  2.122e+04   1.569  0.11939    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 59920 on 118 degrees of freedom
# Multiple R-squared:  0.8713,  Adjusted R-squared:  0.8669 
# F-statistic: 199.6 on 4 and 118 DF,  p-value: < 2.2e-16
################################################################
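
For reference, here is a minimal sketch of what removing the intercept would look like in R, and why the summary statistics of the two fits should be compared with care. It uses simulated data only: `x`, `y`, and the coefficients are hypothetical and are not the airline variables above.

```r
# Simulated data; x, y and the true coefficients are hypothetical.
set.seed(1)
n <- 123
x <- runif(n, 10, 100)
y <- 50 + 2 * x + rnorm(n, sd = 20)

fit_int   <- lm(y ~ x)                    # usual model with an intercept
fit_noint <- update(fit_int, . ~ . - 1)   # same model forced through the origin

coef(fit_int)
coef(fit_noint)

# summary.lm() measures R-squared about the mean of y when an intercept is
# present and about zero otherwise, so these two values are not comparable.
summary(fit_int)$r.squared
summary(fit_noint)$r.squared

# The residual sum of squares is comparable: dropping a term cannot lower it.
deviance(fit_int)
deviance(fit_noint)
```

Because the R-squared of the no-intercept fit is computed against a different baseline, a higher value there would not by itself suggest that the model through the origin fits better.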
Nick Cox
nidabdella
  • It's the other way round in my view. You need really good grounds to delete an intercept, i.e. forcing a model through the origin needs strong independent justification. The significance test for an intercept is something you can usually ignore. You may have different problems with this model, especially if it ever predicts negative frequencies. Poisson regression is a more natural first port of call for a counted response. – Nick Cox May 29 '15 at 22:50
  • The descriptor "multivariate" in "multivariate regression" is better reserved for regression models with multiple responses. You have a single response. It is no longer necessary even to say multiple regression. Having several predictors is routine, not exceptional. This is just a regression model (although the previous comment points you to Poisson regression). – Nick Cox May 29 '15 at 22:53
  • Thanks a lot for your answer and for the advice. I had forgotten that I might get negative frequencies. – nidabdella May 29 '15 at 23:05
  • See also http://stats.stackexchange.com/q/102709/17230 – Scortchi - Reinstate Monica Jun 01 '15 at 11:35
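
The Poisson regression suggested in the comments could be sketched roughly as follows. This is an illustration on simulated counts, not a fit to the data in the question: the names `freq` and `nbrt` are borrowed from the post, but the values, the single predictor, and the coefficients are assumptions made up for the example.

```r
# Simulated counts; the data-generating values are hypothetical.
set.seed(42)
n <- 123
nbrt <- rpois(n, 20)                      # made-up predictor
freq <- rpois(n, exp(0.5 + 0.1 * nbrt))   # made-up count response

# Ordinary linear model versus Poisson regression with a log link.
# Both keep the intercept.
fit_lm  <- lm(freq ~ nbrt)
fit_glm <- glm(freq ~ nbrt, family = poisson(link = "log"))

new <- data.frame(nbrt = c(0, 5, 50))

# Linear predictions are unbounded and may fall below zero.
predict(fit_lm, newdata = new)

# Predictions on the response scale are exp(linear predictor),
# so they are always positive.
predict(fit_glm, newdata = new, type = "response")
```

With the log link, the fitted mean is the exponential of the linear predictor, which is what rules out negative predicted frequencies.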
