Dependent variable - number of visitors to a historical monument by day
Independent variables - Daily average temperature, relative humidity, number of tourists visiting the state by day, etc.
My task is to understand the key drivers that influence the number of visitors. So far, I have done the following:
1) Fitted a multiple regression in R with LN(Number of Visitors) ~ independent variables using the lm() function (which is part of base R's stats package, not MASS). I also transformed some of the independent variables, as recommended by BoxCoxTrans() from the caret package. The resulting regression diagnostics look pretty decent to me. The R-squared was approximately 25%, which is satisfactory to me given the data that I have.
2) I have also tried fitting a negative binomial model with glm.nb() from the MASS package, because a test for over-dispersion indicated that the dependent variable is over-dispersed. The resulting regression diagnostics look pretty decent to me as well.
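For readers who want to reproduce the two approaches, here is a minimal sketch on simulated data. The data frame, the column names (temp, rh, visitors) and the +1 inside the log are assumptions made for illustration, not taken from my actual data set. The over-dispersion check shown (Pearson chi-square divided by the residual degrees of freedom of a Poisson fit) is one common rough check and not necessarily the test I ran.

```r
library(MASS)  # provides glm.nb(); lm() and glm() come from the stats package

set.seed(1)
n <- 500
# Hypothetical stand-in data; column names are illustrative only
d <- data.frame(temp = rnorm(n, 20, 5), rh = runif(n, 30, 90))
mu <- exp(4 + 0.05 * d$temp - 0.01 * d$rh)
d$visitors <- rnbinom(n, mu = mu, size = 2)  # over-dispersed counts

# Approach 1: linear regression on the log scale
# (the +1 is an assumption to guard against zero counts)
fit_lm <- lm(log(visitors + 1) ~ temp + rh, data = d)

# Rough over-dispersion check: values well above 1 suggest over-dispersion
fit_pois <- glm(visitors ~ temp + rh, family = poisson(), data = d)
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Approach 2: negative binomial GLM with a log link
fit_nb <- glm.nb(visitors ~ temp + rh, data = d)

print(dispersion)
print(coef(fit_lm))
print(coef(fit_nb))
```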
The residuals are reasonably well-behaved in both cases, given that this is real-world data. However, the two models differ substantially in their coefficient estimates: for example, a 1-degree increase in temperature is associated with a 10% increase in the number of visitors under the linear regression model, but a 30% increase under the GLM with a Poisson or quasi-Poisson distribution.
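To make the percentage statements concrete: for a model that is linear on the log scale, a coefficient b corresponds to a multiplicative change of exp(b) per one-unit increase in the predictor. The helper name pct_change below is hypothetical, a sketch rather than a standard function. Two caveats apply when comparing the models. The linear model on LN(y) describes the mean of log(y), whereas a log-link GLM describes the log of the mean of y. Also, if predictors were transformed or rescaled before fitting (as data = transformed in the lm() call suggests), a coefficient refers to one unit of the transformed variable, so the two sets of coefficients may not be on the same scale.

```r
# Percent change in the response per one-unit increase in a predictor,
# for a coefficient estimated on the log scale
pct_change <- function(beta) 100 * (exp(beta) - 1)

pct_change(0.10)       # about 10.5
pct_change(0.2362620)  # Temp coefficient from the quasi-Poisson output: about 26.7
```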
I would like to cross-validate with the community to make sure that I am not using an inappropriate technique for the type of data I have, and to find out which of the two techniques is better suited to it. Thank you!
Output from the lm() is as follows:
Call:
lm(formula = CT ~ Review + MinTemp + RH + Delta + xRate + PercIntl + PercOnline + CSI, data = transformed)
Residuals:
     Min       1Q   Median       3Q      Max
-2.31465 -0.57769 -0.03228  0.56113  2.96008
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.150803   0.009828 -15.344  < 2e-16 ***
Review       0.103383   0.009788  10.562  < 2e-16 ***
MinTemp     -0.275583   0.012636 -21.809  < 2e-16 ***
RH           0.190549   0.011313  16.844  < 2e-16 ***
DeltaMax     0.030461   0.010626   2.867  0.00416 **
xRate        0.181127   0.013951  12.983  < 2e-16 ***
PercIntl     0.318809   0.010610  30.049  < 2e-16 ***
PercOnline  -0.212168   0.011827 -17.939  < 2e-16 ***
CCI         -0.080672   0.011022  -7.319 2.79e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8085 on 6855 degrees of freedom
(1847 observations deleted due to missingness)
Multiple R-squared: 0.2495, Adjusted R-squared: 0.2486
F-statistic: 284.8 on 8 and 6855 DF, p-value: < 2.2e-16
Output from the glm() is as follows:
Call:
glm(formula = CT ~ Reviews + Delta + RH + xRate +
PercOnline + PercIntl + CSI + Temp, family = quasipoisson(),
data = tp, subset = !selector)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-46.404 -12.484  -5.329   5.063 100.196
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.7583433  1.0223177  -3.676 0.000239 ***
Review      -0.0201375  0.0010352 -19.453  < 2e-16 ***
DeltaMax     0.0063672  0.0015213   4.185 2.89e-05 ***
RH           0.0019643  0.0006838   2.873 0.004083 **
xRate        0.1009589  0.0082975  12.167  < 2e-16 ***
PercOnline  -0.0233884  0.0012857 -18.192  < 2e-16 ***
PercIntl     0.0148912  0.0011250  13.236  < 2e-16 ***
CSI         -0.0068745  0.0009345  -7.356 2.13e-13 ***
Temp         0.2362620  0.0149763  20.428 1.39e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 312.9381)
Null deviance: 2145768 on 6303 degrees of freedom
Residual deviance: 1591885 on 6295 degrees of freedom
(331 observations deleted due to missingness)
AIC: 67908
Number of Fisher Scoring iterations: 5
Analysis of Deviance Table (Type II tests)
Response: CT
           LR Chisq Df Pr(>Chisq)
Review       372.02  1  < 2.2e-16 ***
Delta         17.47  1  2.912e-05 ***
RH             8.24  1   0.004103 **
xRate        150.20  1  < 2.2e-16 ***
PercOnline   337.02  1  < 2.2e-16 ***
PercIntl     163.27  1  < 2.2e-16 ***
CSI           53.68  1  2.362e-13 ***
Temp          42.02  1  9.052e-11 ***
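As a side note on the dispersion parameter of about 313 reported in the quasi-Poisson summary: a quasi-Poisson fit gives the same coefficient estimates as a plain Poisson fit and inflates the standard errors by the square root of the estimated dispersion. The sketch below illustrates this on simulated data. The variables x and y are made up for illustration and are unrelated to my data set.

```r
set.seed(2)
n <- 400
x <- rnorm(n)                                       # hypothetical predictor
y <- rnbinom(n, mu = exp(3 + 0.3 * x), size = 1.5)  # over-dispersed counts

fit_p  <- glm(y ~ x, family = poisson())
fit_qp <- glm(y ~ x, family = quasipoisson())

# Estimated dispersion: Pearson chi-square / residual degrees of freedom
phi <- summary(fit_qp)$dispersion

se_p  <- coef(summary(fit_p))[, "Std. Error"]
se_qp <- coef(summary(fit_qp))[, "Std. Error"]

# Same point estimates; standard errors scaled by sqrt(phi)
same_coef <- isTRUE(all.equal(coef(fit_p), coef(fit_qp)))
scaled_se <- isTRUE(all.equal(se_qp, se_p * sqrt(phi)))
print(c(same_coef = same_coef, scaled_se = scaled_se))
```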