I'm working with a simple dataset (one response, one predictor) and have built several models for it, some of which I think are fairly decent given the data's structure. To get a decent model I had to apply some transformations, and I'd like to know whether I'm thinking about this data and the modeling in the correct way.
y = revenue
x = total touchpoints
Scatter plot of the data:
To me, the data in its current form is not suitable for modeling, so my first thought was to log-transform both variables, because the data looks strongly heteroskedastic.
Fitting a linear model to the log-transformed data looks like this:
The diagnostics look pretty good, though certainly not perfect:
The model summary:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.1034     0.1169   86.39   <2e-16 ***
log(total)    0.4304     0.0387   11.12   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.364 on 717 degrees of freedom
Multiple R-squared: 0.1471, Adjusted R-squared: 0.146
F-statistic: 123.7 on 1 and 717 DF, p-value: < 2.2e-16
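For reference, a minimal sketch of a fit that would produce output of this shape. The real data is not shown here, so the data frame `dat` and its simulated values are purely hypothetical stand-ins; only the predictor term `log(total)` and the 717 residual degrees of freedom are taken from the summary above, and `log(revenue)` as the response is an assumption based on the description.

```r
# Simulated stand-in data; the real dataset is not available here.
set.seed(1)
n <- 719                                             # 717 residual df + 2 coefficients
total <- round(exp(rnorm(n, mean = 2, sd = 1))) + 1  # hypothetical touchpoint counts (>= 1)
revenue <- exp(10 + 0.4 * log(total) + rnorm(n, sd = 1.3))
dat <- data.frame(total = total, revenue = revenue)

# Log-log linear model: log(revenue) regressed on log(total)
fit_lm <- lm(log(revenue) ~ log(total), data = dat)
summary(fit_lm)
```

Calling `plot(fit_lm)` on the fitted object draws the standard residual diagnostic panels.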
Thinking about the data a bit more, I tried a sqrt transformation on both revenue and total:
Both of those histograms look a lot like gamma distributions to me, so I proceeded to build a GLM with a gamma family and log link. Here is the model:
The model diagnostics seem to look much better than those of the log-transformed linear model:
The model summary looks like this:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.412646   0.047961  112.86   <2e-16 ***
sqrt(total) 0.092751   0.008491   10.92   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.4765069)
Null deviance: 374.02 on 718 degrees of freedom
Residual deviance: 312.53 on 717 degrees of freedom
AIC: 9605.7
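A minimal sketch of a gamma GLM call that would give a summary of this form, again on simulated data. The summary confirms only the predictor `sqrt(total)`, the Gamma family, and the degrees of freedom; using `sqrt(revenue)` as the response is an assumption based on the description of the transformations, and the data frame `dat` and the simulation parameters are hypothetical.

```r
# Simulated stand-in data; the real dataset is not available here.
set.seed(2)
n <- 719
total <- round(exp(rnorm(n, mean = 2, sd = 1))) + 1  # hypothetical touchpoint counts
mu <- exp(5.4 + 0.09 * sqrt(total))                  # mean of the response under a log link
shape <- 2                                           # gamma shape; dispersion = 1 / shape
y <- rgamma(n, shape = shape, rate = shape / mu)     # gamma draws with mean mu
dat <- data.frame(total = total, revenue = y^2)      # so that sqrt(revenue) equals y

# Gamma GLM with log link; the response sqrt(revenue) is an assumption
fit_gamma <- glm(sqrt(revenue) ~ sqrt(total),
                 family = Gamma(link = "log"), data = dat)
summary(fit_gamma)
```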
The gamma model's pseudo R-squared, computed below from the deviances, is not much different from the linear model's R-squared:
> 1 - (312.53/374.02)
[1] 0.164403
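The same deviance-based pseudo R-squared can be pulled from a fitted model object rather than typed in by hand. This is a sketch on simulated data; the object name `fit_gamma` and the simulated values are hypothetical stand-ins for the actual fit.

```r
# Simulated stand-in for the real fit
set.seed(3)
n <- 719
total <- round(exp(rnorm(n, mean = 2, sd = 1))) + 1
y <- rgamma(n, shape = 2, rate = 2 / exp(5.4 + 0.09 * sqrt(total)))
fit_gamma <- glm(y ~ sqrt(total), family = Gamma(link = "log"))

# Pseudo R-squared: 1 - residual deviance / null deviance
pseudo_r2 <- 1 - deviance(fit_gamma) / fit_gamma$null.deviance
pseudo_r2

# The same arithmetic with the deviances reported in the summary
reported_r2 <- 1 - 312.53 / 374.02
reported_r2
```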
And a goodness of fit test says the model fits adequately:
> 1 - pchisq(312.53/0.4765, 717)
[1] 0.9499516
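Written against a model object, the same scaled-deviance check might look like the sketch below. The data is simulated and `fit_gamma` is a hypothetical stand-in for the actual fit. Note that `summary()` estimates the dispersion from the Pearson residuals of the same model, so the scaled deviance is not independent of the fit being tested.

```r
# Simulated stand-in for the real fit
set.seed(4)
n <- 719
total <- round(exp(rnorm(n, mean = 2, sd = 1))) + 1
y <- rgamma(n, shape = 2, rate = 2 / exp(5.4 + 0.09 * sqrt(total)))
fit_gamma <- glm(y ~ sqrt(total), family = Gamma(link = "log"))

phi <- summary(fit_gamma)$dispersion      # Pearson-based dispersion estimate
scaled_dev <- deviance(fit_gamma) / phi   # residual deviance divided by dispersion
res_df <- df.residual(fit_gamma)          # residual degrees of freedom

# Upper-tail chi-squared probability, equivalent to 1 - pchisq(scaled_dev, res_df)
p_value <- pchisq(scaled_dev, df = res_df, lower.tail = FALSE)
p_value
```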
OK, so I tried a few other models, but you get the point. What I'm really curious about is whether I'm approaching this problem correctly: does my process seem sound, or am I completely overlooking important information in the modeling process? Any comments or advice would be greatly appreciated.