
I'm trying to fit a Tweedie model with statsmodels and was wondering if I have to transform the dependent variable before I run the model or if statsmodels does that automatically?

My model is as follows:

lost_cost_model = smf.OLS(y, x, family = sm.families.Tweedie(link = sm.families.links.log, var_power = 1.5), weights = weights)

Should it be:

lost_cost_model = smf.OLS(np.log(y), x, family = sm.families.Tweedie(link = sm.families.links.log, var_power = 1.5), weights = weights)

Or does family = sm.families.Tweedie(link = sm.families.links.log) transform the variable in the background?

Nick Cox
Jordan
    You should explain that `statsmodels` is a Python package available from github. You shouldn't assume that readers of this forum will be familiar with it. – Gordon Smyth Apr 17 '19 at 04:07

2 Answers


One of the fundamental motivations for generalized linear models (GLMs) is that they model the data as it is instead of transforming it. So the answer to your question is that there is no transformation, either by you or by the model.

If you want to understand this better, you could consult my book on GLMs with Peter Dunn (Dunn and Smyth, 2018). We discuss transformations of the response variable in Section 3.9 and contrast them with the GLM approach. We discuss transformations of predictor variables in Section 3.10, but that's a separate issue. Tweedie GLMs are covered in Chapter 12. Also see Why is GLM different than an LM with transformed variable?
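To see the distinction numerically, here is a small sketch (my own, using made-up lognormal data): regressing log(y) targets E[log Y], while a log-link GLM targets log E[Y], and by Jensen's inequality these are different quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lognormal sample: a positive, right-skewed response; log Y ~ N(1, 1)
y = rng.lognormal(mean=1.0, sigma=1.0, size=100_000)

# Transforming the response and modelling log(y) targets E[log Y] ...
mean_of_log = np.log(y).mean()   # ~ 1.0

# ... while a log-link GLM models log E[Y] directly
log_of_mean = np.log(y.mean())   # ~ 1.0 + 1.0**2 / 2 = 1.5

# Jensen's inequality: log E[Y] > E[log Y] for any non-degenerate Y
assert log_of_mean > mean_of_log
```

The two estimands coincide only in degenerate cases, which is one reason a log-link GLM is not equivalent to OLS on log(y).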

The particular Tweedie GLM you have defined (with var_power=1.5) has mass at zero (i.e., it has exact zeros) but is otherwise continuous on the positive real numbers (Smyth, 1996; Hasan and Dunn, 2012). Think of rainfall as an example. Many days there is no rain at all (exact zero) but, if there is any rain, then the actual amount of rain is continuous and positive. If your data is not like that, then you shouldn't be using this type of Tweedie GLM.
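As an illustration (a simulation sketch of my own, not from the references), a Tweedie response with 1 < var_power < 2 arises as a compound Poisson sum of gamma variables, which is exactly where the mass at zero comes from:

```python
import numpy as np

rng = np.random.default_rng(1)

# Compound Poisson-gamma: each observation is a Poisson number of
# gamma-distributed "events" (e.g. rain showers) summed together;
# zero events gives an exact zero, otherwise a continuous positive value.
n = 10_000
n_events = rng.poisson(lam=0.8, size=n)
y = np.array([rng.gamma(shape=2.0, scale=1.5, size=k).sum() for k in n_events])

zero_fraction = (y == 0).mean()  # mass at zero: P(N = 0) = exp(-0.8), about 0.45
positive_part = y[y > 0]         # strictly positive and continuous
```

If a histogram of your own response doesn't show this spike-at-zero-plus-continuous-tail shape, this family is the wrong choice.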

The code that you show, however, is fundamentally wrong. You cannot mix ordinary least squares (smf.OLS) with GLM families (sm.families.Tweedie). Did you perhaps mean smf.glm instead of smf.OLS? Otherwise I suggest that you ask on the statsmodels forum:

https://groups.google.com/forum/?hl=en#!forum/pystatsmodels

because I think you might be misunderstanding what the functions do.

By the way, the statsmodels function sm.families.Tweedie is a Python implementation of the tweedie function in the statmod R package, available from CRAN. See "Given a GLM using Tweedie, how do I find the coefficients?" for example code.

References

Dunn, P. K., and Smyth, G. K. (2018). Generalized Linear Models With Examples in R. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-0118-7

Hasan, M.M. and Dunn, P.K. (2012). Understanding the effect of climatology on monthly rainfall amounts in Australia using Tweedie GLMs. International Journal of Climatology, 32(7), pp.1006-1017.

Smyth, G. K. (1996). Regression modelling of quantity data with exact zeroes. Proceedings of the Second Australia-Japan Workshop on Stochastic Models in Engineering, Technology and Management. Technology Management Centre, University of Queensland, pp.572–580. http://www.statsci.org/smyth/pubs/RegressionWithExactZerosPreprint.pdf

Gordon Smyth
  • Excellent. Worth mentioning that there may be grounds for transforming predictors, which is separate territory? – Nick Cox Apr 17 '19 at 06:49
  • Thanks and great references. I’m going to amazon to grab your book. I’ve done this in R. But R is having issues with the amount of data for this project so I’m trying to bring into python to see if it’s any better. Ultimately I have to take this model and transform it into a piece wise model on some of my variables. Again, thank you. – Jordan Apr 17 '19 at 12:34
  • I know the Statsmodels GLM implementation pretty well. I would not expect it to be any more memory-efficient than what is provided in (base) R. There is some support for distributed estimation but I don't think it has been extensively tested: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/distributed_estimation.py – Kerby Shedden Apr 17 '19 at 19:39
  • @Jordan I don't have any trouble using R on my laptop to fit glms with 100 million responses. – Gordon Smyth Apr 18 '19 at 00:44
  • Wow. I sure did. I keep getting memory errors with 32 gigs of memory. Only 1 million observations with about 115 columns. Thanks again. Just downloaded your book. – Jordan Apr 18 '19 at 00:49
  • @Jordan Make sure that you're using 64bit R. Hope you find the book useful. – Gordon Smyth Apr 18 '19 at 03:36
  • I have found the differences between these two approaches rarely or poorly discussed in what I have read, so I too ordered a copy of your book. – Nick Cox Apr 19 '19 at 08:16
  • @NickCox Originally we devoted a *lot* of space in the book comparing transformations to glms. That got trimmed a bit in the final version but we still cover transformations and variance stabilisation in detail (Box-Cox, ladder of powers etc) before moving on to glms. Hope you find it useful. – Gordon Smyth Apr 28 '19 at 23:37
  • A mighty river delivered my copy. I like the approach! – Nick Cox Apr 29 '19 at 11:12

Below is a small working example. As noted, you want GLM, not OLS. Statsmodels (usually) silently ignores extra keyword arguments, so your code may run, but the summary output would have made it clear that you were fitting a linear model, not a GLM.

import statsmodels.api as sm

y = [3, 1, 5, 0, 1, 4, 0, 2, 5]
x = [2, 1, 3, 2, 2, 5, 3, 2, 6]
x = [[1, u] for u in x]  # put in an intercept column

model = sm.GLM(y, x, family=sm.families.Tweedie(var_power=1.5))
result = model.fit()
print(result.summary())
Kerby Shedden