From a technical perspective, I am wondering which approach I should follow for the modelling problem below.
I have a target variable `Y`, which is a continuous random variable defined on the interval [0, ∞). For this reason (and this is also supported by the data itself), I decided to use a Tweedie distribution. Moreover, I would like to have a multiplicative model, so I am using a log link function.
I also know that the variable `Y` depends linearly on the `time` variable. It is assumed that the larger `time` is, the higher the value of `Y` is.
Given these conditions I followed two different approaches:
- Modeling the variable directly and using `time` as a log offset. In R syntax, the model would look like the following: `glm(Y ~ X1 + X2 + ... + offset(log(time)), family = tweedie(link = "log"))`
- Modeling the ratio of `Y` and `time`, and using `time` as training weights. Defining `Y_time = Y / time`, we have `glm(Y_time ~ X1 + X2 + ..., weights = time, family = tweedie(link = "log"))`
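
To make the comparison concrete, the two specifications above can be written out together with the moments they imply. This is only a sketch under assumptions not stated so far: a Tweedie variance function V(μ) = μ^p with variance power p and dispersion φ (both symbols are introduced here for illustration), with x_i the covariate vector and t_i the value of `time` for observation i.

```latex
% Notation (assumed, not from the question): \mu_i = \exp(x_i^\top \beta),
% t_i = time, p = Tweedie variance power, \phi = dispersion.

% Approach 1: response Y_i with offset \log t_i
\begin{align*}
\log \mathbb{E}[Y_i] &= x_i^\top \beta + \log t_i
  &&\Rightarrow\quad \mathbb{E}[Y_i] = t_i \mu_i, \\
\operatorname{Var}(Y_i) &= \phi \, (t_i \mu_i)^{p} = \phi \, t_i^{p} \, \mu_i^{p}.
\end{align*}

% Approach 2: response Y_i / t_i with prior weights w_i = t_i
\begin{align*}
\log \mathbb{E}[Y_i / t_i] &= x_i^\top \beta
  &&\Rightarrow\quad \mathbb{E}[Y_i] = t_i \mu_i, \\
\operatorname{Var}(Y_i / t_i) &= \frac{\phi \, \mu_i^{p}}{t_i}
  &&\Rightarrow\quad \operatorname{Var}(Y_i) = \phi \, t_i \, \mu_i^{p}.
\end{align*}
```

Under these assumptions, both specifications imply the same mean for `Y` and differ only in how the variance scales with `time`: as t^p with the offset and as t with the weights. The two coincide when p = 1; for other values of p, weights of t^(2-p) rather than t would reproduce the variance implied by the offset model.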
Which approach is more theoretically sound?