GLM with empirical distribution

Question

If I understand GLM correctly, to run a GLM model I need to specify the particular transformation $f$ that ensures the conditional distribution of $f(Y)$ given $X$ is from the exponential family. (I also need to make sure $f$ is one of the transformations that's easy to work with from the computational perspective.)

But let's say I don't have much confidence about which $f$ to use; all I have is a dataset large enough that I can get a decent empirical distribution.

What would be a good approach in this situation?

EDIT: Will try to address @dsaxton's and @glen_b's comments. Let's say I have a dataset of second-by-second heart rate for many people over many workout sessions. And let's say I want to be able to predict the heart rate of a random person given the amount of time elapsed since the start of the workout, perhaps accounting for fixed effects for each person.

Generally the exponential family "assumption" is more of a working hypothesis than something that's actually believed literally to be true. You may want to give a bit more detail about the particular problem you're working on. — dsaxton, Mar 27 '16 at 01:11
What's the model *for*? One doesn't just "run a model" ... why estimate a model? — Glen_b, Mar 27 '16 at 03:31

score 1 · Accepted Answer · answered Mar 27 '16 at 03:28

You cannot hope to identify a transformation that guarantees a distribution is in the exponential family*; indeed, with discrete data, like count data, the notion of using a transformation at all doesn't generally hold up.

* there's simply no way to know you succeeded.

Often what you would seek to do is choose a member of the exponential family that makes for a reasonable description of your data (especially the conditional mean and variance). In the case of continuous random variables you might consider a transformation as a first step but in many cases it may not be needed.

Often suitable mean and variance functions will suggest themselves (or more correctly, will arise from an understanding of the variables and other subject-matter knowledge). If a transformation is needed to get a suitable description of mean and variance it will also tend to be clear at this stage.
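As a rough illustration of letting the data suggest a variance function, here is a minimal sketch that bins the response by the predictor and regresses log(variance) on log(mean) across bins. The slope estimates $p$ in $\operatorname{Var}(Y\mid X) \propto E(Y\mid X)^p$: values near 0 point toward a constant-variance (normal) model, near 1 toward Poisson-like variance, near 2 toward gamma, near 3 toward inverse Gaussian. The function name `variance_power`, the bin count, and the simulated gamma data are all assumptions made for the example, not part of the question's dataset.

```python
import math
import random
import statistics


def variance_power(x, y, n_bins=10):
    """Estimate p in Var(Y|X) ~ c * E(Y|X)**p from binned data.

    Sorts observations by x, cuts them into n_bins equal-sized groups,
    and fits a least-squares line to log(variance) against log(mean).
    """
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_bins
    log_m, log_v = [], []
    for b in range(n_bins):
        group = [yi for _, yi in pairs[b * size:(b + 1) * size]]
        m = statistics.fmean(group)
        v = statistics.variance(group)
        if m > 0 and v > 0:  # logs need positive values
            log_m.append(math.log(m))
            log_v.append(math.log(v))
    # Ordinary least-squares slope of log-variance on log-mean.
    mbar = statistics.fmean(log_m)
    vbar = statistics.fmean(log_v)
    sxy = sum((a - mbar) * (c - vbar) for a, c in zip(log_m, log_v))
    sxx = sum((a - mbar) ** 2 for a in log_m)
    return sxy / sxx


# Simulated data (an assumption for illustration): gamma response whose
# mean is exp(0.5 + 0.3 * x), so the variance is proportional to mean**2.
rng = random.Random(0)
x = [rng.uniform(0, 5) for _ in range(20000)]
shape = 4.0
y = [rng.gammavariate(shape, math.exp(0.5 + 0.3 * xi) / shape) for xi in x]

p_hat = variance_power(x, y)
print(f"estimated variance power: {p_hat:.2f}")
```

This is only a diagnostic, and a crude one: the bins blur the conditional mean, and the estimate says nothing about the shape of the distribution beyond its first two moments. It is best used alongside subject-matter knowledge, not in place of it.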

In the absence of any understanding of the variables or any kind of likely relationships between them (such as the conditional means and variances I mentioned earlier)* - no previous studies, no experts exist, no nothing - one is pretty much left with using the data to identify a model. To avoid the problems of using the same observations both to identify and to estimate a model, you could consider pulling off a subset of the data to use in model identification, and then estimate on the remainder. If you also need to do (say) variable selection you may even want to consider splitting into more than two parts, or using cross validation after initial identification.

* complete absence of any information at all would seem bizarre, since usually we know before we collect information whether we're dealing with bounded variables for example.
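The splitting idea above can be sketched as follows. This is a minimal example, assuming a simple random partition of row indices; the helper name `split_indices` and the 20/20/60 proportions are hypothetical choices for illustration, not recommendations.

```python
import random


def split_indices(n, fractions, seed=0):
    """Randomly partition range(n) into disjoint parts sized by `fractions`."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes whatever remains, so rounding never drops a row.
        stop = n if i == len(fractions) - 1 else start + round(frac * n)
        parts.append(idx[start:stop])
        start = stop
    return parts


# One part to identify the model form, one for variable selection,
# and the remainder for estimation.
identify, select, estimate = split_indices(1000, [0.2, 0.2, 0.6])
print(len(identify), len(select), len(estimate))
```

Each part is then used for exactly one purpose, so the observations that suggested the model form are not the ones used to estimate it.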
