Is it problematic to use the ratio of two measurements as a response variable in linear regression?
Motivating Example: Modeling Differences
My motivation comes from the following understanding of differences of two measurements as response variables:
To understand how something changes over time using two measurements taken in series, people often model the difference:
(T2i - T1i) = b0 + ei
By rearrangement, this is equivalent to:
T2i = b0 + 1.0 * T1i + ei
So the relationship between T1 and T2 above is fixed at 1:1.
Better would be:
T2i = b0 + b1 * T1i + ei
so that the relationship between T1 and T2 is estimated, not assumed. If indeed the relationship is not 1:1, this model will be better than the difference model.
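To make the comparison concrete, here is a minimal sketch on simulated data. The sample size, the true slope of 0.6, and the noise levels are all assumptions chosen for illustration, not values from any real study. It fits the difference model (slope fixed at 1) and the regression model (slope estimated) and compares their residual sums of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: the true T1 -> T2 slope is 0.6, not 1
t1 = rng.normal(50.0, 10.0, n)
t2 = 20.0 + 0.6 * t1 + rng.normal(0.0, 3.0, n)

# Difference model: (T2 - T1) = b0 + e, i.e. the slope on T1 is fixed at 1
b0_diff = np.mean(t2 - t1)
rss_diff = np.sum((t2 - t1 - b0_diff) ** 2)

# Regression model: T2 = b0 + b1 * T1 + e, slope estimated by least squares
X = np.column_stack([np.ones(n), t1])
coef, *_ = np.linalg.lstsq(X, t2, rcond=None)
b0_reg, b1_reg = coef
rss_reg = np.sum((t2 - X @ coef) ** 2)

print("difference model: b0 =", b0_diff, "RSS =", rss_diff)
print("regression model: b0 =", b0_reg, "b1 =", b1_reg, "RSS =", rss_reg)
```

Because the difference model is the regression model with b1 constrained to 1, the regression's residual sum of squares can never be larger on the same data. The gap is substantial only when the true slope is far from 1.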
My Question: What About Ratios?
Suppose we are analyzing the quantities of two parts that make a whole as a ratio, or a density in which we divide count data by effort data. So ...
(P1i / P2i) = b0 + ei
By rearrangement:
P1i = P2i * (b0 + ei)
If we distribute:
P1i = b0 * P2i + ei * P2i
What (on Earth) does this assume?
Is it problematic?
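One thing the distributed form seems to imply is that, on the scale of P1, the error term ei * P2i has a spread that grows in proportion to P2. A small simulation can check this. Everything here (b0 = 0.5, the range of P2, the error SD) is an assumed, made-up setup in which the ratio model is exactly true.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data generated exactly from the ratio model with b0 = 0.5
p2 = rng.uniform(1.0, 20.0, n)
e = rng.normal(0.0, 0.1, n)
p1 = p2 * (0.5 + e)

# Intercept-only model on the ratio: the estimate of b0 is the mean ratio
b0_ratio = np.mean(p1 / p2)

# Residuals on the P1 scale, split at the median of P2
resid_p1 = p1 - b0_ratio * p2
small = p2 < np.median(p2)
sd_small = resid_p1[small].std()
sd_large = resid_p1[~small].std()

# Residuals on the ratio scale, same split
resid_ratio = p1 / p2 - b0_ratio
sd_ratio_small = resid_ratio[small].std()
sd_ratio_large = resid_ratio[~small].std()

print("P1-scale residual SD, small vs large P2:", sd_small, sd_large)
print("ratio-scale residual SD, small vs large P2:", sd_ratio_small, sd_ratio_large)
```

Under these assumptions the ratio-scale residuals have roughly constant spread, while the P1-scale residuals fan out as P2 increases. So the ratio model is not wrong in itself; it encodes a specific claim about how the variance of P1 scales with P2.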
What if we have a linear predictor in the model? We end up with:
P1i = b0 * P2i + b1 * X1i * P2i + ei * P2i
Is it now true that b0 is an estimate of the relationship between P2 and P1, and that b1 is an estimate of the interactive effect of X1 and P2 on P1? It shouldn't be, because now there is no intercept, and no estimate of the main effect of X1 is made. And what of the error term being multiplied by P2 -- that can't be good ... ?
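The rearrangement above can be checked numerically. Minimizing squared error on the ratio scale is the same as minimizing squared error on the P1 scale with each observation weighted by 1 / P2^2, so regressing P1 / P2 on X1 should give the same coefficients as a weighted, no-intercept regression of P1 on P2 and X1 * P2. The sketch below uses simulated data; the coefficients 0.5 and 0.2 and the noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical data following the ratio model with a predictor X1
p2 = rng.uniform(1.0, 20.0, n)
x1 = rng.normal(0.0, 1.0, n)
p1 = p2 * (0.5 + 0.2 * x1 + rng.normal(0.0, 0.1, n))

# Ratio regression: (P1 / P2) = b0 + b1 * X1 + e
A = np.column_stack([np.ones(n), x1])
coef_ratio, *_ = np.linalg.lstsq(A, p1 / p2, rcond=None)

# Weighted least squares on the P1 scale, no intercept:
#   P1 = b0 * P2 + b1 * (X1 * P2) + error, with weights 1 / P2^2.
# WLS is OLS after scaling every row by sqrt(weight) = 1 / P2.
B = np.column_stack([p2, x1 * p2])
sqrt_w = 1.0 / p2
coef_wls, *_ = np.linalg.lstsq(B * sqrt_w[:, None], p1 * sqrt_w, rcond=None)

print("ratio regression coefficients:", coef_ratio)
print("weighted no-intercept coefficients:", coef_wls)
```

The two coefficient vectors agree to numerical precision, which confirms the reading in the text: in the ratio model, b0 plays the role of the slope on P2 and b1 the coefficient on the X1 * P2 product, with no intercept and no main effect of X1 on the P1 scale.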
A better model would appear to be:
P1i = b0 + b1 * P2i + b2 * X1i + ei
Now we are predicting P1 controlling for P2. Don't we get the same information here?
It is clearer that we do get the information we're looking for if we think instead about density, where P1 is, say, the number of objects observed, and P2 is the effort spent sampling for them. Controlling for effort makes perfect sense, but is it equally valid to do division to create the response variable?
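As a rough way to explore whether the two approaches give the same information, the sketch below simulates data from the covariate model and then fits both models. The setup is an assumption that deliberately favors the covariate model (a nonzero intercept of 5, additive effects, constant error variance, and a continuous stand-in for counts). If the data instead followed the ratio model exactly, the comparison could go the other way.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Hypothetical data from the covariate model: P1 = 5 + 2 * P2 + 3 * X1 + e
effort = rng.uniform(1.0, 20.0, n)  # plays the role of P2
x1 = rng.normal(0.0, 1.0, n)
count = 5.0 + 2.0 * effort + 3.0 * x1 + rng.normal(0.0, 1.0, n)  # plays P1

# Covariate model: P1 = b0 + b1 * P2 + b2 * X1 + e
C = np.column_stack([np.ones(n), effort, x1])
coef_cov, *_ = np.linalg.lstsq(C, count, rcond=None)
rss_cov = np.sum((count - C @ coef_cov) ** 2)

# Ratio model: (P1 / P2) = a0 + a1 * X1 + e, back-transformed to the P1 scale
R = np.column_stack([np.ones(n), x1])
coef_ratio, *_ = np.linalg.lstsq(R, count / effort, rcond=None)
pred_ratio = effort * (R @ coef_ratio)
rss_ratio = np.sum((count - pred_ratio) ** 2)

print("covariate model coefficients:", coef_cov, "RSS:", rss_cov)
print("ratio model coefficients:", coef_ratio, "RSS on P1 scale:", rss_ratio)
```

In this setup the ratio model has no way to represent the intercept or an effect of X1 that does not scale with effort, so its coefficients are not estimates of the same quantities and its fit on the P1 scale is worse. That suggests the two models carry the same information only when P1 is truly proportional to P2.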