
Suppose you have data generated by the nonlinear model

\begin{align} y_i&=\epsilon_i f(\beta,x_i) \end{align}

where $\epsilon_i$ is log-normally distributed with mean 1, and \begin{equation} f(\beta,x)=\frac{\beta_1x}{1+\beta_2x} \end{equation}
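For concreteness, here is a minimal simulation sketch of this setup in Python/NumPy. The parameter values, noise level, and $x$ grid are illustrative assumptions rather than values from the question; the only structural requirements are the saturating form of $f$ and $E[\epsilon_i]=1$, which forces $\log\epsilon_i\sim N(-\sigma^2/2,\,\sigma^2)$.

```python
import numpy as np

# Minimal simulation of the setup above. The "true" beta values, sigma, and the
# x grid are illustrative assumptions, not values from the question.
rng = np.random.default_rng(0)

def f(x, beta1, beta2):
    """Saturating curve with asymptote beta1 / beta2 as x grows large."""
    return beta1 * x / (1 + beta2 * x)

beta1_true, beta2_true, sigma_true = 2.0, 0.5, 0.3   # illustrative values
x = np.linspace(0.1, 20, 29)                          # 29 design points, as in the comments
# E[eps] = 1 requires log(eps) ~ N(-sigma^2/2, sigma^2)
eps = rng.lognormal(mean=-sigma_true**2 / 2, sigma=sigma_true, size=x.size)
y = eps * f(x, beta1_true, beta2_true)
```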

One way to estimate $\beta_1$ and $\beta_2$ is to do nonlinear regression on the log-transformed data using the following model:

\begin{align} \log(y_i)&=\log(f(\beta,x_i)) + e_i \end{align}
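As a sketch of this log-scale fit, continuing from the simulated `x` and `y` above (using `scipy.optimize.curve_fit`; the starting values are just one convenient, assumed choice):

```python
from scipy.optimize import curve_fit

# Nonlinear least squares on the log-transformed model,
# log(y_i) = log f(beta, x_i) + e_i, using the simulated x and y above.
def log_f(x, beta1, beta2):
    return np.log(f(x, beta1, beta2))

beta_log, _ = curve_fit(log_f, x, np.log(y), p0=[1.0, 1.0])  # p0 is just a guess
```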

where $e_i$ is now normally distributed (with a mean slightly below zero). However, because this $f$ saturates, the log transformation squashes the data near the asymptote $\beta_1/\beta_2$; in my experiments with simulated data this can lead to poor fits, in which the points closer to the origin seem to count too heavily towards the fit.

Based on this question, there seems to be a trade-off between handling the positively skewed errors and the ill effects of the transformation. My question is: if the error is log-normal, just how positively skewed does it have to be for the benefits of the log transform to outweigh its drawbacks? What other methods exist when the transformation leads to fits that are visibly poor just from eyeballing the data? Or is it best simply to minimize the SSE on the original model and ignore the fact that the data are positively skewed?
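For comparison, here is a sketch of that last option: plain nonlinear least squares on the untransformed model, ignoring the multiplicative, positively skewed error (again continuing from the simulated data and the `curve_fit` import above; the start values are again just guesses):

```python
# Plain least squares on the untransformed model, y_i = f(beta, x_i) + error,
# ignoring the skewed multiplicative error structure.
beta_raw, _ = curve_fit(f, x, y, p0=[1.0, 1.0])
print("log-scale fit:", beta_log, "raw-scale fit:", beta_raw)
```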

WetlabStudent
  • It was my bad; the title especially didn't help. I've played around with both minimizing the RSS and doing the same on the transformed data. Interestingly, neither does that great with 29 data points and an 'unknown' log-normal std of 0.3. I suspect what is going on is that, in addition to the transformation distorting the data, $E[e] \neq 0$. – WetlabStudent Jan 26 '15 at 01:44
  • If the assumptions are correct, using nonlinear least squares for the transformed fit is maximum likelihood, which generally should perform quite well. However, there might be convergence issues. Can you show in more detail what the problem is that you say is occurring? How are start values obtained? Are the results stable using different start values? – Glen_b Jan 26 '15 at 04:02
  • Is it maximum likelihood if $E[e] \neq 0$? The mean of the log-normal variable $\epsilon_i$ is one, which is not the same as $e_i = \log \epsilon_i$ having mean zero. I should try to write down the likelihood; that shouldn't be too hard here. I have discovered that the problem was that I was choosing $x$ values too close together. I imagine that is a problem for regression problems in general, but I am curious with regard to your maximum likelihood statement. – WetlabStudent Jan 26 '15 at 04:15
  • Sorry to have missed the point. The level will of course be biased by half the variance of the log of the error term ($\frac{1}{2}\sigma^2$); I think you can adjust for that. Is that all the problem is? – Glen_b Jan 26 '15 at 04:28
  • Yep, it's biased by that amount, but in practice the variance is unknown. Is there a theoretically sound way to estimate it simultaneously with the regression parameters? Perhaps iteratively: estimate the parameters, then the variance from the residuals, then re-estimate the parameters using the bias calculated from that variance, and repeat? I'm just testing the method on simulated data (where I set the variance to 0.3) before I look at the messier real data with unknown variance. – WetlabStudent Jan 26 '15 at 04:50
  • Ideally, the idea is to maximize the likelihood over all the parameters simultaneously. Your iterative scheme should work for that, but you can always verify it (e.g. by evaluating the derivative (steepest-ascent direction) at the end, or by taking small steps around the stopping point to check). There's also generic optimization functionality. – Glen_b Jan 26 '15 at 05:08 (a sketch of this joint-likelihood approach follows these comments)
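Following up on the comment thread: since $E[\epsilon_i]=1$ implies $E[e_i]=-\tfrac{1}{2}\sigma^2$, one way to implement the joint-likelihood suggestion is to note that $\log y_i \sim N\!\left(\log f(\beta,x_i)-\tfrac{1}{2}\sigma^2,\ \sigma^2\right)$ and maximize that likelihood over $\beta_1$, $\beta_2$, and $\sigma$ simultaneously, rather than correcting the bias after the fact. The sketch below is one illustrative implementation, not the commenters' code; the optimizer, parameterization, and start values are all assumptions, and it continues from the earlier blocks.

```python
from scipy.optimize import minimize

# Joint maximum likelihood over (beta1, beta2, sigma). With E[eps_i] = 1,
# log(y_i) ~ N(log f(beta, x_i) - sigma^2/2, sigma^2), so the negative-mean
# offset is handled inside the likelihood instead of being corrected afterwards.
# Assumes f(x, beta1, beta2) stays positive over the search.
def neg_log_lik(params, x, y):
    beta1, beta2, log_sigma = params
    sigma = np.exp(log_sigma)                        # keep sigma positive
    mu = np.log(f(x, beta1, beta2)) - sigma**2 / 2   # mean of log(y_i)
    return np.sum(np.log(sigma) + (np.log(y) - mu) ** 2 / (2 * sigma**2))

start = [beta_log[0], beta_log[1], np.log(0.3)]      # start from the log-scale fit
fit = minimize(neg_log_lik, start, args=(x, y), method="Nelder-Mead")
beta1_hat, beta2_hat = fit.x[:2]
sigma_hat = np.exp(fit.x[2])
```

Parameterizing in $\log\sigma$ keeps the scale parameter positive without explicit constraints.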

0 Answers