
I am running a regression where my dependent variable is a cross-section of variances. Therefore, I require my predicted values (fitted values) to be positive.

However, when running a simple OLS regression, a small percentage of my fitted values are negative, which is non-intuitive in this case (since variance cannot be negative).

Please note that my dependent variable is approximately chi-square distributed.

The outputs I need from the regression are the fitted values on the original scale, as well as a closed-form expression for the MSE (mean squared error) of these fitted values.

Is there a way to impose a lower bound on the predicted values?

Mayou
  • The solution suggested is very risky, as mentioned in one of the comments. I am looking for an alternative solution. – Mayou Nov 25 '14 at 21:43
  • Have you considered some sort of GLM? – dimitriy Nov 25 '14 at 21:50
  • I considered GLM with log-link. However, I am not sure how to formulate the MSE (for the original $\hat{y}$) in the case of GLM with log-link. – Mayou Nov 25 '14 at 22:04

2 Answers


I am running a regression where my dependent variable is a cross-section of variances. Therefore, I require my predicted values (fitted values) to be positive.

Then don't fit a model that doesn't obey such an obvious requirement...

However, when running a simple OLS regression,

... like, you know, OLS.

Please note that my dependent variable is approximately chi-square distributed.

Or rather, since population variances are usually not $1$, it should probably be approximately $\sigma^2$ times a chi-square -- so why not model it as, say, a Gamma random variable (the distribution of a multiple of a chi-square)?

So why not use a GLM for this problem? All your fitted values are guaranteed not to go negative. See the example here (however, if you fit a straight-line model, predicted values can - indeed, must - still go negative outside the data).

Is there a way to impose a lower bound on the predicted values?

If you fit a model for the mean such that the mean will remain positive (log-link, say, rather than identity-link) then out-of-sample predictions will obey the positivity restriction.

If you're modelling variances, the identity link usually won't make sense anyway. Choose one of the others, and the model - fitted and predicted - will stay positive.

Glen_b
  • Very thorough answer, thank you. So what is the advantage/difference between doing GLM with log link, as opposed to taking the log of the dependent variable, i.e. $log(\sigma^2) = X \beta + \epsilon$? – Mayou Nov 26 '14 at 13:43
  • Also, what are your thoughts on Tobit model in this case? – Mayou Nov 26 '14 at 14:02

The easiest way is to fit $z_t=\ln{y_t}$ instead of $y_t$. This way you get the fitted values as $\hat{y}_t=e^{\hat{z}_t}$, always positive.
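A minimal sketch of this log-transform approach, with made-up data (plain NumPy least squares stands in for whatever OLS routine you use):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: positive responses y and a single regressor x.
n = 100
x = rng.normal(size=n)
y = np.exp(0.4 + 0.9 * x + rng.normal(scale=0.3, size=n))  # positive by construction

# Fit OLS to z = log(y), then back-transform the fitted values.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
z_hat = X @ beta_hat
y_hat = np.exp(z_hat)   # exp of anything is positive

assert (y_hat > 0).all()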

UPDATE: If you assume the errors are normal, $\zeta_t\sim\mathcal{N}(0,\sigma_z^2)$, then by lognormal properties you get $\sigma_{y,t}^2=(e^{\sigma_z^2}-1)e^{2\hat{z}_t+\sigma_z^2}$. Here the MSE depends on the fitted value. You have to think carefully about what MSE means in this case: the distribution is asymmetric, and the log transform compresses errors for higher values.

Without the normality assumption, you can estimate the MSE by computing the residuals $\hat r_t=y_t-e^{\hat z_t}$ and then taking $\hat\sigma^2_y=\mathrm{Var}[\hat r_t]$. This assumes the variance is constant.
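The residual-based estimate just described might be sketched as follows (the helper name and toy numbers are illustrative, not from the answer):

```python
import numpy as np

def naive_mse(y, z_hat):
    """MSE estimate for back-transformed fits y_hat = exp(z_hat):
    the variance of the original-scale residuals, assuming that
    variance is constant across observations."""
    r = y - np.exp(z_hat)   # residuals on the original scale
    return np.var(r)

# Toy usage with made-up numbers:
y = np.array([1.2, 0.8, 2.5, 1.9])
z_hat = np.log(np.array([1.0, 1.0, 2.0, 2.0]))
mse = naive_mse(y, z_hat)   # non-negative by construction
```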

Here's another interesting reference: Granger, C. W. J. and Newbold, P. (1976), "Forecasting Transformed Series", Journal of the Royal Statistical Society B, 38, 189–203; it also appears as Chapter 23 here. The method I described is called "naive" in that paper.

Note also that all of this applies to forecasting. If your goal is analysis rather than forecasting, the story is somewhat different.

Aksakal
  • Two problems I have with this: if $y_t$ is not log-normal, I would be introducing a lot of bias by taking the log. The other issue is that I do not have a closed form expression of the MSE of $\hat{y_t}$ – Mayou Nov 25 '14 at 21:41
  • This approach suffers from the fact that $E[Y \vert X]=\exp(x'\beta) \cdot E[\exp(\varepsilon_i)]$. You're leaving out the second term. – dimitriy Nov 25 '14 at 21:43
  • Actually, to correct what @Aksakal said, it would be $\hat{y} = exp(\mu + \sigma^2/2)$ where $\mu$ and $\sigma^2$ are the conditional mean (fitted value) and residual variance of $log(y)$ – Mayou Nov 25 '14 at 22:02
  • @DimitriyV.Masterov Mayou, your comments would be correct if $\sigma$ were known. Unfortunately, it is usually unknown, hence the $e^{\hat{\sigma}^2/2}$ factor usually does not improve the forecast, so it usually is not (and should not be) applied. References: 1) SAS Institute Inc. Forecasting Log Transformed Data, page 252. SAS/ETS 12.1 User's Guide. Cary, NC. 2) Helmut Lütkepohl and Fang Xu. The role of the log transformation in forecasting economic variables. Empirical Economics, 42(3):619–638, 2012. URL http://dx.doi.org/10.1007/s00181-010-0440-1 – Aksakal Nov 25 '14 at 22:16
  • @Mayou, OLS does not require normally distributed errors; it does not assume them. Hence, you can't say that $y_t$ is log-normally distributed. In fact, you probably don't know what the error distribution is. You wrote that you knew the distribution is like $\chi^2$, yet you applied OLS and didn't care that much. Why would you care now that it's log-transformed? – Aksakal Nov 25 '14 at 22:20
  • I agree on a few of these points. But let's assume we agree on all of them. The next item that I need is the MSE of the original $\hat{y}$. What is the formulation for that? – Mayou Nov 25 '14 at 22:25
  • This is what the SAS User's Guide, which I cited earlier, says: "The log transformation is often used to convert time series that are nonstationary with respect to the innovation variance into stationary time series. The usual approach is to take the log of the series in a DATA step and then apply PROC ARIMA to the transformed data. A DATA step is then used to transform the forecasts of the logs back to the original units of measurement. The confidence limits are also transformed by using the exponential function." So, you get the confidence limits and then exponentiate them. – Aksakal Nov 25 '14 at 22:27
  • Does that mean that this approximation holds: $MSE(\hat{y}) = exp(MSE(log(\hat{y})))$? – Mayou Nov 25 '14 at 22:29
  • @Mayou, not exactly. Let's say your MSE is $\sigma_z$, then confidence limits are $\bar{z}_t+\sigma_z$, so you convert them with $e^{\bar{z}_t+\sigma_z}$. – Aksakal Nov 25 '14 at 22:31
  • Well unfortunately, I don't need confidence interval. I need an explicit value for MSE. Is there a way of estimating it? – Mayou Nov 25 '14 at 22:32
  • You have a conceptual issue, unrelated to logs. What's is MSE for skewed distribution? – Aksakal Nov 25 '14 at 22:46
  • @Aksakal The transformation described in the SAS manual is for normal iid data. With time series, that is a problematic assumption, so it makes sense that it performs poorly. This is not time series data, so I am reasonably sure the Duan procedure (which only assumes iid) should perform better than just exponentiating. – dimitriy Nov 25 '14 at 23:55
  • The other reference (Lütkepohl) is for economic data, i.e. time series. It's been noted long ago that the convexity adjustment doesn't work well in forecasting. – Aksakal Nov 26 '14 at 00:05
  • The normality assumption isn't required ... unless you want to do some inference (hypothesis tests, confidence intervals, most particularly prediction intervals). If you want to apply normal theory inference to those problems, you will be assuming normality to do them. – Glen_b Nov 26 '14 at 21:49
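The "Duan procedure" mentioned in one of the comments above is Duan's smearing estimator; a minimal sketch with made-up data (assuming only iid errors on the log scale, as the comment notes):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up example data; the smearing estimator only assumes iid errors.
n = 150
x = rng.normal(size=n)
y = np.exp(0.3 + 0.7 * x + rng.normal(scale=0.5, size=n))

# OLS on the log scale.
X = np.column_stack([np.ones(n), x])
z = np.log(y)
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ beta_hat

# Duan's smearing factor: the average of exp(residuals) estimates
# E[exp(eps)] without assuming normality.
smear = np.mean(np.exp(resid))

# Smearing-corrected prediction of E[Y | X] on the original scale:
y_hat = smear * np.exp(X @ beta_hat)
```

This addresses the retransformation-bias term $E[\exp(\varepsilon)]$ raised in the comments without the normality assumption that the $e^{\hat\sigma^2/2}$ correction requires.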