Is there any reason why one cannot create a time series from variables calculated from regressions on cross sections?

Question

I am asking this question as the textbooks that I have don't specifically address the topic of creating a time series. If you have an answer, or even links to articles that I can research myself, it would be much appreciated.

Suppose you had thirty cross sections of data that are large enough for the asymptotic properties of OLS to apply. The cross sections are random samples of data collected each year for thirty years. As an example, let's say you want to analyse the income gap between men and women in a VAR or ECM model through time with a couple of other factors. Is there any reason why one could not run OLS regressions on each cross section to calculate estimates for the income gap (I am using the term "gap" to refer to male wage as a proportion of female wage), and then form a time series variable from these estimates for VAR or ECM analysis?

This is assuming that I am able to correctly specify the relevant functional form in the cross section regressions.

Are we to understand that the "income gap" will be the dependent variable in these $30$ cross-sectional regressions? — Alecos Papadopoulos, Apr 14 '17 at 18:56
No, earnings will be the dependent variable in the cross-section regressions. They will be in log-log form, and the income gap will be measured using the coefficient on a dummy variable. So my question is: if the asymptotic properties of OLS hold, would there be anything preventing me from using these coefficients in a time series analysis (either a VAR or ECM model)? — Alexander Whyte, Apr 14 '17 at 19:13
The economic model that I have derived would suggest that the income gap should be jointly determined with other variables through time. I'd like to analyse the relationship with a VAR preferably, but as I would be deriving one of the variables, I am just cautious to do so. — Alexander Whyte, Apr 14 '17 at 19:17

Alecos Papadopoulos · Accepted Answer · 2017-04-14T20:02:14.747

The justification for using a regression specification is that it measures the "income gap" after controlling for various confounding factors.

By running separate cross-sectional regressions one entertains the possibility that this income gap changes through time (and so it cannot be represented by a constant coefficient over time, which would justify using the whole of panel data together).

The fact that the cross-sectional dimension is large provides comfort that any variation over time observed in the resulting time series of coefficients is not due to sampling error but largely reflects existing structural relationships.
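
One way to make this precise, as a sketch under assumptions that are not stated above: write the year-$t$ estimate as $\hat{\beta}_t = \beta_t + u_t$, where $\beta_t$ is the true gap coefficient in year $t$ and $u_t$ is sampling error with $E[u_t \mid \beta_t] = 0$ (the symbols $\beta_t$, $u_t$ and $n_t$ are notation introduced here for illustration). The law of total variance then gives

$$\operatorname{Var}(\hat{\beta}_t) = \operatorname{Var}(\beta_t) + E\left[\operatorname{Var}(u_t \mid \beta_t)\right],$$

and under standard OLS asymptotics the second term shrinks at rate $1/n_t$ as the cross-section size $n_t$ grows. With large cross sections, most of the observed variation in the estimated series should therefore come from movement in $\beta_t$ itself.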

So what the OP thinks of doing can in principle be considered a valid approach.
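
To illustrate the two-step idea, here is a minimal sketch on simulated data, using only the Python standard library. Everything in it is an assumption made for illustration: the data-generating process, the single control `log_hours`, the sample size, and the helper names are hypothetical and are not the OP's actual specification. The coefficient on the male dummy is obtained by partialling out the control (Frisch-Waugh), which gives the same number as a joint OLS fit.

```python
import random
from statistics import fmean


def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a constant and x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """Residuals from regressing y on a constant and x."""
    a, b = simple_ols(x, y)
    return [yi - a - b * xi for xi, yi in zip(x, y)]


def gap_coefficient(log_earn, male, log_hours):
    """Coefficient on `male` in log_earn ~ 1 + male + log_hours (via Frisch-Waugh)."""
    ry = residuals(log_hours, log_earn)  # earnings net of the control
    rd = residuals(log_hours, male)      # dummy net of the control
    return simple_ols(rd, ry)[1]


def simulate_year(rng, n, true_gap):
    """Hypothetical cross section: one random sample of n workers for one year."""
    male = [rng.randint(0, 1) for _ in range(n)]
    log_hours = [rng.gauss(3.6, 0.2) + 0.05 * m for m in male]
    log_earn = [1.0 + true_gap * m + 0.9 * h + rng.gauss(0, 0.4)
                for m, h in zip(male, log_hours)]
    return log_earn, male, log_hours


rng = random.Random(0)
# Assumed "true" gap that narrows slowly over thirty years.
true_gaps = [0.30 - 0.005 * t for t in range(30)]

# Step 1: one cross-sectional regression per year.
# Step 2: stack the estimated coefficients into a time series.
gap_series = [gap_coefficient(*simulate_year(rng, 5000, g)) for g in true_gaps]

for year, (est, truth) in enumerate(zip(gap_series, true_gaps), start=1):
    print(f"year {year:2d}: estimated gap {est:.3f} (simulated truth {truth:.3f})")
```

The resulting `gap_series` is the derived variable that would then enter the VAR or ECM alongside the other series. Each entry still carries its own estimation error, which is the point of the comparison with generated regressors below.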

The general issue of using "derived" variables in the subsequent analysis is no different from what we face with Instrumental Variables estimation: there too, we use estimated variables as regressors (in the cases where, instead of just substituting the instrument for the original regressor, we regress the regressor on the instruments and use its estimated value).

By the way, having panel data but running many cross-sectional regressions instead of a single pooled one has some similarities with the "dummy variable method" and its "adjacent-period" variant, used in Hedonic Price Analysis (see for example the overview book by Jack Triplett, chapter 3).

Matthew Gunn · Answer 2 · 2017-04-14T21:23:27.273

Yes, you could.

And for your own edification, what you're suggesting is closely related to the Fama-MacBeth procedure for obtaining consistent standard errors in panel data with cross-sectional correlation.

Relation to Fama-MacBeth:

Loosely speaking, monthly stock returns are:

- hugely correlated cross-sectionally
- (close to) independent across time

Fama and MacBeth wanted to estimate the relationship between a portfolio's average returns (over time) and a portfolio's covariance with the market. The problem is, how can you get standard errors that are consistent in the presence of cross-sectional correlation?

Today, we might cluster by time, but what they did back in the 1970s is:

1. For each time period $t$, run a cross-sectional regression of $r_{it} = a_t + b_t x_{it} + \epsilon_{it}$ to get $\hat{b}_t$.
2. Estimate $\hat{b} = \frac{1}{T} \sum_t \hat{b}_t$ and $\operatorname{SE}(\hat{b}) = \sqrt{\frac{1}{T}\widehat{\operatorname{Var}}(\hat{b}_t)}$.

Since $\{\hat{b}_t\}$ is approximately an IID series (because each time period is basically independent), you can apply standard introductory-statistics techniques to the series $\{\hat{b}_t\}$.
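
The two steps can be sketched in a few lines of standard-library Python. This is only an illustration on simulated data: the number of periods and assets, the period-level shocks, and the helper name `cross_section_slope` are all assumptions of mine, not part of the original Fama-MacBeth setup.

```python
import random
from math import sqrt
from statistics import fmean, variance


def cross_section_slope(x, r):
    """Slope b_t from the period-t regression r_it = a_t + b_t * x_it + e_it."""
    mx, mr = fmean(x), fmean(r)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxr = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r))
    return sxr / sxx


rng = random.Random(1)
T, N, true_b = 120, 200, 0.5                  # assumed sizes and true average slope
x = [rng.gauss(1.0, 0.5) for _ in range(N)]   # hypothetical exposures x_i, fixed over time

b_hats = []
for t in range(T):
    # Period-level shocks shared by every asset: these create the
    # cross-sectional correlation, but are drawn independently across periods.
    level_shock = rng.gauss(0, 1.0)
    slope_shock = rng.gauss(0, 0.3)
    r = [0.1 + level_shock + (true_b + slope_shock) * xi + rng.gauss(0, 1.0)
         for xi in x]
    b_hats.append(cross_section_slope(x, r))  # step 1: one regression per period

# Step 2: treat the b_hats as a sample and use the usual mean and standard error.
b_bar = fmean(b_hats)
se_b = sqrt(variance(b_hats) / T)
print(f"b_bar = {b_bar:.3f}, SE = {se_b:.3f}")
```

The standard error here comes only from the time-series variation of the $\hat{b}_t$, so it does not rely on the within-period errors being uncorrelated across assets.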
