I'm going to break down my answer in a different way.
1. When is PCA or PLS preferred?
PCA is an unsupervised data reduction, i.e. the data is compressed into its underlying components without any guidance from data external to $X$ (the $Y$ data). The top ranked components are those that dominate the variation in $X$ as it has been pre-processed. These may or may not be related to something you are interested in, but if they are, the recovered components will not be influenced by noise in the $Y$ data.
PLS is a supervised data reduction, i.e. $Y$ data external to $X$ guides the compression of the $X$ data. This means the top ranked components are influenced by the variation in $Y$, including its noise, which raises the risk of overfitting during the data reduction step.
If $X$ contains dominant variation unrelated to $Y$, then the top ranked PLS components will have a higher correlation with $Y$ than the same ranks in the PCA model. Note that each component of real data mixes noise and signal, so the lower the rank at which the useful information appears, the more noise is mixed into that component. There is no clear-cut decision between PCA and PLS.
PCA works best when the variation of interest is dominant, but may still be outperformed by PLS as long as the signal-to-noise ratio in $Y$ is good enough not to introduce too much noise into the data reduction.
PLS will be better than PCA when the variation of interest is not dominant but still accounts for a useful proportion of $X$, and $Y$ has low noise.
The noisier $Y$ is, the greater the risk of over-fitting to that noise when doing PLS.
The lower the PCA rank at which the relevant information appears, the higher the risk that PLS overfits to noise in $X$, since lower ranked PCs tend to be noisier.
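To make that trade-off concrete, here is a minimal simulated sketch (the sizes, noise levels and variable layout are my own illustrative assumptions, not from your references): the dominant variation in $X$ is unrelated to $Y$, so the first PCA component tends to track the nuisance variation while the first PLS component correlates strongly with $Y$.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)        # latent variation related to Y
nuisance = 5 * rng.normal(size=n)  # dominant variation unrelated to Y

# first 10 variables carry the signal, the next 10 carry the larger nuisance variation
X = np.hstack([
    np.outer(signal, np.ones(10)) + rng.normal(size=(n, 10)),
    np.outer(nuisance, np.ones(10)) + rng.normal(size=(n, 10)),
])
y = signal + 0.3 * rng.normal(size=n)  # Y = signal plus measurement noise

pca_scores = PCA(n_components=3).fit_transform(X)
pls_scores = PLSRegression(n_components=3).fit(X, y).x_scores_

for k in range(3):
    pca_r = abs(np.corrcoef(pca_scores[:, k], y)[0, 1])
    pls_r = abs(np.corrcoef(pls_scores[:, k], y)[0, 1])
    print(f"component {k + 1}: |corr(PCA score, y)| = {pca_r:.2f}, "
          f"|corr(PLS score, y)| = {pls_r:.2f}")
```

If you re-run this with the nuisance variation scaled down, the gap between the two first components shrinks, which is the "no clear-cut decision" point above.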
2. What is the value of pre-screening variables (AKA variable selection)?
Variable selection is used to identify the variables that benefit the final model most; different variable selection methods make different assumptions about what that 'benefit' is. Variable suppression is related, differing in that variables deemed unhelpful are down-weighted rather than removed, but the motivation is very similar to variable selection.
Variable screening can filter out unstable variables that contribute little signal relative to noise. This makes the final model less noisy and therefore more precise (the spread or certainty of a point estimate), and often more accurate too (how close point estimates are to the expected answers). However, variable exclusion can limit the scope for outlier detection (particularly relevant for signal-based data, where channels that are empty for the component of interest may show signal when contamination occurs).
As with data reduction, pre-screening can be done with supervised or unsupervised approaches. Tools for variable selection based solely on PCA information include (from King and Jackson, Martens et al., Lu et al., Wold et al.; paywalled):
- residual variation (excluding variables that retain excessive variation after the model; this is essentially the inverse statement of method ‘B4’ in King and Jackson)
- variables that give the strongest contributions in rejected PCs
- jack-knifing (consistency during cross-validation)
- variable contribution in PCs selected for model inclusion (‘B2’ in King and Jackson).
Transformation of the original variables into an alternative feature space can also facilitate variable selection.
These approaches also transfer to PLS, and when you use $Y$ to direct the variable selection/suppression you have a supervised method. Some popular methods I am less able to comment on usefully include the LASSO (Tibshirani), ridge regression (Marquardt) and the elastic net (Zou and Hastie), as well as AR/ARIMA, which was used in the references you provided. All of these methods are covered extensively on CV.
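As an illustration only, here is a small unsupervised screening function loosely in the spirit of the contribution-based criteria listed above (it is my own approximation, not the exact 'B2' recipe from King and Jackson): variables are ranked by their variance-weighted squared loadings on the retained PCs, and the weakest contributors are dropped. The function name, weighting and threshold are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_contribution_screen(X, n_components=3, keep_fraction=0.5):
    """Rank variables by their summed squared loadings on the retained PCs,
    weighted by each PC's explained variance ratio, and keep the top fraction."""
    Xs = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(Xs)
    # contribution of each variable across the retained PCs
    contrib = (pca.components_ ** 2 * pca.explained_variance_ratio_[:, None]).sum(axis=0)
    n_keep = max(1, int(keep_fraction * X.shape[1]))
    keep = np.argsort(contrib)[::-1][:n_keep]
    return np.sort(keep), contrib

# usage: selected, scores = pca_contribution_screen(X); X_reduced = X[:, selected]
```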
3. Why subset $X$ and do PCA and then regression if your objective is to forecast $Y_{t+h}$?
A common motivation for this route is to avoid overfitting the data reduction. The more variables you have per time point or observation, the greater the risk of false positives occurring. In fact it is good practice to check variable and sample leverage in such situations, as the fewer variables that leverage a component, the higher the risk that it is spurious (Mejia et al.). A key issue for PCA is that every unrelated variable contributes variance to the dataset, and if irrelevant variables outweigh relevant ones, their total variance can promote them to higher ranked PCs, suppressing the relevant ones. Removing the most irrelevant variables first suppresses their variance and so makes it more likely that PCA will identify the relevant parameters.
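A deliberately simplified sketch of that route follows: screen $X$, reduce with PCA, then regress the scores on $Y_{t+h}$. The screening rule (dropping variables with negligible marginal correlation to the target), the function name and all defaults are my own illustrative choices, not the procedure used in the papers you cite.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def pcr_forecast(X, y, h=1, n_components=3, min_abs_corr=0.1):
    """Forecast y at horizon h via screen -> PCA -> regression."""
    X_t, y_th = X[:-h], y[h:]  # predictors at time t paired with the target at t+h

    # crude pre-screen: drop variables with almost no marginal correlation to the target
    corr = np.array([abs(np.corrcoef(X_t[:, j], y_th)[0, 1]) for j in range(X_t.shape[1])])
    keep = np.where(corr >= min_abs_corr)[0]

    scaler = StandardScaler().fit(X_t[:, keep])
    pca = PCA(n_components=min(n_components, keep.size)).fit(scaler.transform(X_t[:, keep]))
    reg = LinearRegression().fit(pca.transform(scaler.transform(X_t[:, keep])), y_th)

    # forecast from the most recent observation of X
    return reg.predict(pca.transform(scaler.transform(X[-1:, keep])))[0]
```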
4. Why not employ PLS and use $Y_{t+h}$ to guide the extraction of factors from $X$?
Neither paper references PLS to explain why PCA was chosen. There is no a priori reason to exclude PLS from consideration, but given the low ratio of time points to variables, great care should be taken to evaluate the evidence for overfitting. You are right to suspect that, because PLS actively seeks relevant variables and rewards the $Y$-covariance of variables, it should be able to ignore much of the irrelevant variation. However, it is important to realise that once information and noise are entangled, PLS cannot separate them. Some of the useful information in the relevant variables will have a covariance with the irrelevant information in the other variables, so either PLS will incorporate the irrelevant information, creating an unstable and noisy (overfitted) model, or it will reject the relevant information co-mixed with the irrelevant and underestimate the relationship (underfitting). Unless you have a strong theoretical reason for believing in a particular data-generating process, you are best trying both approaches. Also, there are some limitations to PLS to be aware of; see the CV Q&A on PLS criticism.
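If you do try PLS on the same data, one hedge against the overfitting risk described above is to watch how cross-validated error behaves as components are added, using time-ordered splits. This is only a sketch under my own assumptions about the setup (the split scheme, component grid and error metric are not from your references):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def pls_component_check(X, y, h=1, max_components=8):
    """Cross-validated error as PLS components are added; rising error warns of overfitting."""
    X_t, y_th = X[:-h], y[h:]  # predictors at t, target at t+h
    cv = TimeSeriesSplit(n_splits=5)
    for k in range(1, max_components + 1):
        scores = cross_val_score(PLSRegression(n_components=k), X_t, y_th,
                                 cv=cv, scoring="neg_mean_squared_error")
        print(f"{k} components: CV MSE = {-scores.mean():.3f}")
```

If the cross-validated error bottoms out after very few components and then worsens, that is the entanglement problem in action: the extra components are fitting the co-mixed irrelevant variation rather than the signal.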