I've been surveying the different methods for approaching linear multivariate problems (e.g., PCA, PLS, factor analysis, etc.) and want to generate a model for $Y$'s that depend nonlinearly on $X$'s via linearizations of the $X$'s. However, I have not found much about the process of linearizing variables so that one of these linear models can be used, so at the moment I am fairly blind to any pitfalls of doing this naively. Two specifics come to mind:
(1) It seems that standardizing variables is common, but it is not clear to me how to do it if I presume a nonlinear relation (it may just be that I don't understand why we standardize). Say I presume $Y = a X^2$, where I want to determine $a$. I could standardize $X$ (and $Y$?), then linearize, then fit. Or I could linearize via $U = X^2$, then standardize $U$ (and $Y$?), then fit. Intuitively, these operations don't commute, so I would expect two different fits, but there is only one value of $a$ by definition. How can I resolve this conflict?
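To make the two orderings concrete, here is a minimal sketch of what I mean (numpy only; the data and the value $a = 3$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data following Y = a * X^2 + noise, with a = 3
a_true = 3.0
X = rng.uniform(1.0, 5.0, size=200)
Y = a_true * X**2 + rng.normal(0.0, 1.0, size=200)

def standardize(v):
    return (v - v.mean()) / v.std()

# Ordering A: standardize X first, then linearize (square)
slope_A = np.polyfit(standardize(X)**2, standardize(Y), 1)[0]

# Ordering B: linearize first (U = X^2), then standardize U
slope_B = np.polyfit(standardize(X**2), standardize(Y), 1)[0]

print(slope_A, slope_B)  # the slopes differ: the two orderings do not commute

# Undoing the standardization in ordering B recovers a:
print(slope_B * Y.std() / (X**2).std())  # close to a_true = 3.0
```

Running this, the two slopes come out different, which is exactly the conflict I am asking about.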
(2) I have many $X$'s, and I presume that some of them will have a nonlinear relation with $Y$. So far, I have computed different transforms of my $X$'s ($\log(X)$, $X^2$, etc.) and allowed the fitting routine to pick out whichever it wants (say, via stepwise regression). However, I have little intuition as to how to regard $X$ and its transformations: should I allow the fitting routine to pick only one 'version' of each $X$, precluding a model like $Y = a_1 \log(X_1) + a_2 X_1^2 + a_3 X_1^{-1} + a_4 \log(X_2) + a_5 X_2^2 + a_6 X_2^{-1}$ and instead enforcing that there can be only one term with $X_1$ and one term with $X_2$?
On the one hand, these different transforms are by construction highly (though nonlinearly) correlated with $X_i$, so now I am wondering whether I have a collinearity problem. At the same time, polynomial fits are fairly standard even though $X^n$ and $X^m$ are correlated in just the same way for $n \neq m$.
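For instance, on positive-valued data the plain linear correlations between a variable and its transforms already come out close to $\pm 1$, which is what worries me (a quick check with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(1.0, 10.0, size=500)  # positive, so log(X) and 1/X are defined

# Pairwise correlations between X and the transforms I am considering
names = ["X", "log(X)", "X^2", "1/X"]
cols = np.array([X, np.log(X), X**2, 1.0 / X])
corr = np.corrcoef(cols)  # rows of `cols` are treated as variables
for name, row in zip(names, corr):
    print(f"{name:>7}:", np.round(row, 2))
```

Every off-diagonal entry is large in magnitude, yet (as noted above) polynomial regression routinely includes several such columns at once.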
I would note that I have tried looking for "nonlinear multivariate methods" but the only sources I have found are (at the moment) way above my head.
Thank you for any guidance!
In response to your second comment (I was character-limited in the comments):
The bare-bones problem is this: I have one measurement ($Y$) that I want to model using several other measurements (the different $X_i$'s).
Because I have several $X_i$'s, I understood this to be a multivariate problem. The simplest model I could write is something like $Y = b + \sum_i a_i X_i$. However, I don't want to presume linearity between $Y$ and each $X_i$, so I want to explore linearity between $Y$ and different transforms of the $X_i$ (i.e., $X_i^2$, $\log(X_i)$, etc.).
On the one hand, I multiply the number of variables by the number of transforms I want to consider, so computationally this gets expensive and there is an incentive not to over-reach. At the same time, I know to consider screening out $X_i$ variables that are strongly correlated with one another, which has the computational benefit of reducing how many variables go into the model. I just don't understand whether I should screen within a family of $X_i$ transforms in some way. If not, then it seems I should treat each transform of $X_i$ as a new variable in its own right. But then I worry that any screening protocol I build would reject the transforms of $X_i$, since by construction they are correlated with each other. A sketch of the kind of expand-then-select procedure I mean follows. Sorry if this isn't any clearer.
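Here is a minimal sketch of that expand-then-select procedure, with scikit-learn's LassoCV standing in for stepwise regression (the data, the transform family, and the 'true' model are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 300, 4
X = rng.uniform(1.0, 5.0, size=(n, p))

# Hypothetical truth: Y depends nonlinearly on only two of the X_i
Y = 2.0 * np.log(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=n)

# Expand every X_i into a family of transforms, each treated as its own column
family = [("X", lambda v: v), ("log(X)", np.log),
          ("X^2", np.square), ("1/X", lambda v: 1.0 / v)]
cols, names = [], []
for i in range(p):
    for label, f in family:
        cols.append(f(X[:, i]))
        names.append(f"{label} of X_{i + 1}")
D = np.column_stack(cols)

# Let an L1-penalized fit (in place of stepwise regression) pick terms
model = LassoCV(cv=5).fit(StandardScaler().fit_transform(D), Y)
for name, c in zip(names, model.coef_):
    if abs(c) > 1e-3:
        print(name, round(c, 3))
```

With columns this correlated, the selection can keep more than one transform of the same $X_i$ (or an unstable subset of them), which is exactly the behavior I don't know whether to allow or to forbid.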