
I'm trying to reduce the dimensionality and noise of a dataset by performing PCA and throwing away the last few PCs. After that, I want to use some machine learning algorithms on the remaining PCs, and therefore I want to normalize the data by equalizing the variance of the PCs to make the algorithms work better.

One simple way is to normalize the variances to unit values. However, the first PC contains more variance from the original dataset than the following ones, and I still want to give it more "weight". Therefore I was wondering: is there a simple way to split off some of its variance and share it with the PCs that have less variance?

Another way is to map the PCs back to the original feature space, but in that case the dimensionality would also increase to the original value.

I guess it's better to keep the resultant columns orthogonal, but it's not necessary at this moment.

amoeba
feilong
  • No...varimax maximizes the sum of the squared *variances* of the loadings, so it's trying to make them as *unequal* as possible. Also, why would you want to equalize the components? The whole point is to capture as much variation as possible in as few components as possible. –  Oct 18 '15 at 19:24
  • Does simply standardizing the component scores to unit variances not suit you? Why then? What kind of result do you want - should the resultant columns be uncorrelated in addition to equal variances? – ttnphns Oct 18 '15 at 19:51
  • From your description it looks very much like you want simply to "sphere" the data (of reduced dimensionality). It is frequently done as a preprocessing step in machine learning. In order to achieve it, you simply perform PCA, choose some components, and standardize them. I guess it is possible to find an orthogonal rotation (such as varimax) that rotates standardized components such that they remain uncorrelated but explain exactly the same amount of variance; that's an interesting question, I need to think about it. But I have never seen this done, definitely not in machine learning. – amoeba Oct 18 '15 at 22:34
  • By the way, what are "some machine learning algorithms" that you want to apply after PCA? This might be relevant. – amoeba Oct 18 '15 at 22:35
  • I want to apply one-nearest-neighbor classification with correlation-based distance to compare my algorithm with previous ones. Based on my experience so far, kNN with euclidean distance and SVM work better on PCs, so I was thinking maybe correlation-based kNN is more susceptible to unequal variance or noisy features, and some equalization may help. @amoeba – feilong Oct 19 '15 at 00:50
  • Note that if you rotate your standardized PCs, then the distances *will not change at all!* So it really should not matter for any subsequent distance-based algorithm. – amoeba Oct 19 '15 at 09:30
  • @amoeba You're right, I just realized the correlation distance is related to the angle between two vectors, and that doesn't change with rotations. – feilong Oct 19 '15 at 18:14
  • Have you looked at the PCA-derived "whitening" and "de-whitening" (aka "pinkening") transforms? (http://ufldl.stanford.edu/wiki/index.php/Implementing_PCA/Whitening) – EngrStudent Oct 22 '15 at 18:52
  • @EngrStudent That's what I referred to as normalization in the 2nd paragraph. I guess it's just different terms in different fields. – feilong Oct 22 '15 at 19:01

3 Answers


It is not completely clear to me that what you are asking is what you really need: a common preprocessing step in machine learning is dimensionality reduction + whitening, which means doing PCA and standardizing the components, nothing else. But I will nevertheless focus on your question as it is formulated, because it's more interesting.
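
That preprocessing takes only a few lines. Here is a minimal numpy sketch of PCA followed by whitening of the retained components (the function name and the centering assumption are illustrative, not part of any standard API):

import numpy as np

def pca_whiten(X, k):
    # X: centered n x d data matrix; keep k principal components and
    # standardize ("whiten") them to unit variance.
    n = X.shape[0]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # The PC scores are U[:, :k] * s[:k]; column i has variance s_i^2 / (n - 1),
    # so rescaling each column by sqrt(n - 1) / s_i leaves sqrt(n - 1) * U[:, :k].
    return np.sqrt(n - 1) * U[:, :k]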


Let $\mathbf X$ be the centered $n\times d$ data matrix with data points in rows and variables in columns. PCA amounts to singular value decomposition $$\mathbf X = \mathbf{USV}^\top \approx \mathbf U_k \mathbf S_k \mathbf V_k^\top,$$ where to perform the dimensionality reduction we keep only $k$ components. An orthogonal "factor rotation" of these components implies choosing an orthogonal $k \times k$ matrix $\mathbf R$ and plugging it into the decomposition: $$\mathbf X \approx \mathbf U_k \mathbf S_k \mathbf V_k^\top = \mathbf U_k \mathbf {RR}^\top \mathbf S_k \mathbf V_k^\top = \underbrace{\sqrt{n-1}\mathbf U_k^\phantom\top \mathbf {R}}_{\substack{\text{Rotated}\\\text{standardized scores}}} \cdot \underbrace{\mathbf R^\top \mathbf S_k \mathbf V_k^\top/\sqrt{n-1}}_{\text{Rotated loadings}^\top}.$$ Here $\sqrt{n-1}\mathbf U_k \mathbf R$ are rotated standardized components and the second term represents rotated loadings transposed. The variance of each component after rotation is given by the sum of squares of the corresponding loading vector; before rotation it is simply $s_i^2/(n-1)$. After rotation it is something else.
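
One can verify numerically that inserting $\mathbf{RR}^\top$ leaves the rank-$k$ reconstruction untouched; here is a small sketch with a random orthogonal $\mathbf R$ (the data are arbitrary, for illustration only):

import numpy as np

rng = np.random.RandomState(0)
n, d, k = 100, 10, 4
X = rng.randn(n, d)
X -= X.mean(axis=0)                                   # center the data

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k].T

R, _ = np.linalg.qr(rng.randn(k, k))                  # an arbitrary orthogonal k x k matrix

Xk = Uk @ Sk @ Vk.T                                   # rank-k reconstruction
scores_rot = np.sqrt(n - 1) * Uk @ R                  # rotated standardized scores
loadings_rot = Vk @ Sk @ R / np.sqrt(n - 1)           # rotated loadings
print(np.allclose(Xk, scores_rot @ loadings_rot.T))   # True: the reconstruction is unchanged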

Now we are ready to formulate the problem in mathematical terms: given unrotated loadings $\mathbf L = \mathbf V_k \mathbf S_k / \sqrt{n-1}$, find a rotation matrix $\mathbf R$ such that the rotated loadings, $\mathbf L \mathbf R$, have an equal sum of squares in each column.

Let's solve it. Column sums of squares after rotation are equal to the diagonal elements of $$(\mathbf {LR})^\top \mathbf{LR} = \mathbf R^\top \frac{\mathbf S^2}{n-1} \mathbf R.$$ This makes sense: the rotation simply redistributes the variances of the components, which are originally given by $s_i^2/(n-1)$, between them, according to this formula. We need to redistribute them such that they all become equal to their average value $\mu$.

I don't think there is a closed-form solution to this, and in fact there are many different solutions. But a solution can easily be built in a sequential fashion:

  1. Take the first component and the $k$-th component. The first one has variance $\sigma_\text{max}>\mu$ and the last one has the variance $\sigma_\text{min}<\mu$.
  2. Rotate only these two such that the variance of the first becomes equal to $\mu$. A rotation matrix in 2D depends on only one parameter $\theta$, and it is easy to write down the equation and compute the necessary $\theta$. Indeed, $$\mathbf R_\text{2D} = \left(\begin{array}{cc}\cos \theta & \sin \theta \\ -\sin\theta & \cos \theta\end{array}\right)$$ and after the transformation the first PC will get variance $$\cos^2\theta \cdot \sigma_\text{max} + \sin^2\theta \cdot \sigma_\text{min} = \cos^2\theta \cdot \sigma_\text{max} + (1-\cos^2\theta)\cdot \sigma_\text{min} =\mu,$$ from which we immediately obtain $$\cos^2\theta = \frac{\mu-\sigma_\text{min}}{\sigma_\text{max}-\sigma_\text{min}}.$$ (A quick numerical check of this formula is given right after this list.)
  3. The first component is now done, it has variance $\mu$.
  4. Proceed to the next pair, taking the component with the largest remaining variance and the one with the smallest. Go to step 2.
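
Here is a quick numerical check of the angle formula from step 2, using the same variances as in the example below ($10$, $1$, and $\mu=5$):

import numpy as np

sigma_max, sigma_min, mu = 10.0, 1.0, 5.0            # variances used in the example below
cos2 = (mu - sigma_min) / (sigma_max - sigma_min)    # cos^2(theta) from the formula above
theta = np.arccos(np.sqrt(cos2))
print(np.degrees(theta))                             # ~48.19 degrees (the first angle in the example)
print(np.cos(theta)**2 * sigma_max + np.sin(theta)**2 * sigma_min)  # 5.0, i.e. exactly mu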

This sequence of $(k-1)$ 2D rotations redistributes all variances equally. Multiplying all these rotation matrices together yields the overall $\mathbf R$.


Example

Consider the following $\mathbf S^2/(n-1)$ matrix: $$\left(\begin{array}{cccc}10&0&0&0\\0&6&0&0\\0&0&3&0\\0&0&0&1\end{array}\right).$$ The mean variance is $5$. My algorithm will proceed as follows:

  1. Step 1: rotate PC1 and PC4 so that PC1 gets variance $5$. As a result, PC4 gets variance $1+(10-5)=6$.

  2. Step 2: rotate PC2 (new maximal variance) and PC3 so that PC2 gets variance $5$. As a result, PC3 gets variance $3+(6-5)=4$.

  3. Step 3: rotate PC4 (new maximal variance) and PC3 so that PC4 gets variance $5$. As a result, PC3 gets variance $4+(6-5)=5$.

  4. Done.

I wrote a Matlab script that implements this algorithm (see below). For this input matrix, the sequence of rotation angles is:

48.1897   35.2644   45.0000

Component variances after each step (in rows):

10     6     3     1
 5     6     3     6
 5     5     4     6
 5     5     5     5

The final rotation matrix (product of three 2D rotation matrices):

 0.6667         0    0.5270    0.5270
      0    0.8165    0.4082   -0.4082
      0   -0.5774    0.5774   -0.5774
-0.7454         0    0.4714    0.4714

And the final $(\mathbf{LR})^\top \mathbf{LR}$ matrix is:

5.0000         0    3.1623    3.1623
     0    5.0000    1.0000   -1.0000
3.1623    1.0000    5.0000    1.0000
3.1623   -1.0000    1.0000    5.0000

Here is the code:

S = diag([10 6 3 1]);       %// diagonal matrix of component variances S^2/(n-1)
mu = mean(diag(S));         %// target variance = average of all variances
R = eye(size(S));           %// overall rotation, accumulated as a product of 2D rotations

vars(1,:) = diag(S);
Supdated = S;

for i = 1:size(S,1)-1
    %// pick the components with the largest and the smallest current variance
    [~, maxV] = max(diag(Supdated));
    [~, minV] = min(diag(Supdated));

    %// 2D rotation angle that makes the variance of component maxV exactly mu
    w = (mu-Supdated(minV,minV))/(Supdated(maxV,maxV)-Supdated(minV,minV));
    cosTheta = sqrt(w);
    sinTheta = sqrt(1-w);

    %// embed the 2D rotation into a full-size matrix acting on [maxV minV] only
    R2d = eye(size(S));
    R2d([maxV minV], [maxV minV]) = [cosTheta sinTheta; -sinTheta cosTheta];
    R = R * R2d;

    Supdated = transpose(R2d) * Supdated * R2d;

    vars(i+1,:) = diag(Supdated);
    angles(i) = acosd(cosTheta);
end

angles                %// sequence of 2d rotation angles
round(vars)           %// component variances on each step
R                     %// final rotation matrix
transpose(R)*S*R      %// final S matrix

Here is the code in Python provided by @feilong:

import numpy as np


def amoeba_rotation(s2):
    """
    Parameters
    ----------
    s2 : array
        The diagonal of the matrix S^2.

    Returns
    -------
    R : array
        The rotation matrix R.

    Examples
    --------
    >>> amoeba_rotation(np.array([10, 6, 3, 1]))
    [[ 0.66666667  0.          0.52704628  0.52704628]
     [ 0.          0.81649658  0.40824829 -0.40824829]
     [ 0.         -0.57735027  0.57735027 -0.57735027]
     [-0.74535599  0.          0.47140452  0.47140452]]

    http://stats.stackexchange.com/a/177555/87414
    """
    s2 = np.array(s2, dtype=float)  # work on a float copy, leaving the caller's array intact
    n = len(s2)
    mu = s2.mean()
    R = np.eye(n)
    for i in range(n-1):
        max_v, min_v = np.argmax(s2), np.argmin(s2)
        w = (mu - s2[min_v]) / (s2[max_v] - s2[min_v])
        cos_theta, sin_theta = np.sqrt(w), np.sqrt(1-w)
        R[:, [max_v, min_v]] = np.dot(
            R[:, [max_v, min_v]],
            np.array([[cos_theta, sin_theta], [-sin_theta, cos_theta]]))
        s2[[max_v, min_v]] = [mu, s2[max_v] + s2[min_v] - mu]
    return R

Note that this problem is completely equivalent to the following one: given $k$ uncorrelated variables with variances $\sigma_i^2$, find a rotation (i.e. a new orthogonal basis) that will yield $k$ variables with equal variances (but of course not uncorrelated anymore).
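
To tie this back to the question, here is a short usage sketch for the function above (the data generation is purely illustrative): compute the PCA scores of some data, pass their variances to amoeba_rotation, and check that the rotated scores have equal variances.

import numpy as np

rng = np.random.RandomState(0)
n, d, k = 500, 10, 4
X = rng.randn(n, d) * np.arange(1, d + 1)      # columns on different scales
X -= X.mean(axis=0)                            # center

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :k] * s[:k]                      # scores of the k retained components
variances = scores.var(axis=0, ddof=1)         # equals s[:k]**2 / (n - 1)

R = amoeba_rotation(variances.copy())          # pass a copy so `variances` is certainly left untouched
rotated = scores @ R

print(variances)                               # decreasing, unequal
print(rotated.var(axis=0, ddof=1))             # all (approximately) equal to variances.mean()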

amoeba
  • I guess, for any pair of components (their scores), the angle of rotation would be 45 degrees, to equalize their variances. However, I can't imagine how to do the whole task with 3+ components pairwise. – ttnphns Oct 19 '15 at 03:07
  • @ttnphns I guess the idea is to equalize a pair of components at a time, and repeat the step until the difference among all components is tolerable. Eventually all components will have equal variance, though it may take many steps to approach that. – feilong Oct 19 '15 at 03:37
  • @ttnphns: If there are only two PCs selected, then the rotation will always be 45 degrees to equalize the variances, that's correct. Now imagine we select three PCs with variances 10, 5, and 3. The average value is 6. What I suggest is first to rotate PC1 and PC3 to make their variances 6 and 7. Then the first component is done, its variance is now 6. Note that the angle for this rotation is **not** 45 degrees! The two remaining components have variances 5 and 7. So now we rotate these two to make it 6 and 6. Now the overall rotation is composed of two 2D subrotations. Does it make sense? – amoeba Oct 19 '15 at 09:37
  • @feilong, I think equalizing the variance of a pair of components at a time is a very suboptimal algorithm. What I suggested is to choose the rotations such that the variance of one component becomes exactly equal to the global mean variance. Then this component is "done", and one can deal with the rest. This is guaranteed to equalize all variances in a finite number of steps. See my previous comment for an example. – amoeba Oct 19 '15 at 09:38
  • @amoeba, what you are saying is interesting and deserves a try. Are you thinking of trying it, if you have time, and showing us some (pseudo)code to follow, if it works? – ttnphns Oct 19 '15 at 10:25
  • @amoeba You're right, that's a better solution, and should finish with n-1 steps. – feilong Oct 19 '15 at 13:20
  • @amoeba, it would be great if you could expand a bit on your point 2 regarding the computation of the cosine of theta, i.e. how you arrive at the line computing sqrt(w) in your code. A reader might want to see that step already worked out. Thank you. – ttnphns Oct 21 '15 at 17:41
  • @ttnphns, I made the update, please take a look if it's clear enough. (And thanks for the edit you did.) – amoeba Oct 21 '15 at 20:04
  • @amoeba I've added my minimal implementation using Python. I modified the part multiplying the whole matrix, as that can be time-consuming for large matrices. – feilong Oct 22 '15 at 18:47
  • @amoeba Specifically for principal components, it's possible to save more time by removing the part searching for the maximum and minimum. We can simply rotate the 1st and 2nd components (to make the 1st component have average variance), and then the 2nd and 3rd, and so on. We just need to make sure the total variance of each pair is larger than `mu`. – feilong Oct 22 '15 at 18:52
  • Thanks, @feilong. These are good suggestions to improve both the algorithm and the implementation. – amoeba Oct 22 '15 at 19:06
  • @amoeba, where does the equation, $\sigma_{max} \cos^2 \Theta + \sigma_{min} \sin^2 \Theta = \mu$ come from? You haven't said anything about that principal thing. – ttnphns Oct 23 '15 at 10:11
  • ...and the sigmas, I believe, should be shown squared, 'cause they are variances. – ttnphns Oct 23 '15 at 10:13

In his perspicacious and comprehensive answer, @amoeba has shown, as part of the answer, how one can rotate two uncorrelated variables (such as principal components, for example) to achieve the desired variances for them (at the expense of losing uncorrelatedness, of course). Let the orthogonal variables $X$ and $Y$ have variances $\sigma^2_{max}$ (the larger) and $\sigma^2_{min}$ (the smaller), respectively. Rotate them so that $X$ gets an arbitrary, diminished variance $\mu^2$ (while $Y$, consequently, will have variance $\sigma^2_{max}+\sigma^2_{min}-\mu^2$).

@amoeba shows the formula from which we can compute the angle of such rotation, $\cos\theta$:

$$\mu^2 = \cos^2\theta (\sigma^2_{max}) + \sin^2\theta (\sigma^2_{min})$$

but has not demonstrated where this equation comes from, probably thinking that it is obvious without explanation. Obvious or not, I believe it is worth elucidating in some way. My answer presents one way.

And so, we have an ellipsoidal, centered data cloud in the space of the uncorrelated variables $X$ and $Y$. We have to rotate the axes by the angle $\theta$. A data point in the cloud (such as the one shown as a green dot in the figure) with $X$ coordinate $x$ will have coordinate $x^*$ on the rotated axis after the rotation.

[Figure: illustration of the rotation]

Observe that the projection of the coordinate $x$ (its mark on the original axis) onto the rotated axis $X^*$ is given by $x'=x\cos\theta$ (a leg obtained from the hypotenuse and the angle between them). Observe also that $x^*$ is less than $x'$ by the segment of length $x'-x^*$, computable from the coordinate $y$ as $y\sin\theta$ (again a leg and a hypotenuse). And so,

$$x^* = x' - (x'-x^*) = x\cos\theta-y\sin\theta$$

We know (see the beginning) the variances (or sums of squares) of the two variables, and we know the variance (sum of squares) $\mu^2$ that $X^*$ should have. Then it follows:

$$\mu^2=\sum x^{*2} = \sum(x\cos\theta-y\sin\theta)^2 = \sum(x^2\cos^2\theta+y^2\sin^2\theta-2xy\cos\theta\sin\theta) = \cos^2\theta\sum x^2 + \sin^2\theta\sum y^2 - \underbrace{ 2\cos\theta\sin\theta\sum xy}_{\text{=0 (X and Y are uncorrelated)}} = \cos^2\theta (\sigma^2_{max}) + \sin^2\theta (\sigma^2_{min})$$

From which you estimate $\cos\theta$, as @amoeba has shown, and perform the rotation.
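
For readers who prefer to see this numerically, here is a small sketch (with arbitrary illustrative numbers) that builds two exactly uncorrelated, centered variables, rotates them, and compares the sums of squares:

import numpy as np

rng = np.random.RandomState(1)
n = 10000
x = 3.0 * rng.randn(n)                         # the larger-variance variable
y = 1.0 * rng.randn(n)
x -= x.mean(); y -= y.mean()                   # center both
y -= x * (x @ y) / (x @ x)                     # make y exactly uncorrelated with x

theta = np.radians(30.0)                       # an arbitrary angle, for illustration
x_star = np.cos(theta) * x - np.sin(theta) * y

lhs = x_star @ x_star                          # sum of squares of the rotated coordinate
rhs = np.cos(theta)**2 * (x @ x) + np.sin(theta)**2 * (y @ y)
print(np.allclose(lhs, rhs))                   # True: the cross term sum(x*y) vanishes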

feilong
ttnphns
  • +1. I didn't think that it is obvious (it isn't), but I rather thought that it is *easy to verify* :-) One can also show it by direct algebra, writing down (as in my answer) $$\left(\begin{array}{cc}\cos \theta & \sin \theta \\ -\sin\theta & \cos \theta\end{array}\right)^\top \left(\begin{array}{cc} \sigma_\text{max}^2 & 0 \\ 0 & \sigma_\text{min}^2\end{array}\right) \left(\begin{array}{cc}\cos \theta & \sin \theta \\ -\sin\theta & \cos \theta\end{array}\right),$$ and computing the upper-left element of the product. That's of course the same reasoning, just expressed differently. Thanks! – amoeba Oct 23 '15 at 16:14
  • And I do think that your geometric explanation and "direct" computation (without matrices) is easier to understand and very helpful to develop the right intuitions. – amoeba Oct 23 '15 at 16:16

If I interpret things correctly, you mean that the first principal component (largest eigenvalue) explains most of the variance in the data. This can happen when your compression method is linear. However, there might be non-linear dependencies in your feature space.

TL;DR: PCA is a linear method. Use autoencoders (non-linear PCA) for dimensionality reduction. If the machine learning part is supervised learning, then simply monitor your loss function while adjusting the (hyper)parameters of the autoencoder. That way you will end up with a far better compressed version of your original data.
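
As a rough illustration of that suggestion, here is a minimal autoencoder sketch, assuming TensorFlow/Keras is available; the data, layer sizes, and training settings are placeholders rather than recommendations:

import numpy as np
from tensorflow import keras

X = np.random.randn(1000, 50).astype("float32")        # placeholder data: 1000 samples, 50 features
k = 10                                                 # bottleneck (compressed) dimension

inputs = keras.Input(shape=(50,))
encoded = keras.layers.Dense(k, activation="relu")(inputs)
decoded = keras.layers.Dense(50, activation="linear")(encoded)

autoencoder = keras.Model(inputs, decoded)             # trained to reconstruct its input
encoder = keras.Model(inputs, encoded)                 # maps data to the compressed representation

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

X_compressed = encoder.predict(X)                      # non-linear low-dimensional codes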

Here's a scikit-learn example where they do a grid search to find the optimal number of principal components to keep (a hyper-parameter) using PCA. Finally they apply logistic regression on the lower-dimensional space: http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#example-plot-digits-pipe-py
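
In the spirit of that linked example, here is a sketch using current scikit-learn module paths (which have changed since this was written); the parameter grid is illustrative:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipe = Pipeline([("pca", PCA()), ("logreg", LogisticRegression(max_iter=1000))])
param_grid = {"pca__n_components": [5, 15, 30, 45, 64]}    # candidate numbers of PCs to keep

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)                                 # number of components chosen by cross-validation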

Protip: autoencoders do not have a closed-form solution (afaik), so if your context is streaming data, this means you can continuously update your autoencoder (compressed representation) and thus compensate for things such as concept drift. With PCA you have to re-train in batch mode from time to time as new data comes in.

As to giving some features more "weight", see regularization (I'd start from norms: https://en.wikipedia.org/wiki/Norm_(mathematics) ). You might also be surprised how similar logistic regression is to the perceptron.

user91213
  • I don't see how this answers the OP's question; your answer seems to be entirely unrelated to the question. – amoeba Oct 22 '15 at 22:11
  • _Therefore I was wondering: is there a simple way to just split its variance and share it with the PCs with less variances?_ The OP wants to do dimensionality reduction. I offered an alternative to solve his problem, since ultimately what the OP wants is not guaranteed to result in better performance unless performance is measured. Working in Hilbert / normed spaces doesn't guarantee better results. Measuring performance leads to better results. – user91213 Oct 23 '15 at 04:44