Mean centering for PCA in a 2D array...across rows or cols?

Question

I'm pretty new at this and I'm picking my way through the steps for running PCA on a 2D numpy array. Each subarray represents all pixels of an image (all rows & cols flattened). Example:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# a[0], a[1], and a[2] each represent a separate image

I'll use numpy.cov to calculate the covariance across these subarrays but first I need to mean-center the data, and that's where I'm getting confused.

My novice question is: Should mean centering occur within each subarray? That is, should I calculate the mean of [1,2,3] and subtract it from each element, resulting in [-1,0,1], and then do the same to the next two subarrays (i.e., each subarray would get its own mean subtracted from each element)? Or, should mean centering occur across arrays? If so, across rows or cols?

I've seen examples online of mean centering by calculating the mean along axis=0 (rows) (e.g., http://www.janeriksolem.net/2009/01/pca-for-images-using-python.html) and axis=1 (cols) (e.g., http://glowingpython.blogspot.it/2011/07/pca-and-image-compression-with-numpy.html). But I honestly don't know which is appropriate in this case.

a:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

np.mean(a):
5.0

np.mean(a, axis=0):
[ 4.  5.  6.]

np.mean(a, axis=1):
[ 2.  5.  8.]

# which of the following mean-centered results makes sense?

a - np.mean(a):
[[-4. -3. -2.]
 [-1.  0.  1.]
 [ 2.  3.  4.]]

a - np.mean(a, axis=0):
[[-3. -3. -3.]
 [ 0.  0.  0.]
 [ 3.  3.  3.]]

a - np.mean(a, axis=1):
[[-1. -3. -5.]
 [ 2.  0. -2.]
 [ 5.  3.  1.]]

score 4 · Accepted Answer · answered Sep 13 '12 at 20:17

Usually, each row is an "observation" (in your case, an image), and each column is a variable (in your case, a pixel value). Therefore, you should center and scale the columns before doing PCA.
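
As a minimal sketch of that column-wise centering and scaling, using the 3x3 array from the question: the names `col_means`, `col_sds`, `centered`, and `standardized` are only illustrative, and the choice of `ddof=1` (sample standard deviation) is an assumption, since the thread does not say which normalization to use.

```python
import numpy as np

# Rows are observations (images); columns are variables (pixels).
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# One mean per column; broadcasting subtracts it from every row.
col_means = a.mean(axis=0)
centered = a - col_means

# Optional scaling: divide each column by its standard deviation.
# ddof=1 is an assumption here (sample standard deviation).
col_sds = a.std(axis=0, ddof=1)
standardized = centered / col_sds

print(centered)
print(standardized)
```

With real image data a pixel that is constant across all images has a standard deviation of zero, so the scaling step would need a guard against division by zero.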

Also, lots of good PCA libraries already exist, such as sklearn.decomposition.PCA, which can save you a lot of effort re-inventing the wheel. If you do want to implement PCA yourself, you should probably do so via SVD of the centered data rather than via the covariance matrix, since the SVD route is generally more numerically stable.
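
A rough sketch of the SVD route, compared against the covariance route, might look like the following. The random matrix `X` is hypothetical stand-in data (6 observations, 4 variables), and this is not a claim about how any particular library implements PCA internally.

```python
import numpy as np

# Hypothetical data: 6 observations (rows) by 4 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Center each column.
Xc = X - X.mean(axis=0)

# SVD route: Xc = U @ diag(s) @ Vt. The rows of Vt are the principal axes.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_svd = s**2 / (Xc.shape[0] - 1)   # variance along each axis
scores = Xc @ Vt.T                   # data projected onto the axes

# Covariance route, for comparison only.
cov = np.cov(Xc, rowvar=False)           # columns are the variables
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

print(np.allclose(var_svd, eigvals))
```

The two routes agree on the variances; the SVD version simply avoids forming the covariance matrix explicitly.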

Zach

So in my example output above, the final printed array (i.e., `[[-1. -3. -5.], [ 2. 0. -2.], [ 5. 3. 1.]]`) is the correct one given the input data? Re: scaling columns, do you mean to divide by the std deviation along the same axis, following mean centering? I have the sklearn module installed and tried sklearn.decomposition.PCA. I may return to it, but I first want to work on understanding how to calculate each piece of the PCA operation rather than just rely on a black box. – vulture Sep 13 '12 at 20:54
@vulture: I calculate the means for each column in your data to be `[4,5,6]`, so subtracting the mean from each column gives `[[-3 -3 -3], [0 0 0], [3 3 3]]`. Re: scaling, that's exactly what I mean: divide by the column's sd. – Zach Sep 13 '12 at 22:01
It appears I've mixed up numpy's axis statements. axis=0 calculates through columns and axis=1 calculates across rows, which matches your calculations. Sorry about that...thanks for the help. – vulture Sep 13 '12 at 23:15
@vulture No worries! In general, indexing in python starts at 0. Some other languages (such as R) start at 1. – Zach Sep 14 '12 at 00:18
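
For anyone else tripped up by the axis convention discussed in these comments, here is a small sketch using the array from the question. It also shows that the third output in the question, `a - np.mean(a, axis=1)`, is not row-wise centering: the shape-(3,) vector of row means is broadcast along the columns. Using `keepdims=True` is one way to get true row-wise centering; the variable names are illustrative only.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# axis=0 collapses the row dimension: one value per column.
# axis=1 collapses the column dimension: one value per row.
col_means = a.mean(axis=0)                  # shape (3,)
row_means = a.mean(axis=1, keepdims=True)   # shape (3, 1)

# True row-wise centering: each row has its own mean removed.
row_centered = a - row_means

# Without keepdims, the row means line up with the columns instead.
misaligned = a - a.mean(axis=1)

print(row_centered)
print(misaligned)
```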

score 1 · Answer 2 · answered Sep 13 '12 at 20:00

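It depends on the way the data is set up. For instance, if you calculate the covariance matrix as $\Sigma=\frac{1}{n}X'X$, where $X$ is the de-meaned data with one observation per row, then you would want to remove the mean from each column. Conversely, if you calculate it as $\Sigma=\frac{1}{n}XX'$, with one observation per column, then you would remove the mean from each row.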
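
As a quick check of that formula against `numpy.cov`, here is a sketch with the array from the question, assuming rows are observations. Note that `np.cov` treats each row as a variable by default and normalizes by n - 1, so `rowvar=False` and `bias=True` are needed to match $\frac{1}{n}X'X$; it also subtracts the means itself, so it can be given the raw data.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
n = a.shape[0]   # number of observations (rows)

# Manual version: de-mean the columns, then form X'X / n.
X = a - a.mean(axis=0)
sigma = X.T @ X / n

# np.cov on the raw data; rowvar=False says columns are the variables,
# bias=True normalizes by n instead of the default n - 1.
sigma_np = np.cov(a, rowvar=False, bias=True)

print(np.allclose(sigma, sigma_np))
```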

John
