I'm pretty new at this and I'm picking my way through the steps for running PCA on a 2D numpy array. Each subarray represents all pixels of an image (all rows & cols flattened). Example:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# so, a[0], a[1], a[2] each represent a separate image
I'll use numpy.cov to calculate the covariance across these subarrays but first I need to mean-center the data, and that's where I'm getting confused.
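For context on numpy.cov, here is a minimal sketch of how its rowvar argument decides what counts as a variable. As far as I understand it, numpy.cov subtracts each variable's mean internally, so centering by hand beforehand should not change its output. The variable names below are mine, chosen only for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# Default rowvar=True: each row is a variable, each column an observation.
cov_rows = np.cov(a)

# rowvar=False: each column is a variable, each row an observation.
cov_cols = np.cov(a, rowvar=False)

# Centering by hand along the matching axis leaves both results unchanged.
same_rows = np.allclose(cov_rows, np.cov(a - a.mean(axis=1, keepdims=True)))
same_cols = np.allclose(cov_cols, np.cov(a - a.mean(axis=0), rowvar=False))

print(cov_rows)
print(cov_cols)
print(same_rows, same_cols)
```

So the centering axis and the rowvar setting go together: whichever dimension holds the variables is the one whose means get removed.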
My novice question is: Should mean centering occur within each subarray? That is, should I calculate the mean of [1,2,3] and subtract it from each element, resulting in [-1,0,1], and then do the same to the next two subarrays (i.e., each subarray would get its own mean subtracted from each element)? Or, should mean centering occur across arrays? If so, across rows or cols?
I've seen examples online of mean centering by calculating the mean along axis=0 (down the rows, giving one mean per column) (e.g., http://www.janeriksolem.net/2009/01/pca-for-images-using-python.html) and along axis=1 (across the columns, giving one mean per row) (e.g., http://glowingpython.blogspot.it/2011/07/pca-and-image-compression-with-numpy.html). But I honestly don't know which is appropriate in this case.
a:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
np.mean(a):
5.0
np.mean(a, axis=0):
[ 4.  5.  6.]
np.mean(a, axis=1):
[ 2.  5.  8.]
# which of the following mean-centered results makes sense?
a - np.mean(a):
[[-4. -3. -2.]
 [-1.  0.  1.]
 [ 2.  3.  4.]]
a - np.mean(a, axis=0):
[[-3. -3. -3.]
 [ 0.  0.  0.]
 [ 3.  3.  3.]]
a - np.mean(a, axis=1):
[[-1. -3. -5.]
 [ 2.  0. -2.]
 [ 5.  3.  1.]]
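One thing worth noting about the last result: np.mean(a, axis=1) returns a 1-D array of row means, and subtracting it from a broadcasts those means across the columns, so it does not subtract each row's own mean from that row. A minimal sketch of the difference, assuming keepdims=True is the intended way to keep the means aligned with their rows (the variable names are only illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# Per-column centering: the (3,) vector of column means lines up with columns.
col_centered = a - a.mean(axis=0)

# Per-row centering: keepdims=True gives shape (3, 1), so each row
# has its own mean subtracted.
row_centered = a - a.mean(axis=1, keepdims=True)

# Without keepdims, the (3,) vector of row means is matched against columns.
misaligned = a - a.mean(axis=1)

print(col_centered)
print(row_centered)
print(misaligned)
```

With keepdims=True every row of this example becomes [-1, 0, 1], which matches the within-subarray centering described in the question.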