
I am seeing some very strange behavior: I run PCA on the MNIST data set and then check the train and test reconstruction errors, and the test error comes out lower than the train error, which seems totally bizarre to me. Is this normal? Have people ever experienced this type of behavior on the MNIST data set?

To make my bug totally reproducible, I will give a link to my GitHub code (the exact data set I used is there, and it's small). Also, the code I ran is super simple:

clear;clc;
load('data_MNIST_original_minist_60k_10k_split_train_test');
[N_train, D] = size(X_train) % (N_train x D)
[N_test, D] = size(X_test) % (N_test x D)
%% preparing model
K = 10;
[coeff, ~, ~, ~, ~, mu] = pca(X_train); % (D x R) = Rows of X_train correspond to observations and columns to variables. 
%% Learn PCA
X_train = X_train'; % (D x N) = (N_train x D)'
X_test = X_test'; % (D x N) = (N_test x D)'
U = coeff(:,1:K); % (D x K) = K pca's of dimension D
X_tilde_train = (U * U' * X_train); % project onto first K PCs and reconstruct: (D x N_train) = (D x K) x (K x D) x (D x N_train)
X_tilde_test = (U * U' * X_test); % (D x N_test) = (D x K) x (K x D) x (D x N_test)
train_error_PCA = (1/N_train)*norm( X_tilde_train - X_train ,'fro')^2
test_error_PCA = (1/N_test)*norm( X_tilde_test - X_test ,'fro')^2
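
As a sanity check on whether such a small train/test gap even means anything, one thing I could compute (just a sketch, not something I ran for the numbers below) is the per-image squared error and its standard error of the mean:

% sketch (not used for the numbers reported below): per-image squared errors
% and their standard errors, to see if the train/test gap is within noise
per_image_train = sum( (X_tilde_train - X_train).^2, 1 ); % (1 x N_train)
per_image_test = sum( (X_tilde_test - X_test).^2, 1 ); % (1 x N_test)
se_train = std(per_image_train) / sqrt(N_train)
se_test = std(per_image_test) / sqrt(N_test)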

Then I got:

train_error_PCA =

   29.0759


test_error_PCA =

   28.8981

Is this normal?

Is this why people add background noise to the MNIST data set?

I am just using MATLAB's standard pca function, so I can't see what might possibly have gone wrong.
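
One thing I have not ruled out (again just a sketch, not what produced the numbers above): MATLAB's pca centers the columns by default and returns the estimated mean as mu, so the textbook reconstruction would subtract and add back that mean:

% sketch of a mean-centered reconstruction (not what I ran above);
% X_train / X_test are still (D x N) here and mu is (1 x D)
Xc_train = bsxfun(@minus, X_train, mu');
Xc_test = bsxfun(@minus, X_test, mu');
X_tilde_train_c = bsxfun(@plus, U * U' * Xc_train, mu');
X_tilde_test_c = bsxfun(@plus, U * U' * Xc_test, mu');
train_error_PCA_centered = (1/N_train)*norm( X_tilde_train_c - X_train ,'fro')^2
test_error_PCA_centered = (1/N_test)*norm( X_tilde_test_c - X_test ,'fro')^2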


I have also tested this with different non-linear auto-encoders; for example, I tried a one-layer Radial Basis Function (RBF) network and it produces the same strange result:

[plot: train error (circles) vs. test error (stars) for the RBF auto-encoder]

If you look closely, each little circle (train) is above the corresponding star (test), which is not what I expected.
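
For concreteness, the kind of one-layer RBF reconstruction I have in mind looks roughly like the sketch below (Gaussian kernels with centers from k-means; num_rbf_centers and beta are placeholder hyper-parameters, not the exact values behind the plot above):

% rough sketch of a one-layer RBF reconstruction; X_train / X_test here are
% in the original (N x D) orientation, i.e. rows are images
num_rbf_centers = 100; % placeholder
beta = 1e-3; % placeholder Gaussian bandwidth
[~, C] = kmeans(X_train, num_rbf_centers); % (num_rbf_centers x D) centers
Phi_train = exp( -beta * pdist2(X_train, C).^2 ); % (N_train x num_rbf_centers)
Phi_test = exp( -beta * pdist2(X_test, C).^2 ); % (N_test x num_rbf_centers)
W = Phi_train \ X_train; % least-squares output weights (num_rbf_centers x D)
rbf_train_error = (1/N_train)*norm( Phi_train*W - X_train ,'fro')^2
rbf_test_error = (1/N_test)*norm( Phi_test*W - X_test ,'fro')^2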


I have now run the same test with a harder MNIST variant, which can be found at https://sites.google.com/a/lisa.iro.umontreal.ca/public_static_twiki/variations-on-the-mnist-digits. I also included it in the repo via Git Large File Storage (https://git-lfs.github.com/), so it will eventually reach GitHub. With this data set the train error does come out lower than the test error, as I assume it should:

[plot: PCA reconstruction error vs. number of principal components on the background-random MNIST variant, train (red circles) below test (blue stars)]

Here is the code I ran:

clear;clc;
%load('data_MNIST_original_minist_60k_10k_split_train_test');
load('mnist_background_random_test');
load('mnist_background_random_train');
X_train = mnist_background_random_train(:,1:784);
X_test = mnist_background_random_test(:,1:784);
[N_train, D] = size(X_train) % (N_train x D)
[N_test, D] = size(X_test) % (N_test x D)
%%
[coeff, ~, ~, ~, ~, mu] = pca(X_train); % (D x R) = Rows of X_train correspond to observations and columns to variables. 

num_centers = 5; % number of different K values (numbers of principal components) to try
start_centers = 10;
end_centers = 500;
centers = floor(linspace(start_centers, end_centers, num_centers));

train_errors = zeros(1,num_centers);
test_errors = zeros(1,num_centers);
for i=1:num_centers
    K = centers(i);
    %% preparing model
    U = coeff(:,1:K); % (D x K) = K pca's of dimension D
    X_train_T = X_train'; % (D x N) = (N_train x D)'
    X_test_T = X_test'; % (D x N) = (N_test x D)'
    X_tilde_train = (U * U' * X_train_T); % project onto first K PCs and reconstruct: (D x N_train) = (D x K) x (K x D) x (D x N_train)
    X_tilde_test = (U * U' * X_test_T); % (D x N_test) = (D x K) x (K x D) x (D x N_test)
    train_error_PCA = (1/N_train)*norm( X_tilde_train - X_train_T ,'fro')^2;
    test_error_PCA = (1/N_test)*norm( X_tilde_test - X_test_T ,'fro')^2;
    train_errors(i) = train_error_PCA;
    test_errors(i) = test_error_PCA;
end
fig = figure;
plot(centers, train_errors, '-ro', centers, test_errors, '-b*');
legend('train errors', 'test errors');
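
To judge whether the gap between the two curves means anything, the same per-image noise check as above could be wrapped around this loop (again just a sketch, not what produced the plot):

% sketch: standard-error bars per K, using per-image squared errors
train_se = zeros(1, num_centers);
test_se = zeros(1, num_centers);
for i=1:num_centers
    U = coeff(:, 1:centers(i));
    R_train = U * (U' * X_train') - X_train'; % (D x N_train) residuals
    R_test = U * (U' * X_test') - X_test'; % (D x N_test) residuals
    train_se(i) = std( sum(R_train.^2, 1) ) / sqrt(N_train);
    test_se(i) = std( sum(R_test.^2, 1) ) / sqrt(N_test);
end
figure;
errorbar(centers, train_errors, train_se, '-ro'); hold on;
errorbar(centers, test_errors, test_se, '-b*');
legend('train errors', 'test errors');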

I will leave the question up because I am not sure whether I have truly fixed the error or not.

  • I don't see how the answer to the other question applies to mine. That question suggests I might need more training, which would make sense for iterative methods trained with gradient descent, but PCA and kernel methods are fit in a single step with linear algebra rather than iteratively, so how does that apply to my scenario? – Charlie Parker May 16 '16 at 13:36
  • Also, my scenario is quite different: auto-encoders don't (usually, except maybe variational auto-encoders) produce that classic curve where at some point the test error starts to increase while the train error keeps decreasing. Usually both keep decreasing the more "centers" or principal components we allow. – Charlie Parker May 16 '16 at 13:43
