While working my way through M. Nielsen's "Neural Networks and Deep Learning", I decided to try out some presumably silly things, in order to really understand why they don't work and/or why they're not a good idea.
The problem at hand is recognizing digits from the MNIST database. I built a network in Python with 784 input neurons, 10 sigmoid output neurons, and the cross-entropy cost function. I trained it with plain gradient descent, backpropagating over the entire training set (50 000 images) for each update. This is the silly part: I know there's not much upside to doing full-batch training, but I still wanted to compare the behaviour of this network for different batch sizes. Here's the code for reference:
import numpy as np
import random

# Import data
import mnist_loader
training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()

# Auxiliary functions
def sigmoid(arg):
    return 1.0/(1.0 + np.exp(-arg))

def sigmoid_prime(arg):
    return sigmoid(arg)*(1.0 - sigmoid(arg))

def feedforward(w, x, b):
    return sigmoid(np.matmul(w, x) + b)

def cross_entropy(arg, y):
    return np.nan_to_num(-y*np.log(arg) - (1.0-y)*np.log(1.0-arg))

# Initialize weights and biases
w = np.random.rand(10, 784)
b = np.random.rand(10, 1)

# Hyper-parameters
eta = 0.5
iterations_number = 100

for counter in range(iterations_number):
    increment_w = 0
    increment_b = 0
    for k in range(50000):
        x = training_data[k][0]
        y = training_data[k][1]
        # Feedforward
        output = feedforward(w, x, b)
        # Backpropagate
        increment_b += output - y
        increment_w += np.matmul(output - y, x.T)
    # Gradient step
    w -= (eta/50000)*increment_w
    b -= (eta/50000)*increment_b
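For reference, the full-batch update above is meant to implement the usual gradient of the cross-entropy cost with a sigmoid output layer (averaged over the 50 000 training examples); unless I've slipped up in the calculus, that gradient is

$$\delta = a - y, \qquad \frac{\partial C}{\partial w} = \delta\, x^{T}, \qquad \frac{\partial C}{\partial b} = \delta,$$

where $a = \sigma(w x + b)$ is the output of the network.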
This works pretty decently, given how simple it is: accuracy on the test data is a hair above 90%. Encouraged by how it went, I moved on to a more complicated network: I added a single hidden layer of 30 sigmoid neurons and trained the new network as above, again backpropagating over the entire dataset. Here's the code again (I left out the "Import data" and "Auxiliary functions" parts; they remain exactly the same as above):
# Initialize weights and biases
r = 1
w2 = r*np.random.rand(30, 784)
w3 = r*np.random.rand(10, 30)
b2 = r*np.random.rand(30, 1)
b3 = r*np.random.rand(10, 1)

# Hyper-parameters
eta = 0.5
iterations_number = 30

for counter in range(iterations_number):
    cost = 0
    increment_w2 = 0
    increment_w3 = 0
    increment_b2 = 0
    increment_b3 = 0
    for k in range(50000):
        x = training_data[k][0]
        y = training_data[k][1]
        # Feedforward
        a2 = feedforward(w2, x, b2)
        a3 = feedforward(w3, a2, b3)
        # Backpropagate
        delta3 = a3 - y
        delta2 = np.matmul(w3.T, delta3)*sigmoid_prime(a2)
        increment_w3 += np.matmul(delta3, a2.T)
        increment_w2 += np.matmul(delta2, x.T)
        increment_b3 += delta3
        increment_b2 += delta2
        # Increment cost
        cost += sum(cross_entropy(a3, y))
    # Gradient step
    w2 -= (eta/50000)*increment_w2
    w3 -= (eta/50000)*increment_w3
    b2 -= (eta/50000)*increment_b2
    b3 -= (eta/50000)*increment_b3
    # Cost function tracking
    print 'Cost no. ' + str(counter) + ': ' + str((1.0/50000)*cost)
    # Testing accuracy on test data
    test_results = []
    for k in range(10000):
        x = test_data[k][0]
        y = test_data[k][1]
        a2 = feedforward(w2, x, b2)
        a3 = feedforward(w3, a2, b3)
        test_results.append((np.argmax(a3), y))
    final = sum(int(x == y) for (x, y) in test_results)
    print 'Iteration no. ' + str(counter) + ': ' + str(final)
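For reference, the backpropagation equations I am trying to implement here are the standard ones from the book, specialised to this two-layer network with the cross-entropy cost:

$$\delta^{3} = a^{3} - y, \qquad \delta^{2} = \left( (w^{3})^{T} \delta^{3} \right) \odot \sigma'(z^{2}),$$

$$\frac{\partial C}{\partial w^{3}} = \delta^{3} (a^{2})^{T}, \qquad \frac{\partial C}{\partial w^{2}} = \delta^{2} x^{T}, \qquad \frac{\partial C}{\partial b^{3}} = \delta^{3}, \qquad \frac{\partial C}{\partial b^{2}} = \delta^{2},$$

where $z^{2} = w^{2} x + b^{2}$, $a^{2} = \sigma(z^{2})$ and $a^{3} = \sigma(w^{3} a^{2} + b^{3})$.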
But this fails horribly! The cost function decreases very quickly, yet no learning occurs at all: the network predicts that every single digit in the test data is a 1. I've spent quite some time trying to understand what is going on here, but I'm lost.
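To illustrate the failure, here is (roughly) the quick check I ran after training; pred_counts is just a tally I added for this post, everything else is defined in the code above:

# Tally the predicted classes on the test set after training
pred_counts = np.zeros(10, dtype=int)
for k in range(10000):
    x = test_data[k][0]
    a2 = feedforward(w2, x, b2)
    a3 = feedforward(w3, a2, b3)
    pred_counts[np.argmax(a3)] += 1
print pred_counts  # every test image ends up predicted as a 1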
So, my question is: where does this lack of learning come from? Any help will be greatly appreciated!