While working my way through M. Nielsen's "Neural Networks and Deep Learning", I decided to try out some presumably silly things, in order to really understand why they don't work and/or why they're not a good idea.
The problem at hand is recognizing digits from the MNIST database. I built a network in Python with 784 input neurons, 10 sigmoid output neurons, and the cross-entropy cost function. I trained it with plain gradient descent, backpropagating over the entire training set (50 000 images) for each update. This is the silly part: I know there's not much upside to doing full-batch training, but I still wanted to compare the behaviour of this network for different batch sizes. Here's the code for reference:
import numpy as np
import random

# Import data
import mnist_loader
training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()

# Auxiliary functions
def sigmoid(arg):
    return 1.0/(1.0 + np.exp(-arg))

def sigmoid_prime(arg):
    return sigmoid(arg)*(1.0 - sigmoid(arg))

def feedforward(w, x, b):
    return sigmoid(np.matmul(w, x) + b)

def cross_entropy(arg, y):
    return np.nan_to_num(-y*np.log(arg) - (1.0-y)*np.log(1.0-arg))

# Initialize weights and biases
w = np.random.rand(10, 784)
b = np.random.rand(10, 1)

# Hyper-parameters
eta = 0.5
iterations_number = 100

for counter in range(iterations_number):
    increment_w = 0
    increment_b = 0
    for k in range(50000):
        x = training_data[k][0]
        y = training_data[k][1]
        # Feedforward
        output = feedforward(w, x, b)
        # Backpropagate
        increment_b += output - y
        increment_w += np.matmul(output - y, x.T)
    # Gradient step
    w -= (eta/50000)*increment_w
    b -= (eta/50000)*increment_b
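For reference, the full-batch update above is meant to implement the usual gradient of the cross-entropy cost with a sigmoid output layer (averaged over the 50 000 training examples); unless I've slipped up in the calculus, that gradient is

$$\delta = a - y, \qquad \frac{\partial C}{\partial w} = \delta\, x^{T}, \qquad \frac{\partial C}{\partial b} = \delta,$$

where $a = \sigma(w x + b)$ is the output of the network.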
This works pretty decently, given how simple it is: accuracy on the test data is a hair above 90%. Encouraged by how it went, I moved on to a more complicated network: I added a single hidden layer of 30 sigmoid neurons and trained the new network as above, again backpropagating over the entire dataset. Here's the code again (I left out the "Import data" and "Auxiliary functions" parts; they remain exactly the same as above):
# Initialize weights and biases
r = 1
w2 = r*np.random.rand(30, 784)
w3 = r*np.random.rand(10, 30)
b2 = r*np.random.rand(30, 1)
b3 = r*np.random.rand(10, 1)

# Hyper-parameters
eta = 0.5
iterations_number = 30

for counter in range(iterations_number):
    cost = 0
    increment_w2 = 0
    increment_w3 = 0
    increment_b2 = 0
    increment_b3 = 0
    for k in range(50000):
        x = training_data[k][0]
        y = training_data[k][1]
        # Feedforward
        a2 = feedforward(w2, x, b2)
        a3 = feedforward(w3, a2, b3)
        # Backpropagate
        delta3 = a3 - y
        delta2 = np.matmul(w3.T, delta3)*sigmoid_prime(a2)
        increment_w3 += np.matmul(delta3, a2.T)
        increment_w2 += np.matmul(delta2, x.T)
        increment_b3 += delta3
        increment_b2 += delta2
        # Increment cost
        cost += sum(cross_entropy(a3, y))
    # Gradient step
    w2 -= (eta/50000)*increment_w2
    w3 -= (eta/50000)*increment_w3
    b2 -= (eta/50000)*increment_b2
    b3 -= (eta/50000)*increment_b3
    # Cost function tracking
    print 'Cost no. ' + str(counter) + ': ' + str((1.0/50000)*cost)
    # Testing accuracy on test data
    test_results = []
    for k in range(10000):
        x = test_data[k][0]
        y = test_data[k][1]
        a2 = feedforward(w2, x, b2)
        a3 = feedforward(w3, a2, b3)
        test_results.append((np.argmax(a3), y))
    final = sum(int(x == y) for (x, y) in test_results)
    print 'Iteration no. ' + str(counter) + ': ' + str(final)
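For reference, the backpropagation equations I am trying to implement here are the standard ones from the book, specialised to this two-layer network with the cross-entropy cost:

$$\delta^{3} = a^{3} - y, \qquad \delta^{2} = \left( (w^{3})^{T} \delta^{3} \right) \odot \sigma'(z^{2}),$$

$$\frac{\partial C}{\partial w^{3}} = \delta^{3} (a^{2})^{T}, \qquad \frac{\partial C}{\partial w^{2}} = \delta^{2} x^{T}, \qquad \frac{\partial C}{\partial b^{3}} = \delta^{3}, \qquad \frac{\partial C}{\partial b^{2}} = \delta^{2},$$

where $z^{2} = w^{2} x + b^{2}$, $a^{2} = \sigma(z^{2})$ and $a^{3} = \sigma(w^{3} a^{2} + b^{3})$.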
But this fails horribly! The cost function decreases very quickly, yet no learning occurs at all: the network predicts that every single digit in the test data is a 1. I've spent quite some time trying to understand what is going on here, but I'm lost.
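To illustrate the failure, here is (roughly) the quick check I ran after training; pred_counts is just a tally I added for this post, everything else is defined in the code above:

# Tally the predicted classes on the test set after training
pred_counts = np.zeros(10, dtype=int)
for k in range(10000):
    x = test_data[k][0]
    a2 = feedforward(w2, x, b2)
    a3 = feedforward(w3, a2, b3)
    pred_counts[np.argmax(a3)] += 1
print pred_counts  # every test image ends up predicted as a 1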
So, my question is: where does this lack of learning come from? Any help will be greatly appreciated!