Setting up an MLP for binary classification with TensorFlow

Question

I am having trouble setting up a multilayer perceptron for binary classification using TensorFlow.

I have a very large dataset (about 1.5*10^6 examples), each example with a binary (0/1) label and 100 features. I need to set up a simple MLP and then vary the learning rate and the initialization pattern to document the results (it's an assignment). I am getting strange results, though: my MLP seems to get stuck at a low-but-not-great cost early on and never moves off it. With fairly low values of the learning rate, the cost goes to NaN almost immediately. I don't know whether the problem lies in how I structured the MLP (I made a few attempts and am posting the code for the last one) or whether I am missing something in my TensorFlow implementation.

CODE

import tensorflow as tf
import numpy as np
import scipy.io

# Import and transform dataset
print("Importing dataset.")
dataset = scipy.io.mmread('tfidf_tsvd.mtx')

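with open('labels.txt') as f:
    # Parse each line as a number so the labels match the float64 placeholder
    all_labels = [float(line) for line in f]

all_labels = np.asarray(all_labels, dtype=np.float64)
all_labels = all_labels.reshape((1498271,1))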

# Split dataset into training (66%) and test (33%) set
training_set    = dataset[0:1000000]
training_labels = all_labels[0:1000000]
test_set        = dataset[1000000:1498272]
test_labels     = all_labels[1000000:1498272]

print("Dataset ready.") 

# Parameters
learning_rate   = 0.01 #argv
mini_batch_size = 100
training_epochs = 10000
display_step    = 500

# Network Parameters
n_hidden_1  = 64    # 1st hidden layer of neurons
n_hidden_2  = 32    # 2nd hidden layer of neurons
n_hidden_3  = 16    # 3rd hidden layer of neurons
n_input     = 100   # number of features after LSA

# Tensorflow Graph input
x = tf.placeholder(tf.float64, shape=[None, n_input], name="x-data")
y = tf.placeholder(tf.float64, shape=[None, 1], name="y-labels")

print("Creating model.")

# Create model
def multilayer_perceptron(x, weights):
    # First hidden layer with SIGMOID activation
    layer_1 = tf.matmul(x, weights['h1'])
    layer_1 = tf.nn.sigmoid(layer_1)
    # Second hidden layer with SIGMOID activation
    layer_2 = tf.matmul(layer_1, weights['h2'])
    layer_2 = tf.nn.sigmoid(layer_2)
    # Third hidden layer with SIGMOID activation
    layer_3 = tf.matmul(layer_2, weights['h3'])
    layer_3 = tf.nn.sigmoid(layer_3)
    # Output layer with SIGMOID activation
    out_layer = tf.matmul(layer_3, weights['out'])
    out_layer = tf.nn.sigmoid(out_layer)
    return out_layer

# Layer weights, should change them to see results
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], dtype=np.float64)),       
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], dtype=np.float64)),
    'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3],dtype=np.float64)),
    'out': tf.Variable(tf.random_normal([n_hidden_3, 1], dtype=np.float64))
}

# Construct model
pred = multilayer_perceptron(x, weights)

# Define loss and optimizer
cost = tf.nn.l2_loss(pred-y,name="squared_error_cost")
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initializing the variables
init = tf.initialize_all_variables()

print("Model ready.")

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    print("Starting Training.")

    # Training cycle
    for epoch in range(training_epochs):
        #avg_cost = 0.
        # minibatch loading
        minibatch_x = training_set[mini_batch_size*epoch:mini_batch_size*(epoch+1)]
        minibatch_y = training_labels[mini_batch_size*epoch:mini_batch_size*(epoch+1)]
        # Run optimization op (backprop) and cost op
        _, c = sess.run([optimizer, cost], feed_dict={x: minibatch_x, y: minibatch_y})

        # Compute average loss
        avg_cost = c / (minibatch_x.shape[0])

        # Display logs every display_step iterations
        if epoch % display_step == 0:
            print("Epoch:", '%05d' % (epoch), "Training error=", "{:.9f}".format(avg_cost))

    print("Optimization Finished!")

    # Test model
    # Calculate accuracy
    test_error = tf.nn.l2_loss(pred-y,name="squared_error_test_cost")/test_set.shape[0]
    print("Test Error:", test_error.eval({x: test_set, y: test_labels}))

OUTPUT

python nn.py
Importing dataset.
Dataset ready.
Creating model.
Model ready.
Epoch: 00000 Training error= 0.110878121
Epoch: 00500 Training error= 0.119393080
Epoch: 01000 Training error= 0.109229532
Epoch: 01500 Training error= 0.100436962
Epoch: 02000 Training error= 0.113160662
Epoch: 02500 Training error= 0.114200962
Epoch: 03000 Training error= 0.109777990
Epoch: 03500 Training error= 0.108218725
Epoch: 04000 Training error= 0.103001394
Epoch: 04500 Training error= 0.084145737
Epoch: 05000 Training error= 0.119173495
Epoch: 05500 Training error= 0.095796251
Epoch: 06000 Training error= 0.093336573
Epoch: 06500 Training error= 0.085062860
Epoch: 07000 Training error= 0.104251661
Epoch: 07500 Training error= 0.105910949
Epoch: 08000 Training error= 0.090347288
Epoch: 08500 Training error= 0.124480612
Epoch: 09000 Training error= 0.109250224
Epoch: 09500 Training error= 0.100245836
Optimization Finished!
Test Error: 0.110234139674
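
One way to read these numbers: `tf.nn.l2_loss(t)` returns `sum(t ** 2) / 2`, so the printed value is half the per-example squared error. The sketch below uses NumPy only, with made-up labels rather than the real dataset, and computes the same quantity for a predictor that always outputs 0.5. The helper name `l2_loss` and the toy labels are assumptions for illustration.

```python
import numpy as np

def l2_loss(t):
    # Same definition as tf.nn.l2_loss: half the sum of squared entries
    return np.sum(t ** 2) / 2.0

# Toy 0/1 labels (hypothetical, not taken from the dataset in the question)
labels = np.array([[0.0], [1.0], [1.0], [0.0]])

# A "network" that ignores its input and always predicts 0.5
constant_pred = np.full_like(labels, 0.5)

# Per-example cost, computed the same way as avg_cost in the training loop
baseline = l2_loss(constant_pred - labels) / labels.shape[0]
print(baseline)  # 0.125
```

Every 0/1 label is exactly 0.5 away from the constant prediction, so this reference value is 0.125 regardless of the class balance. The training and test errors above (roughly 0.08 to 0.12) sit only slightly below it.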

score 0 · Answer 1 · answered Oct 02 '16 at 16:10

First, you might try filtering the 100 features down to a smaller number, since many of them may not be predictive of the outcome (0/1). One option is a chi-squared test of two proportions ($p_1$ for the proportion of ones in the output and $p_2$ for the proportion of ones in each feature), which gives you 100 chi-squared tests. Then use only the features whose p-values are not significant, since you want $p_2$ to be similar to $p_1$, not significantly different.
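
A minimal sketch of that filtering step, assuming the features are already 0/1 vectors (real-valued features, such as LSA components, would first have to be binarized somehow). It uses the pooled two-proportion test with one degree of freedom and treats the two proportions as if they came from independent samples. The function name `two_proportion_chi2`, the toy data, and the 0.05 threshold are all assumptions for illustration.

```python
import math
import numpy as np

def two_proportion_chi2(labels, feature):
    """Chi-squared test (1 d.o.f.) comparing the share of ones in two 0/1 vectors."""
    n1, n2 = len(labels), len(feature)
    p1, p2 = labels.mean(), feature.mean()
    pooled = (labels.sum() + feature.sum()) / (n1 + n2)
    variance = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if variance == 0.0:
        # Both vectors are all zeros or all ones, so the proportions coincide
        return 0.0, 1.0
    chi2 = (p1 - p2) ** 2 / variance
    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Toy data: 30% of the labels are ones
labels = np.zeros(5000)
labels[:1500] = 1.0
features = {
    "feature_a": np.roll(labels, 700),                             # 30% ones
    "feature_b": np.concatenate([np.ones(3000), np.zeros(2000)]),  # 60% ones
}

alpha = 0.05  # assumed significance threshold
p_values = {name: two_proportion_chi2(labels, col)[1] for name, col in features.items()}

# Keep the features whose proportion of ones is NOT significantly different
kept = [name for name, p in p_values.items() if p >= alpha]
print(p_values, kept)
```

Here `feature_a` has the same share of ones as the labels and is kept, while `feature_b` differs significantly and is dropped.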

Although regression uses dummy indicator variables to obtain the mean change in $y$ for a one-unit change in $x$, artificial neural networks (ANNs) don't always work well with purely binary or Boolean data, because training relies on many partial derivatives of the network error: with respect to the weights on the hidden-layer outputs (output side) and with respect to the input-side coefficients. Depending on the output-side transformation (softmax, linear) and the activation functions (tanh, logistic, linear, RBF), many ANNs expect input features with values in the range [-1,1]. So try rescaling the input feature values from [0,1] to [-1,1], and see how the results compare.
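
The rescaling could look like the sketch below, which maps each feature column to [-1,1] using the minimum and maximum seen in the training set. For inputs already in [0,1] this reduces to `2 * x - 1`. The helper names `fit_minmax` and `to_minus_one_one` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def fit_minmax(train):
    """Record the per-column minimum and range from the training data only."""
    lo = train.min(axis=0)
    hi = train.max(axis=0)
    # Constant columns get a span of 1 to avoid division by zero
    span = np.where(hi > lo, hi - lo, 1.0)
    return lo, span

def to_minus_one_one(data, lo, span):
    """Map values from [lo, lo + span] to [-1, 1], column by column."""
    return 2.0 * (data - lo) / span - 1.0

# Toy training matrix: 3 examples, 2 features on different scales
train = np.array([[0.0, 10.0],
                  [0.5, 20.0],
                  [1.0, 30.0]])

lo, span = fit_minmax(train)
scaled = to_minus_one_one(train, lo, span)
print(scaled)
```

The same `lo` and `span` should then be applied to the test set, so that both sets go through an identical transformation. Test values outside the training range will fall slightly outside [-1,1].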

Certainly, don't simply throw all 100 features into an ANN, since some features may not be predictive of class. Such features are useless and will slow the ANN's learning. Also, try using the softmax function on the output side, and either the linear or the logistic activation function on the hidden layer (input side). An ANN is like an engine: without the right combination of gas, air, and spark, it won't run. Filtering out bad features as a first step (those whose values don't predict the outcome singly, i.e. univariately) is like increasing the octane of the fuel.
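
To illustrate the softmax suggestion, here is a NumPy sketch of a two-unit softmax output with a cross-entropy cost. It is not a drop-in replacement for the TensorFlow code in the question. The function names and the toy logits are assumptions, and subtracting the row maximum is a common stabilization trick rather than something stated above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, class_ids):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(class_ids)), class_ids]
    return -np.mean(np.log(picked + 1e-12))  # small constant guards against log(0)

# Toy output-layer activations: one row per example, one column per class
logits = np.array([[2.0, 0.0],
                   [0.0, 0.0],
                   [-1000.0, 1000.0]])
class_ids = np.array([0, 1, 1])  # true class of each example

probs = softmax(logits)
loss = cross_entropy(probs, class_ids)
print(probs, loss)
```

With two classes, the softmax probability of class 0 equals the logistic function applied to the difference of the two logits. The two-unit softmax output therefore describes the same family of predictions as a single logistic output unit, and the main practical change is pairing it with a cross-entropy cost.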

Hey, thank you for your answer. The 100 features are what I obtained from the LSA algorithm applied to a dataset that had more than 29000 features (a dictionary). Do you think I should decrease the number of features, then? — Darkobra, Oct 02 '16 at 17:10
Yes, you could run correspondence analysis on the features (without the outcome variable) to try to reduce the dimensions down to 10-30. — , Oct 02 '16 at 17:20
Given the number of examples in the OP's dataset, 100 features is not shocking. — tagoma, Mar 29 '17 at 19:39
