Newbie to neural networks

Question

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."

My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).

In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).

I have 9000 records of data spanning many years/students. Here are my questions:

1) How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?

2) If I split the data into an even number, say 9$\times$1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?

3) Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?

4) I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?

E.g.

750-800 = 10
700-740 = 9

etc.

Is there any benefit to doing this or should I just go ahead and try to predict the exact score?

What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?

5a) I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?

5b) What about -1(below 600), 0(exactly 600), 1(above 600), would this work?

5c) Would the network always output -1, 0, 1 - or would it output fractions that I would then have to round up or round down to finalize the prediction?

5d) Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?

6a) What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid)?

6b) What is the difference between log-sigmoid and zero-based log-sigmoid?

If your knowledge of machine learning or statistics is low, I suggest you avoid neural networks and use a simpler tool, for example random forest. – Donbeo, Sep 07 '14 at 10:37
I agree that he should avoid neural networks altogether but I don't consider random forest to be a simpler tool. I would recommend k-nearest neighbours or Naive Bayes. – Digio, Aug 20 '15 at 08:49
Even though this question is interesting, it has one fundamental problem: it contains too many questions and thus should be flagged as too broad. So what about that? – eliasah, Oct 22 '15 at 05:57

score 1 · Answer 1 · edited Apr 13 '17 at 12:44

1) How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?

I would recommend that you read about cross-validation. The concept is not mathematically heavy.

[Figure: illustration of cross-validation; image not preserved]

With regard to the selection of your subsets for cross-validation, I suspect the 9000 samples span quite a few years, over which the effects of different variables on your outcome may have changed. I would suggest making sure that each subset is spread evenly across the whole time span, rather than having the first subset consist exclusively of 1999 records and the final one contain only 2014 records.
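One way to get subsets that each cover the whole time span is to sort the records by year and deal them out like cards. The sketch below is only an illustration: the function name `assign_folds_by_year` and the `years` list are made up, and the data is invented rather than taken from the question.

```python
from collections import Counter

def assign_folds_by_year(years, n_folds=9):
    """Return a fold index per record so that every fold mixes all years.

    `years` is a hypothetical list holding the year of each student record.
    """
    # Sort record positions by year, then deal them out round-robin.
    order = sorted(range(len(years)), key=lambda i: years[i])
    folds = [0] * len(years)
    for rank, record_index in enumerate(order):
        folds[record_index] = rank % n_folds
    return folds

# Invented example: 90 records, 30 from each of three years.
years = [1999] * 30 + [2007] * 30 + [2014] * 30
folds = assign_folds_by_year(years, n_folds=9)

# Show which years ended up in the first fold.
fold0_years = Counter(y for y, f in zip(years, folds) if f == 0)
print(dict(fold0_years))
```

Every fold ends up with records from all three years, instead of one fold holding only the oldest records and another only the newest.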

2) If I split the data into an even number, say 9×1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?

After reading the cross-validation link above, I recommend that you have a look at @Dikran Marsupial's answer here. The summary is you should use the full data set to come up with the final model.

3) Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?

This is a good question and has been answered here.
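For reference, a z-score rescales a column so that it has mean 0 and standard deviation 1, which puts a 1-100 test and a 1-20 test on the same footing. A minimal sketch with invented scores (the helper names `fit_zscore` and `apply_zscore` are hypothetical, not part of any tool mentioned in the question):

```python
from statistics import mean, pstdev

def fit_zscore(values):
    """Return the mean and (population) standard deviation of one input column."""
    return mean(values), pstdev(values)

def apply_zscore(values, mu, sigma):
    """Rescale a column using a previously fitted mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Invented scores for the same four students on two differently scaled tests.
test_a = [55, 70, 85, 100]   # graded on 1-100
test_b = [8, 12, 16, 20]     # graded on 1-20

mu_a, sigma_a = fit_zscore(test_a)
mu_b, sigma_b = fit_zscore(test_b)
z_a = apply_zscore(test_a, mu_a, sigma_a)
z_b = apply_zscore(test_b, mu_b, sigma_b)

print([round(z, 2) for z in z_a])
print([round(z, 2) for z in z_b])
```

When predicting for new students, the usual practice is to reuse the mean and standard deviation fitted on the training data rather than recomputing them on the new records.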

4) I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?

I do not see any added value in discretising your outcome variable in this way. One thing to be careful about, though, is that if you do this, your different ranges will be ordinal: category 3 is closer to category 5 than it is to category 6. I am not sure about neural nets, but if you were fitting a logistic regression, you might want to train an ordered logistic regression model rather than a general one, which assumes the outcome variable is nominal (e.g. pass/fail, blue/red/green).

5b) What about -1(below 600), 0(exactly 600), 1(above 600), would this work?

I would advise against this. A score of exactly 600 almost never happens, so its probability will be close to zero, and you will not find many training examples with that exact score, so your model most likely will not be reliable for that category. If 600 is the pass/fail threshold, you can just train it as a binary classifier.
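Preparing the data for the binary version is a one-line recoding of the score column. A small sketch with invented scores; whether a score of exactly 600 counts as a pass is an assumption made here for illustration:

```python
# Invented SAT Verbal scores for five students.
scores = [480, 600, 610, 590, 720]

THRESHOLD = 600  # assumed pass mark; "at or above" counts as 1 in this sketch

# 1 = at or above the threshold, 0 = below it.
labels = [1 if s >= THRESHOLD else 0 for s in scores]

# How many records would fall into an "exactly 600" category.
exactly_600 = sum(1 for s in scores if s == THRESHOLD)

print(labels)
print(exactly_600)
```

Counting the "exactly 600" records in your own data is a quick way to see how few examples that middle category would have.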

score 0 · Answer 2 · answered Sep 07 '14 at 14:56

I'd recommend you do Andrew Ng's upcoming Coursera course: https://www.coursera.org/course/ml. I think most, if not all, of your questions will be covered in this course. It'll also get you away from using Neuro XL. After all, what will you do after the trial period runs out?

babelproofreader

I will definitely take a look at the course, but his course also requires some programming. NeuroXL is only $99 once it expires, so if I can understand some basics with the tool, it will work for me. – FTX Sep 07 '14 at 16:55
There is plenty of free, state-of-the-art NN software, such as `pylearn2` and `Theano`. – Marc Claesen Jan 21 '15 at 20:28

score 0 · Answer 3 · answered Sep 08 '14 at 13:35

2. Instead of splitting into 9 groups of 1,000, it's better to build 9 training sets of 8,000, where each sample is missing from exactly one of the nine. Then you can test each sample on the one network for which it was not used as a training example.

4. I'm not sure that it would be more accurate this way.
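The nine-way split described above can be sketched as follows. Each record is held out of exactly one training set, and that held-out set is what the corresponding network gets tested on. The function name `nine_fold_splits` is made up for this illustration, and no actual network is trained here.

```python
def nine_fold_splits(n_records, n_folds=9):
    """Yield (train_indices, test_indices); each record is held out exactly once."""
    for k in range(n_folds):
        test = [i for i in range(n_records) if i % n_folds == k]
        train = [i for i in range(n_records) if i % n_folds != k]
        yield train, test

splits = list(nine_fold_splits(9000))

# Count how often each record appears in a held-out (test) set.
held_out_counts = [0] * 9000
for train, test in splits:
    for i in test:
        held_out_counts[i] += 1

first_train, first_test = splits[0]
print(len(first_train), len(first_test))
print(set(held_out_counts))
```

Averaging the test error over the nine held-out sets gives an estimate of how well this kind of network should do on students it has never seen.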

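5a. Some neural networks can't output anything higher than 1.0 or less than 0, and with a sigmoid-type output they only approach those limits without ever reaching them. A target of 0.9 is easier for the network to hit than a target of 1. This may not apply to your case depending on how the network is set up.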
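To see why a target just short of 1 is easier, here is a small sketch. It assumes that "log-sigmoid" means the standard logistic function 1 / (1 + e^(-x)), which may or may not match what your software implements; `required_input` is a helper invented for this example.

```python
import math

def log_sigmoid(x):
    """Standard logistic function (assumed meaning of 'log-sigmoid')."""
    return 1.0 / (1.0 + math.exp(-x))

def required_input(target):
    """Input the output neuron needs in order to emit `target` (inverse of log_sigmoid)."""
    return math.log(target / (1.0 - target))

# The closer the target gets to 1, the larger the input the neuron needs.
for target in (0.9, 0.99, 0.999999):
    print(target, round(required_input(target), 2))
```

A target of exactly 1 would need an infinitely large input, so training keeps pushing the weights upward without ever getting there, whereas 0.9 is reached with a modest input.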

5b. I'm not sure that the exactly-600 category would happen often enough for there to be enough training examples. But you could certainly create three different ranges, e.g. below 550, 550-650 and above 650.

5c. It will output fractions. You can round to the nearest one. Sometimes you set it up to have multiple output neurons corresponding to the different output categories and you just choose the one with the highest response.
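Both ways of turning fractional outputs into a final prediction can be sketched in a few lines. The output values below are invented, and the helper names `round_to_class` and `pick_category` are hypothetical:

```python
def round_to_class(output, classes=(-1, 0, 1)):
    """Single output neuron: snap the fractional output to the nearest class value."""
    return min(classes, key=lambda c: abs(output - c))

def pick_category(outputs, labels):
    """One output neuron per category: choose the label with the highest response."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[best]

# Invented network outputs.
print(round_to_class(0.73))
print(round_to_class(-0.2))

labels = ["below 550", "550-650", "above 650"]
print(pick_category([0.1, 0.7, 0.2], labels))
```

With one output neuron per category you don't need to decide on cut-off points for rounding, since only the relative size of the outputs matters.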
