
I tried to build a neural net for learning XOR. The design is as follows:

1st layer: compute a linear function of the 4×2 input using a 2×2 weight matrix and a 1×2 bias.
2nd layer: apply the sigmoid elementwise to the 4×2 matrix from layer 1.
3rd layer: compute a linear function of the 2nd layer's output using a 2×1 weight matrix.
final layer: apply the sigmoid to the 4×1 vector from the previous layer (see the sketch after this list).
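
A minimal NumPy sketch of this setup (the initialization, learning rate, and the squared-error cost are illustrative assumptions, not necessarily the exact code used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four XOR examples as a 4x2 input matrix, with 4x1 targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 2))   # layer 1: 2x2 weights
b1 = rng.normal(size=(1, 2))   # layer 1: 1x2 bias
W2 = rng.normal(size=(2, 1))   # layer 3: 2x1 weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(1_000_000):
    # Forward pass: linear -> sigmoid -> linear -> sigmoid
    z1 = X @ W1 + b1       # 4x2
    a1 = sigmoid(z1)       # 4x2
    z2 = a1 @ W2           # 4x1
    a2 = sigmoid(z2)       # 4x1
    cost = 0.5 * np.mean((a2 - y) ** 2)

    # Backward pass, full-batch gradient descent
    d2 = (a2 - y) * a2 * (1 - a2) / len(X)   # 4x1
    dW2 = a1.T @ d2                          # 2x1
    d1 = (d2 @ W2.T) * a1 * (1 - a1)         # 4x2
    dW1 = X.T @ d1                           # 2x2
    db1 = d1.sum(axis=0, keepdims=True)      # 1x2

    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2

print(cost, a2.ravel())
```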

The model converges well, but only after as many as 1,000,000 iterations (I don't know whether that is too large a number for such a simple problem).

I get the following plot (X: iterations, Y: cost). It seems a little strange: why does it stay at the same cost value for so long? Is this a normal graph for such a problem, or is something wrong with my setup that causes this cost behavior? Thanks.

[Plot: cost vs. iterations]

kirgol
1 Answer


Gradient descent isn't a great optimizer. It's not unusual for GD to reach a plateau of very slow progress before finding a good descent direction again.

See this post for a comparison of GD and Levenberg-Marquardt for a simple linear regression network with poor conditioning.

See this post for an explanation of why poor conditioning makes gradient descent challenging.

In the case of this particular task, it's true that a small network with a single hidden layer of sigmoid neurons can solve XOR. However, there's a difference between the theoretical possibility of solving a task and whether it's practical to train a network to do so. One reason that networks are wider and deeper and use tricks like residual layers, batch norm, sophisticated optimizers, and ReLUs is to make them easier to train using gradient descent.
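
As a concrete illustration of why such choices matter (a hypothetical snippet, not the asker's code): with a sigmoid output unit, a squared-error cost scales the output gradient by a(1 − a), which nearly vanishes once the unit saturates, whereas a cross-entropy cost does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A saturated sigmoid output unit that should predict y = 0.
z, y = 4.0, 0.0
a = sigmoid(z)   # ~0.982

# d/dz of 0.5 * (a - y)**2: scaled by a*(1 - a), so it nearly vanishes.
grad_squared_error = (a - y) * a * (1 - a)   # ~0.017

# d/dz of the cross-entropy -[y*log(a) + (1-y)*log(1-a)]: the a*(1 - a)
# factor cancels, leaving a strong error signal.
grad_cross_entropy = a - y                   # ~0.982

print(grad_squared_error, grad_cross_entropy)
```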

Sycorax
  • Thank you for your response and links. I am aware of some limitations of GD, but since it's very popular and it's what I used, I just wanted to know whether something is wrong specifically with my setup and whether this graph pattern indicates some issue. – kirgol Dec 12 '19 at 19:01
  • I wouldn't be surprised if using cross-entropy loss improved the training time of the network because it has steeper gradients. – Sycorax Dec 12 '19 at 19:02
  • Yes, I've heard about it, but haven't implemented it yet. I'm starting from scratch with the simplest things. – kirgol Dec 12 '19 at 19:03