
I'm doing a course on CNNs by Andrew Ng, and in one of the lectures he said that, due to parameter sharing and sparsity of connections, a CNN has fewer parameters, which enables it to be trained with smaller training sets and also makes it less prone to overfitting.

As for the second part, i.e. that it makes it less prone to overfitting, I think it's because having fewer parameters makes the decision boundary less complex compared to one with more parameters. My conclusion: of two models with the same number of layers, the one with more hidden units will produce a more complex decision boundary, since it has more non-linear activation functions, and will therefore be more prone to overfitting.

But I don't understand why a CNN can be trained better than a standard NN when both are trained on small datasets.

Any help is highly appreciated.


1 Answer


It all boils down to the number of parameters in a given network.

More parameters mean a higher capacity for a model, i.e. it can approximate more complex functions (or have more complex decision boundaries, as you say). On the other hand, fewer parameters mean a lower capacity. The problem is that, ideally, you want the model to have just enough capacity to model all the useful aspects of the data, while not having enough capacity to model the noise in the data.

In the present case, if we have two models, a CNN and a fully-connected (FC) NN, the latter has many more parameters and thus a higher capacity. So if the CNN is capable of solving the problem, the more complex FC network is more prone to overfitting (because it has a higher capacity and can also model the underlying noise).
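
To make the difference in parameter count concrete, here is a quick back-of-the-envelope sketch in Python. The input size, kernel size, and filter count are arbitrary numbers picked for illustration, not figures from the course:

```python
# Parameter counts for one layer on a 32x32x3 input (illustrative numbers only).

h, w, c = 32, 32, 3            # input height, width, channels
k, filters = 3, 16             # 3x3 kernels, 16 filters

# Convolutional layer: each filter's weights are shared across all positions.
conv_params = (k * k * c) * filters + filters           # weights + biases
# -> (3*3*3)*16 + 16 = 448 parameters

# Fully-connected layer producing an output of the same size (32*32*16 units)
# from the flattened input (32*32*3 = 3072 units).
fc_params = (h * w * c) * (h * w * filters) + (h * w * filters)
# -> 3072 * 16384 + 16384 = 50,348,032 parameters

print(f"conv layer: {conv_params:,} params")   # conv layer: 448 params
print(f"fc layer:   {fc_params:,} params")     # fc layer:   50,348,032 params
```

Even for a single layer on a small image, the FC version has tens of millions of parameters where the convolutional one has a few hundred, which is exactly the gap in capacity described above.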

You can also think of it like this: a sufficiently high-capacity network has the ability to memorize a dataset (i.e. learn every single training sample without being able to generalize). FC networks, because they have more parameters, are more prone to this than CNNs.

Now, the last part has to do with the size of the dataset. Smaller datasets are easier to memorize (and thus easier to overfit on), while larger ones are harder. I mentioned previously that FC networks can memorize datasets; this is easier with smaller datasets. In fact, you can expect an FC network to almost certainly overfit on a small dataset.
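
Here is a minimal sketch of that memorization effect, assuming TensorFlow/Keras and MNIST as a stand-in dataset (the architecture and the 100-sample subset are my own illustrative choices, not anything from the course): a large FC network trained on a tiny training set typically drives training accuracy toward 100% while validation accuracy lags well behind.

```python
import tensorflow as tf

# Load MNIST and keep only a tiny training set to make memorization easy.
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train, x_val = x_train / 255.0, x_val / 255.0
x_small, y_small = x_train[:100], y_train[:100]

# A deliberately over-sized fully-connected network for 100 samples.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_small, y_small, epochs=50,
                    validation_data=(x_val, y_val), verbose=0)

# Expect training accuracy near 1.0 with much lower validation accuracy:
# the classic memorization / overfitting gap on a small dataset.
print("train acc:", history.history["accuracy"][-1])
print("val acc:  ", history.history["val_accuracy"][-1])
```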

For more on generalization, I'd recommend reading this post, where I analyze generalization in a bit more detail.
