734

Is there a standard and accepted method for selecting the number of layers, and the number of nodes in each layer, in a feed-forward neural network? I'm interested in automated ways of building neural networks.

Post Self
Rob Hyndman
  • 10
    Among all the great answers, I found this paper helpful: http://dstath.users.uth.gr/papers/IJRS2009_Stathakis.pdf – Debpriya Seal Jun 24 '17 at 01:27
  • 1
    @DebpriyaSeal not *that* useful though... – DarkCygnus Aug 16 '17 at 18:51
  • Come to party a bit late. But this is still an open research topic as part of AutoML or AutoDL and NAS (Neural Architecture Search). There is no universal answer for this question yet. – msuzen Jun 25 '21 at 10:23

11 Answers

611

I realize this question has been answered, but I don't think the extant answer really engages the question beyond pointing to a link generally related to the question's subject matter. In particular, the link describes one technique for programmatic network configuration, but that is not a "[a] standard and accepted method" for network configuration.

By following a small set of clear rules, one can programmatically set a competent network architecture (i.e., the number and type of neuronal layers and the number of neurons comprising each layer). Following this schema will give you a competent architecture, but probably not an optimal one.

But once this network is initialized, you can iteratively tune the configuration during training using a number of ancillary algorithms; one family of these works by pruning nodes based on (small) values of the weight vector after a certain number of training epochs--in other words, eliminating unnecessary/redundant nodes (more on this below).

So every NN has three types of layers: input, hidden, and output.


Creating the NN architecture therefore means coming up with values for the number of layers of each type and the number of nodes in each of these layers.

The Input Layer

Simple--every NN has exactly one of them--no exceptions that I'm aware of.

With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number of features (columns) in your data. Some NN configurations add one additional node for a bias term.


The Output Layer

Like the Input layer, every NN has exactly one output layer. Determining its size (number of neurons) is simple; it is completely determined by the chosen model configuration.

Is your NN going to run in Machine Mode (classification) or Regression Mode? (The ML convention of re-using a term that also exists in statistics but assigning a different meaning to it is very confusing.) Machine mode returns a class label (e.g., "Premium Account"/"Basic Account"); regression mode returns a value (e.g., price).

If the NN is a regressor, then the output layer has a single node.

If the NN is a classifier, then it also has a single node, unless softmax is used, in which case the output layer has one node per class label in your model.

The Hidden Layers

So those few rules set the number of layers and size (neurons/layer) for both the input and output layers. That leaves the hidden layers.

How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. Of course, you don't need an NN to resolve your data either, but it will still do the job.

Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the insanely thorough and insightful NN FAQ for an excellent summary of that commentary). One issue within this subject on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very few. One hidden layer is sufficient for the large majority of problems.

So what about the size of the hidden layer(s)--how many neurons? There are some empirically derived rules of thumb; of these, the most commonly relied on is 'the optimal size of the hidden layer is usually between the size of the input and the size of the output layers'. Jeff Heaton, author of Introduction to Neural Networks in Java, offers a few more.

In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.
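For illustration, here is a minimal Python sketch of those two rules, assuming scikit-learn and a synthetic make_classification dataset; the sizes and hyperparameters are placeholders, not recommendations.

```python
# A minimal sketch (assumptions: scikit-learn, a synthetic make_classification dataset)
# of the two rules above: one hidden layer, sized as the mean of input and output sizes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_inputs = X.shape[1]                    # input layer size = number of features
n_outputs = 1                            # single output node (binary classification)
n_hidden = (n_inputs + n_outputs) // 2   # rule (ii): mean of input and output layer sizes

model = MLPClassifier(hidden_layer_sizes=(n_hidden,),   # rule (i): exactly one hidden layer
                      max_iter=1000, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```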


Optimization of the Network Configuration

Pruning describes a set of techniques to trim network size (by nodes, not layers) to improve computational performance and sometimes resolution performance. The gist of these techniques is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look for weights very close to zero--it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training, then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes--in other words, when deciding on a network architecture, err on the side of more neurons if you add a pruning step.
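As a rough illustration of the "inspect the weight matrix" idea (a heuristic sketch, not one of the formal pruning algorithms), the snippet below flags hidden units whose incoming and outgoing weights are all near zero after training; the toy model and the threshold are assumptions.

```python
# A rough illustration (not a formal pruning algorithm): after training, flag hidden
# units whose incoming and outgoing weights are all near zero. The toy model and the
# threshold of 1e-2 are assumptions chosen for the example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(30,), max_iter=1000, random_state=0).fit(X, y)

W_in, W_out = model.coefs_   # weight matrices: input->hidden and hidden->output
threshold = 1e-2             # arbitrary cutoff for "very close to zero"
prunable = [j for j in range(W_in.shape[1])
            if np.all(np.abs(W_in[:, j]) < threshold)
            and np.all(np.abs(W_out[j, :]) < threshold)]
print("candidate hidden units to prune:", prunable)
```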

Put another way, by applying a pruning algorithm to your network during training, you can approach the optimal network configuration; whether you can do that in a single "up-front" pass (such as with a genetic-algorithm-based approach) I don't know, though I do know that for now, this two-step optimization is more common.

sapo_cosmico
doug
  • Where's the famous NN FAQ located? :) – André Laszlo Jul 12 '11 at 01:16
  • 44
    You state that the majority of problems need only one hidden layer. Perhaps it is better to say that NNs with more hidden layers are extremely hard to train (if you want to know how, check the publications of Hinton's group at U of Toronto, "deep learning") and thus those problems that require more than one hidden layer are considered "non-solvable" by neural networks. – bayerj Jul 12 '11 at 12:50
  • 1
    @bayerj, what is the extra benefit from the 'deep learning' networks? – Vass Feb 12 '12 at 18:06
  • 1
    The same functions can be represented with exponentially less parameters, leading to better generalization. – bayerj Feb 14 '12 at 18:10
  • 24
    You write *If the NN is a regressor, then the output layer has a single node.*. Why only a single node? Why can't I have multiple continuous outputs? – gerrit Nov 12 '12 at 15:46
  • 9
    @gerrit You can definitely have multiple continuous outputs if your target output is vector-valued. Defining an appropriate loss function for vector-valued outputs can be a bit trickier than with one output though. – lmjohns3 Aug 29 '13 at 23:23
  • 1
    Where trickier means: you basically sum the individual errors. Yet, they might need to be weighted, otherwise one error dominates everything else (e.g. because it has a much larger range). ZScores on the training outputs is one obvious way to do this, but might not be what you want. – bayerj Aug 30 '13 at 06:54
  • 7
    I thought it was the opposite than this: _If the NN is a classifier, then it also has a single node unless softmax is used in which case the output layer has one node per class label in your model._ – dawid Jan 17 '14 at 01:04
  • 1
    You could construct a neural network with multiple input layers if you wanted to. The first layer would have to be an input layer, but successive layers could be comprised of both hidden neurons and input neurons, or, if you will, you can have an additional input layer on the same level as a hidden layer. – HelloGoodbye Jan 11 '16 at 21:17
  • 10
    @doug Thank you for this wonderful answer. This allowed me to reduce my ANN from 3 hidden layers down to 1 and achieve the same classification accuracy by setting the right number of hidden neurons... I just used the average of the input and output summed together. Thanks! – rayryeng Feb 07 '17 at 21:58
  • @davips No, it is written correctly here. [Softmax](https://en.wikipedia.org/wiki/Softmax_function) works similar to a sigmoid or other map to [-1,1] space, except that it maps to multiple dimensions/outputs, just as Doug mentions. – Mike Williamson Aug 04 '17 at 15:46
  • 1
    @MikeWilliamson I suspect it's a terminology/usage "issue" with a lot of the common ML frameworks. It seems common to have a bunch of "output" neurons, then you softmax them, and then you pick the highest one as the single output; and people incorrectly lump the "pick highest" into softmax. – mbrig Nov 23 '17 at 17:04
  • 1
    "Simple--every NN has exactly one of them--no exceptions that I'm aware of." Not true, here's an example in the Keras documentation of two input (and two output) layers: https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models – user1993951 Mar 23 '18 at 19:21
  • Does the answer change in any way if the MLP is for approximating the Q function in a RL task? – hipoglucido Apr 25 '18 at 11:58
  • Are "ablation tests" a form of pruning? – CMCDragonkai May 10 '18 at 03:27
  • Does someone have a paper or a book where the claim that "One hidden layer is sufficient for the large majority of problems" is proved, or could be cited? Thank you! – ZelelB Feb 06 '19 at 12:27
  • 1
    This paper argues that if you don't have any hidden layer wider than the input layer, you won't be able to form disconnected decision regions in the input space. So, I think it should be beneficial to experiment with that instead of just the average between the input and output layer. https://arxiv.org/abs/1803.00094 – Ahmed Maher May 01 '20 at 11:00
  • Thanks for this nice answer!! So let us consider the following example: Model 1) 500 input units, 100 hidden units, 200 output units; Model 2) 500 input units, 100 hidden units, 5 hidden units (second hidden layer), 200 output units. The benefit of Model 1 compared to Model 2 is: faster training, lower generalization error (?); and the benefit of Model 2 compared to Model 1: due to the high variance in the number of units, I'm a bit unsure how to put the effect into words. Can you think of an advantage of this model (compared to Model 1)? – tubmaster May 22 '20 at 11:29
  • 1
    This answer was written in August of 2010. The field of Machine Learning has experienced much progress since then. I'd value an updated answer to this question, especially insofar it relates to choosing the number of hidden layers and their parameters. – Tamay Jun 24 '21 at 13:33
186

@doug's answer has worked for me. There's one additional rule of thumb that helps for supervised learning problems. You can usually prevent over-fitting if you keep your number of neurons below:

$$N_h = \frac{N_s}{\alpha (N_i + N_o)}$$

$N_i$ = number of input neurons.
$N_o$ = number of output neurons.
$N_s$ = number of samples in training data set.
$\alpha$ = an arbitrary scaling factor usually 2-10.

Others recommend setting $\alpha$ to a value between 5 and 10, but I find a value of 2 will often work without overfitting. You can think of $\alpha$ as the effective branching factor or number of nonzero weights for each neuron. Dropout layers will bring the "effective" branching factor way down from the actual mean branching factor for your network.

As explained by this excellent NN Design text, you want to limit the number of free parameters in your model (its degree or number of nonzero weights) to a small portion of the degrees of freedom in your data. The degrees of freedom in your data is the number of samples times the degrees of freedom (dimensions) in each sample, or $N_s \cdot (N_i + N_o)$ (assuming they're all independent). So $\alpha$ is a way to indicate how general you want your model to be, or how much you want to prevent overfitting.

For an automated procedure you'd start with an $\alpha$ of 2 (twice as many degrees of freedom in your training data as your model) and work your way up to 10 if the error (loss) for your training dataset is significantly smaller than for your test dataset.
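For reference, a direct transcription of this rule of thumb in Python; the sample count, layer sizes, and $\alpha$ in the example call are arbitrary.

```python
# A direct transcription of the rule of thumb above; the sample count, layer sizes,
# and alpha below are example values, not recommendations.
def max_hidden_neurons(n_samples, n_inputs, n_outputs, alpha=2):
    """Upper bound on hidden neurons: N_h = N_s / (alpha * (N_i + N_o))."""
    return n_samples // (alpha * (n_inputs + n_outputs))

print(max_hidden_neurons(n_samples=10_000, n_inputs=20, n_outputs=1, alpha=2))  # -> 238
```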

hobs
  • 12
    This formula is very interesting and helpful. Is there any reference for this formula? It would be more helpful. – prashanth Feb 16 '16 at 14:14
  • 2
    @prashanth I combined several assertions and formulas in the NN Design text referenced above. But I don't think it's explicitly called out in the form I show. And my version is a very crude approximation with a lot of simplifying assumptions. So YMMV. – hobs Feb 16 '16 at 17:30
  • I don't see how training set size is relevant to this. What if your test set becomes larger later? Besides you want something that generalizes not something that can fit your training data. – kon psych Feb 22 '16 at 06:50
  • @konpsych If your training set grows, and your model isn't fitting those new examples well, that's an opportunity to increase the number of neurons. However, you're right, theoretically. If you've developed a highly effective regularization approach (e.g. random dropout) and can generalize from a few examples to many test examples, then the training set DOF can be much less than your NN DOF. But I couldn't implement such an efficient "generalization engine" myself. The lowest I could go on "alpha" was a value of 2. – hobs Feb 22 '16 at 14:51
  • 1
    First I wanted to write training set instead of test set in previous comment. Maybe this formula makes sense if we are to read it as "you need at least that many neurons to learn enough features (the DOF you mentioned) from dataset". If the features of dataset are representative of population and how well the model can generalize maybe it's a different question (but an important one). – kon psych Feb 22 '16 at 22:07
  • 1
    Yes @konpsych. I assumed a representative training and test set, so yes, my mention of generalization doesn't make sense for discrete inputs, but does make sense for real valued input and output (regression) where generalization can happen in the space between representative samples. Your wording is better. As an approximate "rule of thumb", this formula worked for me on my regression problems, ensuring my models didn't "memorize" the input/output correspondence. – hobs Nov 20 '16 at 20:19
  • @hobs What happens when you have a big training set, something like 5 million rows? – Cyberguille Jan 24 '17 at 22:30
  • @Cyberguille A **lot** happens ;) Start with as few degrees of freedom as you think the phenomenon you are modeling requires. Work your way up to the "limit" in this formula. You can expand your model "capacity" or degrees of freedom by adding more neurons, layers, weights (connections) in order to improve performance. Continue to "regularize" your model with random dropouts, stochastic activation functions, weight cost, etc., especially as you approach or exceed the "limit" in this rule of thumb. – hobs Jan 25 '17 at 01:15
  • @hobs But the time complexity increases when you add more neurons; maybe I would need to run this on a good computer for a long time. So the time complexity depends on the input (length of the training set), and that's not good. Maybe it works for a not-so-big training set. – Cyberguille Jan 26 '17 at 16:41
  • 1
    @Cyberguille Yes. You can't optimize for both accuracy and complexity simultaneously. The **upper** limit doesn't assume anything about resource constraints. Let me know if you find a good rule of thumb that balances computational limitations as well as accuracy for you. – hobs Jan 27 '17 at 00:00
  • 1
    This formula implies that if the number of training samples is really large, e.g. 1 million, then the number of hidden nodes would be in the order of hundreds of thousands. Is that really advisable? – Hamman Samuel Feb 11 '17 at 14:44
  • 1
    Hello, formula is interesting, what if I have multiple networks and multiple inputs but single outputs. For example BiDAF network or Siamese Network. Hidden layers should be calculated separately for each network or one size should be applied to all the networks by taking mean. – Wazzzy May 08 '17 at 06:22
  • 1
    @Wazzzy definitely calculate for individual networks. And this formula is just a very rough starting point. Real NNs with state-of-the-art performance are able to expand their degrees of freedom (number of nonzero weights) to much greater than what their dataset would normally support (with this formula) by using other regularization approaches, like "random dropout", to spread the learning around the network. – hobs May 08 '17 at 17:02
  • 4
    Are you sure this is a good estimate for networks with more than one hidden layer? Isn't it the case than for multiple hidden layers the number of parameters is much greater than $N_h \cdot (N_i + N_o)$? – Mateusz May 24 '17 at 22:37
  • 1
    Yes, @mateus, you'd have to iteratively apply this rule-of-thumb to each hidden layer if you want a better estimate. But perhaps the interdependence of the weights in each layer reduces the growth rate in the variety/complexity of models as layers are added. Even if iteratively applied to each layer, this rule-of-thumb is less and less useful as you add more layers. – hobs May 25 '17 at 19:33
  • 3
    @mateus, perhaps a slightly better rule of thumb for multiple layers is the `N_h` (average number of hidden neurons per layer) solution to this `N_s = (N_i + N_o) * N_h ^ N_hidden_layers`. But I still wouldn't use this formula. It's only for very basic problems (toy problems) when you don't plan to implement any other regularization approaches. – hobs May 25 '17 at 19:40
  • 1
    @hobs, if I understand your formula correctly for 400.000 samples in the training set, that have 4 features for doing binary classification I would need: `400000 / (10 * (4 + 1)) == 8000` ~8000 neurons in the hidden layer - is this correct estimation? – MaxU - stop WAR against UA Mar 19 '18 at 13:48
  • 1
    @MaxU Yea that's right. That's a good number to start with, but that's a lot of neurons and a lot of weights for only 4 degrees of freedom in each sample. So if your data has a lot of redundant samples in it you might find this is too many neurons. You can use PCA on subsets of your data (treating each sample as a new dimension in your PCA) to estimate the linear redundancy in your data. Let us know how it works out for you. – hobs Mar 19 '18 at 21:46
  • 1
    @hobs do you mean with "hidden neurons", the total hidden neurons in all the hidden layers of the NN, or just in one hidden layer? I mean, let's say I have 2 hidden layers, and Nh = 50 for me. Does that mean that I could (should) have 50 hidden neurons in the first hidden layer, and 50 hidden neurons in the second, or together they should be 50? like 25 in the first and 25 in the second? – ZelelB Jan 16 '19 at 19:24
  • @ZelelB This rule of thumb is horribly inaccurate. I no longer use this formula for anything except a single fully-connected hidden layer. So if you have a multi-layer network or you use a CNN or RNN you have to limit the total number of neurons and weights based on intuition or somebody else's guidance. Your best bet is to use scholar.google.com to find out the number of neurons and layers (and weights or DOF) used successfully by others. – hobs Jan 17 '19 at 20:50
  • @HammanSamuel I don't know. It depends on how much random dropout you use, and the nature of the underlying "physics" of the relationship between your data and your target variable and how much computational horsepower and time you're willing to spend on a problem. Always start with the simplest possible network that you can imagine ever working on your problem. This is a very crude upper limit, not a recommended starting point for network size or architecture (though I know that's what the OP asked). – hobs Jan 17 '19 at 20:56
  • 1
    I'm training a network to take a 50x50 pixel image and predict the bounds of a shape that appears in it. With 6,000 examples in my training set and an alpha value of just 2, N_h = 6000/(2*(2500+4)) = 1.20. So I could have 1.20 neurons in my network? Either I'm missing something or this doesn't scale well. – spaaarky21 Feb 04 '20 at 08:32
  • @spaarky21 yes, it's for fully-connected (Dense) ANNs not CNNs. Dense networks don't work well for most CV problems. 1 neuron might be appropriate for your problem if you think a Dense NN can solve it. It's certainly a good place to start. 1 neuron is equivalent to a logistic regression. – hobs Feb 07 '20 at 00:29
79

From Introduction to Neural Networks for Java (second edition) by Jeff Heaton - preview freely available at Google Books and previously at author's website:

The Number of Hidden Layers

There are really two decisions that must be made regarding the hidden layers: how many hidden layers to actually have in the neural network and how many neurons will be in each of these layers. We will first examine how to determine the number of hidden layers to use with the neural network.

Problems that require two hidden layers are rarely encountered. However, neural networks with two hidden layers can represent functions with any kind of shape. There is currently no theoretical reason to use neural networks with any more than two hidden layers. In fact, for many practical problems, there is no reason to use any more than one hidden layer. Table 5.1 summarizes the capabilities of neural network architectures with various hidden layers.

Table 5.1: Determining the Number of Hidden Layers

| Number of Hidden Layers | Result |
| --- | --- |
| 0 | Only capable of representing linear separable functions or decisions. |
| 1 | Can approximate any function that contains a continuous mapping from one finite space to another. |
| 2 | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy. |

Deciding the number of hidden neuron layers is only a small part of the problem. You must also determine how many neurons will be in each of these hidden layers. This process is covered in the next section.

The Number of Neurons in the Hidden Layers

Deciding the number of neurons in the hidden layers is a very important part of deciding your overall neural network architecture. Though these layers do not directly interact with the external environment, they have a tremendous influence on the final output. Both the number of hidden layers and the number of neurons in each of these hidden layers must be carefully considered.

Using too few neurons in the hidden layers will result in something called underfitting. Underfitting occurs when there are too few neurons in the hidden layers to adequately detect the signals in a complicated data set.

Using too many neurons in the hidden layers can result in several problems. First, too many neurons in the hidden layers may result in overfitting. Overfitting occurs when the neural network has so much information processing capacity that the limited amount of information contained in the training set is not enough to train all of the neurons in the hidden layers. A second problem can occur even when the training data is sufficient. An inordinately large number of neurons in the hidden layers can increase the time it takes to train the network. The amount of training time can increase to the point that it is impossible to adequately train the neural network. Obviously, some compromise must be reached between too many and too few neurons in the hidden layers.

There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers, such as the following:

  • The number of hidden neurons should be between the size of the input layer and the size of the output layer.
  • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
  • The number of hidden neurons should be less than twice the size of the input layer.

These three rules provide a starting point for you to consider. Ultimately, the selection of an architecture for your neural network will come down to trial and error. But what exactly is meant by trial and error? You do not want to start throwing random numbers of layers and neurons at your network. To do so would be very time consuming. Chapter 8, “Pruning a Neural Network” will explore various ways to determine an optimal structure for a neural network.
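For convenience, here is a small Python sketch (not from the book) that writes the three quoted rules out as code; the input and output sizes are placeholders.

```python
# The three quoted rules of thumb written out as code (not from the book itself);
# the input/output sizes are placeholders.
def heaton_rules(n_in, n_out):
    return {
        "between input and output size": sorted((n_out, n_in)),  # any value in this range
        "2/3 of input size plus output size": round(2 / 3 * n_in) + n_out,
        "less than twice the input size": 2 * n_in - 1,          # i.e., at most this many
    }

print(heaton_rules(n_in=10, n_out=2))
# {'between input and output size': [2, 10], '2/3 of input size plus output size': 9,
#  'less than twice the input size': 19}
```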


I also like the following snippet from an answer I found at researchgate.net, which conveys a lot in just a few words:

Steffen B Petersen · Aalborg University

[...]

In order to secure the ability of the network to generalize, the number of nodes has to be kept as low as possible. If you have a large excess of nodes, your network becomes a memory bank that can recall the training set to perfection, but does not perform well on samples that were not part of the training set.

JfredoJ
  • 1
    Do you happen to know the source of the quote of Steffen B Petersen? – Sebastian Nielsen Oct 06 '18 at 10:19
  • 1
    I am sorry I don't. I tried searching for it but I couldn't find it... I think the article has been removed from the web. Maybe you can contact him directly? – JfredoJ Oct 09 '18 at 22:28
  • Shouldn't the size of the training set be taken into account? I have a tabular dataset with ~300,000 unique samples (car prices). The input layer has 89 nodes. Training a network with no regularization and only 89 nodes in a single hidden layer, I get the training loss to plateau after a few epochs. RMSE plateaus at ~$1,800 (single output node is price in this regression problem). – rodrigo-silveira Jul 09 '19 at 02:48
  • 2
    I think the source of the quote by Steffen B Petersen was here: https://www.researchgate.net/post/How_to_decide_the_number_of_hidden_layers_and_nodes_in_a_hidden_layer – TripleAntigen Sep 26 '19 at 08:59
  • Though this answer mostly copies material from a book (whose reference is given in the answer), it does address the question. – Rahul Jha Aug 16 '20 at 15:28
  • The universal approximation theorem states that a neural network with 2 hidden layers can approximate any function, provided "arbitrary width" and suitable activation functions (both are unavailable and might be hard to train in practice). The reason not to limit the number of hidden layers to 2 is that one would then need an extremely wide network. And if one also limits its width to between the input size and the output size, then the neural network would also work poorly in practice. – Minh Khôi Mar 27 '21 at 08:46
48

I am working on an empirical study of this at the moment (approaching a processor-century of simulations on our HPC facility!). My advice would be to use a "large" network and regularisation; if you use regularisation then the network architecture becomes less important (provided it is large enough to represent the underlying function we want to capture), but you do need to tune the regularisation parameter properly.

One of the problems with architecture selection is that it is a discrete, rather than continuous, control of the complexity of the model, and therefore can be a bit of a blunt instrument, especially when the ideal complexity is low.

However, this is all subject to the "no free lunch" theorems: while regularisation is effective in most cases, there will always be cases where architecture selection works better, and the only way to find out if that is true of the problem at hand is to try both approaches and cross-validate.
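To make that strategy concrete, here is a minimal scikit-learn sketch that fixes a deliberately large architecture and tunes only the regularisation parameter (the L2 penalty, alpha) by cross-validation; the architecture, penalty grid, and synthetic dataset are assumptions, not values from this answer.

```python
# A minimal sketch of the "large network + tuned regularisation" strategy using
# scikit-learn; the fixed architecture, the grid of L2 penalties (alpha), and the
# synthetic dataset are assumptions, not values from the answer.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(200,),          # deliberately "large"
                    max_iter=2000, random_state=0)
search = GridSearchCV(net, {"alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]}, cv=5)
search.fit(X, y)
print("best alpha:", search.best_params_, "CV accuracy:", round(search.best_score_, 3))
```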

If I were to build an automated neural network builder, I would use Radford Neal's Hybrid Monte Carlo (HMC) sampling-based Bayesian approach, and use a large network and integrate over the weights rather than optimise the weights of a single network. However, that is computationally expensive and a bit of a "black art", but the results Prof. Neal achieves suggest it is worth it!

Dikran Marsupial
  • 3
    "I am working on an empirical study of this at the moment" - Is there any update? – Martin Thoma Apr 24 '17 at 12:16
  • 2
    no, 'fraid not, I'd still recommend large(ish) network and regularisation, but there is no silver bullet, some problems don't need regularisation, but some datasets need hidden layer size tuning as well as regularisation. Sadly reviewers didn't like the paper :-( – Dikran Marsupial Apr 24 '17 at 13:07
18

• Number of hidden nodes: There is no magic formula for selecting the optimum number of hidden neurons. However, some rules of thumb are available for calculating the number of hidden neurons. A rough approximation can be obtained by the geometric pyramid rule proposed by Masters (1993). For a three-layer network with $n$ input and $m$ output neurons, the hidden layer would have $\sqrt{n \cdot m}$ neurons (a short sketch follows the references below).

Ref:

[1] Masters, Timothy. Practical Neural Network Recipes in C++. Morgan Kaufmann, 1993.

[2] http://www.iitbhu.ac.in/faculty/min/rajesh-rai/NMEICT-Slope/lecture/c14/l1.html
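A tiny Python transcription of the pyramid rule; the 15-input, 1-output example is illustrative only.

```python
# The geometric pyramid rule as code; the 15-input, 1-output example is illustrative only.
import math

def pyramid_rule(n_inputs, n_outputs):
    """Masters (1993): hidden neurons ~ sqrt(n_inputs * n_outputs)."""
    return round(math.sqrt(n_inputs * n_outputs))

print(pyramid_rule(15, 1))  # -> 4
```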

Ferdi
prashanth
  • This rule seemed to work quite well with a number of different data sets and three hidden layers. One thing is for certain, using this rule, the number of neurons in the hidden layer(s) will be less than the number of input features (size $n$). –  Jun 02 '19 at 20:38
  • If I am using 15 input parameters which finally return one output, then as per your formula the hidden layer would have only 3-4 neurons for this model. Is that so? – Rahul Jha Aug 16 '20 at 15:48
  • @RahulJha yes as per the rule. But take the result with a pinch of salt. Its just one of the many options available. Go through other answers as well. – prashanth Aug 18 '20 at 09:37
  • it is still a mystery for me. – Rahul Jha Aug 18 '20 at 13:04
16

As far as I know there is no way to automatically select the number of layers and neurons in each layer. But there are networks that can build their topology automatically, like EANNs (Evolutionary Artificial Neural Networks), which use genetic algorithms to evolve the topology.

There are several approaches; a more or less modern one that seemed to give good results was NEAT (NeuroEvolution of Augmenting Topologies).

Ferdi
Vicente Cartas
9

Sorry I can't post a comment yet so please bear with me. Anyway, I bumped into this discussion thread which reminded me of a paper I had seen very recently. I think it might be of interest to folks participating here:

AdaNet: Adaptive Structural Learning of Artificial Neural Networks

Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, Scott Yang ; Proceedings of the 34th International Conference on Machine Learning, PMLR 70:874-883, 2017.

Abstract We present a new framework for analyzing and learning artificial neural networks. Our approach simultaneously and adaptively learns both the structure of the network as well as its weights. The methodology is based upon and accompanied by strong data-dependent theoretical learning guarantees, so that the final network architecture provably adapts to the complexity of any given problem.

chainD
8

Automated ways of building neural networks using global hyper-parameter search:

Input and output layers are fixed size.

What can vary:

  • the number of layers
  • number of neurons in each layer
  • the type of layer

Multiple methods can be used for this discrete optimization problem, with the network's out-of-sample error as the cost function (a minimal sketch follows the list below).

  • 1) Grid/random search over the parameter space, to start from a slightly better position.
  • 2) Plenty of methods could be used for finding the optimal architecture. (Yes, it takes time.)
  • 3) Do some regularization, rinse, repeat.
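Here is a minimal sketch of point 1) above: a random search over depth and width scored by cross-validated out-of-sample error, using scikit-learn; the search space, synthetic dataset, and search budget are assumptions.

```python
# A minimal sketch of point 1) above: random search over depth and width, scored by
# cross-validated out-of-sample error. The search space, synthetic dataset, and
# search budget are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

# candidate architectures: 1-3 hidden layers, 8-128 neurons per layer
architectures = [(width,) * depth for depth in (1, 2, 3) for width in (8, 16, 32, 64, 128)]

search = RandomizedSearchCV(
    MLPClassifier(max_iter=1000, random_state=0),
    param_distributions={"hidden_layer_sizes": architectures,
                         "alpha": [1e-4, 1e-3, 1e-2]},   # a little regularization, point 3)
    n_iter=10, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```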
user91213
8

I've listed many ways of topology learning in my master's thesis, chapter 3. The big categories are:

  • Growing approaches
  • Pruning approaches
  • Genetic approaches
  • Reinforcement Learning
  • Convolutional Neural Fabrics
Martin Thoma
6

I'd like to suggest a less common but super effective method.

Basically, you can leverage a set of algorithms called "genetic algorithms" that try a small subset of the potential options (a random number of layers and nodes per layer). The algorithm then treats this population of options as "parents" that create children by combining/mutating one or more of the parents, much like organisms evolve. The best children, and some random OK children, are kept in each generation, and over the generations the fittest survive.

For ~100 or fewer parameters (such as the choice of the number of layers, types of layers, and the number of neurons per layer), this method is super effective. Use it by creating a number of potential network architectures for each generation and training them partially until the learning curve can be estimated (typically 100-10k mini-batches, depending on many parameters). After a few generations, you may want to consider the point at which the training and validation errors start to diverge significantly (overfitting) as your objective function for choosing children. It may be a good idea to use a very small subset of your data (10-20%) until you choose a final model, so you reach a conclusion faster. Also, use a single seed for your network initialization to properly compare the results.

10-50 generations should yield great results for a decent sized network.
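To make the idea concrete, here is a toy Python sketch of such a search (mutation only, no crossover, for brevity); the fitness function (cross-validated accuracy of a small scikit-learn MLP on synthetic data), the candidate widths, and the population and generation sizes are all assumptions for illustration.

```python
# A toy sketch of such a search (mutation only, no crossover, for brevity). The fitness
# function (cross-validated accuracy of a small scikit-learn MLP on synthetic data), the
# candidate widths, and the population/generation sizes are all assumptions.
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
WIDTHS = (4, 8, 16, 32, 64)

def random_arch():
    """A random architecture: 1-3 hidden layers with a random width each."""
    return tuple(random.choice(WIDTHS) for _ in range(random.randint(1, 3)))

def fitness(arch):
    """Score an architecture by 3-fold cross-validated accuracy."""
    net = MLPClassifier(hidden_layer_sizes=arch, max_iter=300, random_state=0)
    return cross_val_score(net, X, y, cv=3).mean()

def mutate(arch):
    """Change the width of one randomly chosen layer."""
    arch = list(arch)
    arch[random.randrange(len(arch))] = random.choice(WIDTHS)
    return tuple(arch)

population = [random_arch() for _ in range(6)]
for generation in range(3):
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[:3]                                   # keep the fittest
    children = [mutate(random.choice(parents)) for _ in range(3)]
    population = parents + children
    print("generation", generation, "best architecture so far:", scored[0])
```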

Sven Hohenstein
Dan Erez
  • Another very interesting way is Bayesian optimization which is also an extremely effective black-box optimization method for a relatively small number of parameters. https://arxiv.org/pdf/1206.2944.pdf – Dan Erez Dec 15 '17 at 15:03
6

Number of Hidden Layers and what they can achieve:

0 - Only capable of representing linear separable functions or decisions.

1 - Can approximate any function that contains a continuous mapping from one finite space to another.

2 - Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.

More than 2 - Additional layers can learn complex representations (a sort of automatic feature engineering) for later layers.

mkt
sapy