
It's quite intuitive that most neural network topologies/architectures are not identifiable. But what are some well-known results in the field? Are there simple conditions which allow/prevent identifiability? For example,

  • all networks with nonlinear activation functions and more than one hidden layer are not identifiable
  • all networks with more than two hidden units are not identifiable

Or things like these. NOTE: I'm not saying that these conditions prevent identifiability (though they seem pretty good candidates to me). They are just examples of what I mean by "simple conditions".

If it helps to narrow down the question, feel free to consider only feed-forward and recurrent architectures. If this is still not enough, I'd be satisfied with an answer which covers at least one architecture among MLP, CNN and RNN. I had a quick look around on the Web, but it looks like the only discussion I could find was on Reddit. Come on, people, we can do better than Reddit ;-)

DeltaIV
  • what's the purpose of this academic exercise? – Aksakal Nov 30 '17 at 19:33
  • Can I please ask, what have you considered/examined from the existing literature? This seems like a very niche question; the very few relevant references I have seen are associated with the system identification literature rather than standard ML (e.g. [1](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.4648), [2](https://doi.org/10.1017/S0266466603195059), [3](http://ieeexplore.ieee.org/abstract/document/4017892)). Can you please define your question a bit more in the context of ML? Identifiability is mostly a Control Systems aspect; are you "just" referring to a 1-1 relation? – usεr11852 Nov 30 '17 at 19:53
  • I think you should be able to easily prove these results using the implicit function theorem. – Alex R. Nov 30 '17 at 20:33
  • @Aksakal what's the purpose of [computing the probability that the urn is empty at noon, after infinite steps in which 10 balls are added and one removed?](https://stats.stackexchange.com/questions/315502/at-each-step-of-a-limiting-infinite-process-put-10-balls-in-an-urn-and-remove-o). None, and yet the question was fun. Not all questions need to have practical relevance in order to be worth answering. Or you could say that the lack of identifiability prevents you from doing precise inference on the NN weights, but that would be a false justification, because almost no one is interested in... – DeltaIV Nov 30 '17 at 22:25
  • ...weights inference. Nearly everyone is interested in predictive accuracy. – DeltaIV Nov 30 '17 at 22:25
  • @DeltaIV think of the applications of ML such as self-driving cars. Does it matter that a NN model is identifiable if the car gets where it needs to go as often as humans do? So, in this setup it's not just predictive accuracy that matters, because the ultimate goal of machine learning is not about predictions. Can I predict what the traffic on I-66 will be? No idea. I know that I always get to work on time. So my brain's traffic model may not be identifiable, but it accomplishes goals. – Aksakal Nov 30 '17 at 22:31
  • @usεr11852 1-1 relation. I use the standard definition of identifiable statistical model: given a statistical model $S$ parametrized by a parameter vector $\boldsymbol{\theta}$, and indicating with $P_{\boldsymbol{\theta}}$ the joint probability distribution between inputs and outputs for a fixed $\boldsymbol{\theta}$, we say that $S$ is *identifiable* if $P_{\boldsymbol{\theta}_1}=P_{\boldsymbol{\theta}_2}\Rightarrow \boldsymbol{\theta}_1=\boldsymbol{\theta}_2$. – DeltaIV Nov 30 '17 at 22:42
  • @DeltaIV, it's a valid question for CV. The problem is that nobody cares to think about this stuff, I'm afraid. Everyone's busy building models and making money; when the models stop working, that's when unemployed AI thinkers will ponder identifiability. – Aksakal Nov 30 '17 at 22:45
  • @AlexR. you're probably right, however some cases I could think of seem to be just a problem of reparametrization. Suppose for example 1 hidden layer with just one linear neuron $Z$ and 1 output neuron $Y$, also linear. You have $Z=b_1+w_1X$ and $Y=b_2+w_2Z \Rightarrow Y=b_2+w_2b_1+w_2w_1X$. Obviously $w_2$ and $w_1$ are not separately identifiable, but you can just reparametrize the model and obtain an identifiable model with fewer parameters. I was looking for some "intrinsic" non-identifiability, which couldn't be resolved by simple reparametrizations. Anyway, I encourage you to write an answer! – DeltaIV Nov 30 '17 at 22:53
  • @AlexR. ps of course also $b_2$ and $w_2b_1$ are not separately identifiable. I was referring to $w_2$ and $w_1$ only, for the sake of brevity. – DeltaIV Nov 30 '17 at 22:55
  • @Aksakal I do agree with you on that. It's definitely a problem no one seems to care about. Maybe I'm more curious (or more nitpicky :) than the average Deep Learning practitioner. – DeltaIV Nov 30 '17 at 23:05
  • @DeltaIV: Thank you for the clarification. Yep, I agree with Aksakal on this; it is a pretty barren question at the moment compared with other NN-related concepts. I thought you were drawing on the concept of [System Identification](https://en.wikipedia.org/wiki/System_identification), which is much more active (that's why I asked what references you had seen so far), especially because of your reference to RNNs; maybe you want to check the concept of [Nonlinear system identification](https://en.wikipedia.org/wiki/Nonlinear_system_identification#Neural_networks) and the connections to NNs. – usεr11852 Dec 01 '17 at 00:14

2 Answers


There are at least $n!$ global optima when fitting a neural network with a single hidden layer of $n$ neurons. This comes from the fact that, if you exchange two neurons in a given layer and also exchange the weights attached to these neurons in the next layer, you obtain exactly the same fit.
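For instance, here is a minimal numpy sketch (the network sizes, variable names, and activation are illustrative, not part of the original argument) checking that all $n!$ permutations of the hidden neurons, applied jointly to the incoming and outgoing weights, give the same outputs:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs, n = 3 hidden neurons (tanh), 1 output.
X = rng.normal(size=(8, 2))
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
W2 = rng.normal(size=(3, 1))

def forward(W1, b1, W2):
    return np.tanh(X @ W1 + b1) @ W2

baseline = forward(W1, b1, W2)

# Apply each of the n! = 6 permutations to the hidden neurons
# (columns of W1, entries of b1) and to the matching rows of W2.
for perm in itertools.permutations(range(3)):
    p = list(perm)
    assert np.allclose(baseline, forward(W1[:, p], b1[p], W2[p, :]))
print("all 6 permutations of the hidden units give identical outputs")
```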

RUser4512

Linear, single-layer FFNs are non-identified

The question has since been edited to exclude this case; I retain it here because understanding the linear case is a simple example of the phenomenon of interest.

Consider a feedforward neural network with one hidden layer and all linear activations. The task is a simple OLS regression.

So we have the model $\hat{y}=X A B$ and the objective is $$ \min_{A,B} \frac{1}{2}|| y - X A B ||_2^2 $$

for some choice of $A, B$ of appropriate shape. $A$ is the input-to-hidden weights, and $B$ is the hidden-to-output weights.

Clearly the elements of the weight matrices are not identifiable in general, since any number of distinct pairs of matrices $A, B$ have the same product $AB$, and therefore give the same predictions and the same loss.
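To make this concrete, here is a small numpy sketch (shapes and names are illustrative): for any invertible matrix $M$, the pair $(AM, M^{-1}B)$ is a different parameterization that yields exactly the same predictions, and hence the same loss, as $(A, B)$.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(20, 5))   # design matrix
A = rng.normal(size=(5, 3))    # input-to-hidden weights
B = rng.normal(size=(3, 2))    # hidden-to-output weights

# Any invertible M gives a different pair (A M, M^{-1} B)
# whose product, and hence whose predictions, are unchanged.
M = rng.normal(size=(3, 3))
A2, B2 = A @ M, np.linalg.inv(M) @ B

print(np.allclose(X @ A @ B, X @ A2 @ B2))  # True
```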

Nonlinear, single-layer FFNs are still non-identified

Building up from the linear, single-layer FFN, we can also observe non-identifiability in the nonlinear, single-layer FFN.

As an example, adding a $\tanh$ nonlinearity to any of the linear activations creates a nonlinear network. This network is still non-identified, because permuting the weights of two (or more) neurons at one layer, together with their corresponding weights at the next layer, yields exactly the same network function and hence the same loss value.
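A quick numerical check of this permutation argument (again an illustrative numpy sketch with arbitrary sizes): swapping two $\tanh$ hidden units together with their outgoing weights leaves the squared-error loss unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.normal(size=(50, 4))
y = rng.normal(size=(50, 1))
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2 = rng.normal(size=(5, 1))

def loss(W1, b1, W2):
    resid = y - np.tanh(X @ W1 + b1) @ W2
    return 0.5 * np.sum(resid ** 2)

# Swap hidden units 0 and 3, and the corresponding rows of W2:
# two distinct weight vectors, one loss value.
swap = [3, 1, 2, 0, 4]
print(loss(W1, b1, W2))
print(loss(W1[:, swap], b1[swap], W2[swap, :]))  # same value
```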

In general, neural networks are non-identified

We can use the same reasoning to show that neural networks are non-identified in all but very particular parameterizations.

For example, there is no particular reason that convolutional filters must occur in any particular order. Nor is it required that convolutional filters have any particular sign, since subsequent weights could have the opposite sign to "reverse" that choice (this works, for instance, when the activation is an odd function such as $\tanh$).

Likewise, the units in an RNN can be permuted to obtain the same loss.
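The sign-flip symmetry mentioned above can be checked the same way; here is an illustrative numpy sketch using a dense $\tanh$ layer as a stand-in for a convolutional filter (names and sizes are assumptions of the sketch): flipping the sign of one unit's incoming weights and bias, and of its outgoing weights, leaves the network function unchanged because $\tanh(-z) = -\tanh(z)$.

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(size=(30, 6))
W1, b1 = rng.normal(size=(6, 4)), rng.normal(size=4)
W2 = rng.normal(size=(4, 2))

def forward(W1, b1, W2):
    return np.tanh(X @ W1 + b1) @ W2

# Flip the sign of unit 1's incoming weights and bias, and of its
# outgoing weights: since tanh(-z) = -tanh(z), the two flips cancel.
W1_f, b1_f, W2_f = W1.copy(), b1.copy(), W2.copy()
W1_f[:, 1] *= -1
b1_f[1] *= -1
W2_f[1, :] *= -1

print(np.allclose(forward(W1, b1, W2), forward(W1_f, b1_f, W2_f)))  # True
```

The same kind of check applies to permuting RNN hidden units, where the recurrent weight matrix needs the permutation applied to both its rows and its columns.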

See also: Can we use MLE to estimate Neural Network weights?

Sycorax
  • I was specifically excluding this case (linear activation functions) in the comments to my question, because it's trivial to obtain an identifiable model, starting from this one, which gives _exactly the same predictions_, with a simple reparametrization. It's not "intrinsically non-identifiable", so to speak. So I was specifically referring to nonlinear activation functions. But I reckon that I should include that in my question, not just leave it in the comments. In a few hours I will modify my question accordingly. – DeltaIV Jul 04 '18 at 15:46
  • It’s best practice to edit your question to clarify what you’re interested in knowing about. – Sycorax Jul 04 '18 at 16:15
  • you're right, I usually do, but this time I forgot. My bad. – DeltaIV Jul 05 '18 at 08:10