The answer to all of your questions is no.
Question 1
Supervised learning is often formulated as an optimization problem, but this is not always the case. In addition to the counterexamples Sycorax mentioned, there's an entire world of Bayesian models where learning doesn't necessarily involve optimization. Here, the goal is to estimate a probability distribution over parameters or function outputs, and this involves integrating over all possibilities rather than optimizing.
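To make the contrast concrete, here's a hedged toy sketch (synthetic coin-flip data, all numbers made up): the posterior over a coin's bias is obtained by normalizing, i.e. integrating, over the whole parameter space, and the prediction averages over that posterior rather than plugging in a single optimized estimate.

```python
import numpy as np

# Toy illustration (synthetic data): Bayesian learning for a coin's bias theta.
# Nothing is optimized; we integrate over all values of theta instead.
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=20)       # 20 coin flips, true bias 0.7
heads, n = data.sum(), len(data)

theta = np.linspace(1e-6, 1 - 1e-6, 2001)  # grid over the parameter space
prior = np.ones_like(theta)                # flat Beta(1, 1) prior
likelihood = theta**heads * (1 - theta)**(n - heads)

# Posterior via normalization (numerical integration), not via an optimizer.
unnorm = prior * likelihood
posterior = unnorm / unnorm.sum()

# Predictive probability of heads: an average over all plausible parameters ...
p_heads_bayes = (theta * posterior).sum()
# ... versus a point estimate obtained by optimization (here the MLE).
p_heads_mle = heads / n

print(p_heads_bayes, p_heads_mle)
```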
Question 2
One can define equivalence classes of methods however one likes. But, I'd argue that "the set of all methods that optimize an objective function" is not a very useful categorization, and that it neglects many important differences. Among them...
Neural nets are simply a class of functions. They can be used for supervised learning, but also for other purposes. In contrast, SVMs and random forests define both a class of functions (i.e. hypothesis space) and a learning algorithm (which is a procedure that maps datasets to functions in the hypothesis space). The SVM and random forest learning algorithms are specifically formulated in a supervised learning context. One can define variants of these methods that work in other contexts, but can't strictly call these variants 'SVMs' or 'random forests'. From this perspective, neural nets are not in the same category as SVMs and random forests.
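As a rough illustration of that distinction (the library choices here are mine, not part of any standard definition): a neural net can be written down as nothing more than a parameterized function with no learning procedure attached, whereas an SVM or random forest object only makes sense together with its supervised fitting procedure.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# A (tiny) neural net is just a parameterized function: nothing here says
# how, or even whether, the parameters get learned.
def neural_net(x, params):
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

# It can be paired with any objective: a supervised loss, a reinforcement-learning
# return, a density model's likelihood, and so on.

# By contrast, these objects are defined as a hypothesis space *plus* a specific
# supervised learning algorithm; "using an SVM" means calling .fit(X, y).
svm = SVC(kernel="rbf")
forest = RandomForestClassifier(n_estimators=100)
# svm.fit(X, y); forest.fit(X, y)   # supervised data is baked into the procedure
```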
In the context of supervised learning, neural nets, random forests, and SVMs have been widely observed to perform differently on different datasets. This is a consequence of different inductive biases. However, inductive bias also depends on many specific choices for each method (e.g. data preprocessing, network architecture, learning algorithm, kernel choice, hyperparameter optimization procedure, etc.).
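For instance, here's a quick (and deliberately simplistic) comparison on a synthetic dataset. The exact scores depend on every one of the choices listed above, so treat it as illustrative only.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Toy comparison on synthetic data; results vary with architecture, kernel,
# preprocessing, hyperparameters, random seed, etc.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "neural net": make_pipeline(StandardScaler(),
                                MLPClassifier(hidden_layer_sizes=(32,),
                                              max_iter=2000, random_state=0)),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))
```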
Additionally, there are important differences in the practical implementation of these methods (e.g. regarding learning/optimization and computational requirements).
Question 3
It's true that feedforward neural nets are universal function approximators (though this has a very specific meaning, and comes with caveats). It's also true that recurrent neural nets are Turing complete. First, note that these results apply to general classes of neural nets; a given instantiation or particular subclass may not have these properties. Second, these results don't say anything about learning from data. And third, they're of limited utility in an applied setting, as Sycorax mentioned.
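As a small illustration of what the universal approximation theorem does and doesn't promise, here's a hedged sketch: a single-hidden-layer network approximating a smooth function on a compact interval, which is the setting the theorem speaks to.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# One hidden layer approximating a smooth function on a compact interval.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
target = lambda x: np.sin(2 * x) + 0.5 * np.cos(5 * x)
y = target(X[:, 0])

net = MLPRegressor(hidden_layer_sizes=(200,), activation="tanh",
                   max_iter=5000, random_state=0).fit(X, y)

X_grid = np.linspace(-3, 3, 500).reshape(-1, 1)
err = np.max(np.abs(net.predict(X_grid) - target(X_grid[:, 0])))
print(f"max approximation error on [-3, 3]: {err:.3f}")
# The theorem guarantees that *some* width makes this error arbitrarily small;
# it does not guarantee that training will find those weights, and it says
# nothing about behaviour outside [-3, 3] or about generalization from finite data.
```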
Connection between neural nets and random forests
Various papers have explored the connections between neural nets and random forests or decision trees. For example, see Welbl (2014). Casting Random Forests as Artificial Neural Networks (and Profiting from It). Given a trained random forest, one can explicitly construct a neural net that implements the same function. I don't know whether or not there are papers exploring the reverse. But, one could make the following trivial argument: The class of decision trees of unbounded depth can represent any function by effectively acting as an infinite lookup table. The same property extends to random forests, since they're composed of decision trees. So, given any neural net, there exists a random forest that implements the same function. Of course, this isn't particularly interesting from a practical perspective.
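To make the forward direction concrete, here's a hedged sketch in the spirit of Welbl (2014) (the function names are mine, and this is not the paper's exact construction): a fitted decision tree can be evaluated exactly as a two-hidden-layer network with threshold units, one unit per split and one per leaf; a random forest is then just a sum or average of such networks.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def step(z):
    # Hard threshold; ties (z == 0) map to 0, matching sklearn's "x <= threshold goes left".
    return np.heaviside(z, 0.0)

def tree_as_network(tree):
    """Evaluate a fitted decision tree as a two-hidden-layer threshold network:
    layer 1 has one unit per split, layer 2 has one unit per leaf, and the
    output layer sums leaf values."""
    t = tree.tree_
    internal = [i for i in range(t.node_count) if t.children_left[i] != -1]
    unit = {node: k for k, node in enumerate(internal)}   # split node -> hidden unit

    # Record, for each leaf, the splits on its root-to-leaf path and the direction taken.
    paths = {}
    def walk(node, path):
        if t.children_left[node] == -1:
            paths[node] = path
        else:
            walk(t.children_left[node], path + [(node, -1.0)])   # went left
            walk(t.children_right[node], path + [(node, +1.0)])  # went right
    walk(0, [])

    def predict(X):
        # Layer 1: unit j fires iff the sample goes *right* at split j.
        Z1 = np.column_stack([step(X[:, t.feature[n]] - t.threshold[n]) for n in internal]) \
             if internal else np.zeros((X.shape[0], 0))
        out = np.zeros(X.shape[0])
        # Layer 2: a leaf unit fires iff every split on its path agrees with layer 1.
        for leaf, path in paths.items():
            n_right = sum(1 for _, d in path if d > 0)
            score = sum(d * Z1[:, unit[n]] for n, d in path) if path else np.zeros(X.shape[0])
            active = step(score - (n_right - 0.5))
            out += t.value[leaf].ravel()[0] * active
        return out

    return predict

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
net = tree_as_network(tree)
assert np.allclose(net(X), tree.predict(X))   # identical function on these inputs
```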
Connection between neural nets and SVMs
There's a well-known equivalence between kernel machines and feedforward neural nets with a single (nonlinear) hidden layer and a linear output. The universal approximation theorem applies to this class of networks. So, assuming we can use any kernel function, and restricting ourselves to conditions under which the UAT holds, a kernel machine can approximate any function a deeper network can. But, note that this doesn't imply that a deep net and a kernel machine would produce the same output when trained on a finite dataset.
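One concrete way to see the correspondence (a sketch using random Fourier features, not a claim about how practitioners actually train either model): a one-hidden-layer network with fixed random cosine units and a trained linear output layer approximates an RBF kernel machine, and the two agree more closely as the number of hidden units grows.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# One-hidden-layer network with fixed random cosine units + linear readout
# (random Fourier features) versus the corresponding RBF kernel machine.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.normal(size=500)

gamma, alpha = 0.5, 1e-2
kernel_machine = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha).fit(X, y)

# "Hidden layer": 2000 random, untrained cosine units; only the output layer is fit.
one_layer_net = make_pipeline(
    RBFSampler(gamma=gamma, n_components=2000, random_state=0),
    Ridge(alpha=alpha, fit_intercept=False),
).fit(X, y)

X_test = rng.uniform(-3, 3, size=(200, 2))
gap = np.max(np.abs(kernel_machine.predict(X_test) - one_layer_net.predict(X_test)))
print(f"max |kernel machine - one-hidden-layer net| on test points: {gap:.3f}")
# The two converge as the number of random units grows, but this says nothing
# about what a *deep*, fully trained network would learn on the same data.
```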