I was recently reading a discussion amongst mathematicians/statisticians about machine learning and deep learning, and how these methods are applied by non-mathematicians/statisticians. The argument was that these methods are often applied incorrectly, since the people using them frequently lack the mathematical/statistical background to understand them. For instance, some machine learning methods, and certainly deep learning methods, require large amounts of data to produce good results, yet people who don't understand these methods often apply them without adequate amounts of data. It was then mentioned that this ignorance sometimes works out when you do have large amounts of data: having a lot of data reduces the need to understand the assumptions of the methods and tends to yield good results regardless. However, it was also said that if one wishes to use these methods in not-so-good conditions (say, in the absence of large amounts of data), it is still possible to get good results, but the statistical assumptions of the methods then become important, since you no longer have the large amounts of data to save/shield you.
As a novice, I want to research this further. What assumptions are being referred to here? In other words, what are the mathematical/statistical assumptions underlying these methods that one must understand in order to actually understand the methods and be able to apply them in not-so-good conditions? The first things that came to mind while reading this were the law of large numbers and the central limit theorem, i.e., the idea that the distribution of sample means approaches a normal distribution as the amount of data increases. Another, less concrete idea was that there is probably some assumption here related to the inequalities taught in probability theory for bounding probabilities, such as Cauchy-Schwarz, Jensen, etc. But since I am a novice, this is all I could come up with.
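To be concrete about what I have in mind (and I may well be mixing these up, given my background), the statements I'm thinking of are roughly

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\longrightarrow\; \mu \quad \text{as } n \to \infty \qquad \text{(law of large numbers)},$$

$$\sqrt{n}\,\left(\bar{X}_n - \mu\right) \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2) \qquad \text{(central limit theorem)},$$

for i.i.d. $X_1, X_2, \dots$ with mean $\mu$ and finite variance $\sigma^2$, plus inequalities like Jensen's, $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$ for convex $\varphi$, as examples of the probability bounds I mean.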
And please reference any research papers that discuss this! That would be much appreciated.
EDIT:
My understanding is that machine learning and deep learning are different (categories of) methods, so I've described them separately in case the underlying assumptions are different between them.
EDIT2:
If the assumptions depend on the specific method and are too many to list, then are there any general assumptions shared across all methods (such as the law of large numbers and the central limit theorem I mentioned)? A sampling of a few important methods, their assumptions, and relevant research papers would be a fine answer. Deep Learning in particular would be an interesting one, since it's said to require so much data: what if I wanted to use Deep Learning with limited data? What assumptions would I need to be aware of? (I've sketched the kind of limited-data situation I mean below.)
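To make the limited-data scenario concrete, here is the kind of toy experiment I have in mind (just a sketch on synthetic data, using scikit-learn's MLPClassifier as a stand-in for a small neural network; the specific setup is my own assumption, not something from the discussion I read):

```python
# Toy illustration of the limited-data regime: a small neural network
# fit on very few samples will typically score much better on its
# training set than on held-out data, which is exactly the situation
# where I'd expect the underlying assumptions to start mattering.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for a "small" dataset (100 samples, 20 features).
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Even a modest network has far more parameters than training samples here.
model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```

The gap between those two scores is what I mean by "not-so-good conditions": which assumptions would I need to understand to say something trustworthy in a setting like this?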