If I understand correctly, I think you are noticing a general tendency in machine learning - that everything is guided by "trial and error", by "experimentation", rather than by some set of overarching rules.
In many cases, when asked "Why did you do X or Y", you might not be able to give an answer any better than "well, that just works best on the data".
If I understand your question correctly, it's a super interesting, philosophical one that probably doesn't have a single, straightforward answer. I will give my thoughts on the subject, though.
I think there are a huge number of possible perspectives, but here is one. Our world is an insanely complex interaction of innumerable physical laws. The results of these interactions often have noticeable distributions, correlations, etc.
Machine learning is not specific to any field - it is simply a "universal"/general set of methods that attempt to extract patterns from data of any kind. Machine learning isn't built to incorporate the laws of physics, or to take into account the physical/chemical/psychological/geological reasons behind observations. It is simply built for "put in distributions of data, get predictions/structure as output".
There are any number of possible variable interactions, correlations, and other kinds of "informational structure" in the world, and machine learning attempts to capture this structure - but it is blind to the "reason" behind it. The job of a RandomForest isn't to discover why people click on Facebook ads. Its job is just to optimize a function representing its predictive accuracy on whether a given person will click or not.
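To make that concrete, here is a minimal sketch with entirely synthetic data (the click-prediction feature names in the comments are made up for illustration). The forest never sees the hidden rule that generates the labels; it only gets to optimize accuracy on them:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
# Three hypothetical user features, e.g. age, time_on_site, past_clicks.
X = rng.normal(size=(n, 3))
# A hidden "law" generates the clicks; the model is never told this rule.
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# All the forest ever reports back is a number: predictive accuracy.
print(cross_val_score(clf, X, y, cv=5).mean())
```

The forest may capture the quadratic dependence on the second feature just fine, yet nothing in its output says "because the generating rule was quadratic" - that explanation lives outside the model.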
As a result, it is (in many ways) an "art" - because as long as you're not violating a very small number of core assumptions (avoiding data leakage, etc.), it simply doesn't matter what you do with the data as long as it delivers a good result. Want to do feature selection first, then hyperparameter tuning? Try it. Want to do it the other way around? Try it! There are no laws to break here - there are only good and bad predictive results. You get bad results when the process doesn't allow the ML algorithm to capture useful patterns in the data (if there are any). Good results happen when (for any number of reasons) the algorithm was given information in a way that allowed it to capture enduring patterns/dependencies.
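Here is a hedged sketch of that "just try it" workflow in scikit-learn (the dataset, model, and parameter grid are arbitrary illustrative choices). Putting feature selection and the model into one pipeline means the search re-runs the selection inside every cross-validation fold, which is how you respect the one real rule (no leakage) while freely experimenting with everything else:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=5, random_state=0)

# Selection and model live in one pipeline, so every CV fold redoes the
# selection on its own training split - no label information leaks.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "How many features" and "how much regularization" are tuned jointly;
# there is no law dictating which to settle first.
grid = GridSearchCV(
    pipe,
    {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Whatever combination wins, the only justification on offer is the cross-validated score itself.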
"Why is the algorithm doing this?" . On a most base level, the answer will always just be "because of how the algorithm is designed to work + the data it was given". Any interpretation beyond that will involve subject matter expertise and careful and deep dives into the data. The algorithm doesn't care about that interpretation - it's just optimizing a function for the data that you give it. Any interpretation on top of that is your attempt at understanding the process that generated the data, and how that could have led to the algorithm noticing X or Y.
I know this was a bit vague and hand-wavy, but I hope it gives some space to think about ML broadly and why it is so experimentally focused.
At the moment, a huge amount of ML is focused on prediction - so as long as the model gives good predictions, anything else is, in a way, irrelevant. You just know: "The particular way I'm digesting the data and feeding it to the algorithm results in real-world information being combined in such a way that there are noticeable informational/mathematical dependencies being captured by this sequence of optimization steps I call 'my ML model'".
Important Addition: I think feature engineering (for predictive purposes) is a great illustration of the point. When engineering features, you don't necessarily need to be guided by any principles. Take the log() of the feature, normalize it, take a moving average, take a Z score. You can do all these things, and when you do (and if it works), you are left asking the same question - why did taking a Z score help? You might find an interpretable answer. But part of the answer will always be something like "Because, thanks to whatever physical laws govern the complex interaction you're observing, digesting the information in that particular way makes the pattern in the data clearer, so your model can pick it up more easily". I think that kind of thinking applies generally to this whole question.
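To illustrate, here is a hedged sketch under made-up assumptions: the hidden rule happens to be linear in log-space, so of the candidate transforms, log() "just works", and the cross-validated score is the only arbiter. We can see why it wins only because we wrote the generating rule ourselves; with real data you would be left with exactly the question above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 4000
X = rng.lognormal(mean=0.0, sigma=1.0, size=(n, 2))   # two skewed raw features
# Hidden rule: roughly "click iff x1 * x2 > 1", i.e. log(x1) + log(x2) > 0.
y = (np.log(X[:, 0]) + np.log(X[:, 1]) + rng.normal(scale=0.3, size=n) > 0).astype(int)

candidates = {
    "raw":     X,
    "log":     np.log(X),
    # Z-scoring on the full data is mild leakage; in real work, fit the
    # scaler inside each CV fold. Kept simple here for illustration.
    "z_score": (X - X.mean(axis=0)) / X.std(axis=0),
}
for name, feats in candidates.items():
    score = cross_val_score(LogisticRegression(max_iter=1000), feats, y, cv=5).mean()
    print(f"{name:8s} {score:.3f}")
```

The log transform should come out clearly ahead, and the Z score should barely matter - not because of any ML principle, but because of the particular (here, invented) law that generated the data.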