Notes:
- The term "simple" in this context is defined in the "In the context of this question" section below;
- I believe this question is not really a dataset request, even though the title makes it sound like one...
Main body:
Datasets I have seen usually fall into one of the following three categories:
- Category I: "simple" models work (i.e., prediction consistently better than random guessing, even by a tiny margin such as 1%), and machine learning/deep learning models work better;
- Category II: "simple" models work, and machine learning/deep learning models perform more or less the same;
- Category III: "simple" models don't work (i.e., prediction no better than random guessing), and machine learning/deep learning models don't work either.
This pattern applies not only to tabular data but also to some simple image classification tasks (though you may argue that images are, underneath, tabular data as well, since we flatten the high-dimensional arrays to 1-D anyway). I tried cats-vs-dogs classification: logistic regression does better than random guessing, although it is much worse than a CNN (so it falls into Category I). Since I work mostly with tabular data, I have not tried more complicated image classification tasks.
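For concreteness, here is a minimal sketch of the flatten-then-fit baseline check I mean, using scikit-learn's bundled 8x8 digits as a stand-in (the cats-and-dogs data needs an external download, but the step is the same):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 8x8 digit images, already flattened to 64 "tabular" columns.
X, y = load_digits(return_X_y=True)  # X has shape (n_samples, 64)

# Logistic regression on raw pixels, compared against the 10% chance level
# for 10 balanced classes; beating chance is what puts a task in Category I.
acc = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean()
print(f"logistic regression accuracy: {acc:.3f} (chance = 0.100)")
```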
Let's say we restrict the discussion to supervised classification on tabular datasets, that is, datasets with a few (or a LOT, if you wish...) columns as X and one categorical column as y. Under this constraint, do you know of any publicly available or synthetically generated dataset on which a "simple" model does not work at all, but a more complex model does better (even just slightly) than random guessing?
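To make the ask concrete, here is a minimal sketch of the acceptance test I have in mind; `make_classification` below is only a placeholder for the dataset I am asking for, which should leave all the simple models at the chance line while the complex one clears it:

```python
import numpy as np
from sklearn.datasets import make_classification  # placeholder for the dataset I want
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Placeholder data; the question asks for a dataset where every "simple"
# model stays at chance below, but the complex model beats it.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Accuracy of a uniform random guess over the observed classes.
chance = 1.0 / len(np.unique(y))

models = {
    "logistic (simple)": LogisticRegression(max_iter=1000),
    "tree (simple)": DecisionTreeClassifier(random_state=0),
    "knn (simple)": KNeighborsClassifier(),
    "mlp (complex)": MLPClassifier(max_iter=2000, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f} (chance = {chance:.3f})")
```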
In the context of this question:
- "simple" means linear models (e.g., OLS, Ridge, Lasso, Logistic, etc) plus decision tree and k-NN. These models are considered "simple" mostly because they are computationally inexpensive;
- "linear" means linear with respect to X and no feature engineering, such as $x_2 = x_1^2$, is allowed. So that one can not add non-linearity to the model by manually creating new features from old olds--I believe this makes my question easier; otherwise feature engineering allows OLS to go much further;
- I confine the question to classification tasks only, because the meaning of "random guess" is not very clear for a regression problem (do we guess by mean? median? mode? frequency? and how about continuity and differentiability?). Please let me know if I am wrong (see the baseline sketch after this list);
- Overfitting-control techniques such as cross-validation are allowed, just in case a synthetic dataset tries to trick a simple model into severe overfitting, which is not the intent of this question (see the cross-validation sketch after this list);
- SVM, gradient boosting, etc. are not considered "simple", but if you have, or can generate, a synthetic dataset that defeats these models yet cannot defeat a neural network, I would be even more interested to see what such a dataset looks like.
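To illustrate the feature-engineering loophole mentioned in the "linear" bullet: on a concentric-circles dataset, plain logistic regression sits at the 50% chance level, but hand-adding the squared features makes it nearly perfect. (This is not an answer to the question itself, since a decision tree or k-NN handles circles easily; it only shows why engineered features are banned.) A sketch:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)

# Plain logistic regression: the classes are not linearly separable,
# so accuracy stays near the 50% chance level.
raw = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Manually engineered squared features make the boundary
# x1^2 + x2^2 = r^2 linear in the new feature space.
X_eng = np.hstack([X, X**2])
eng = cross_val_score(LogisticRegression(), X_eng, y, cv=5).mean()

print(f"raw features: {raw:.3f}, with squared features: {eng:.3f}")
```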
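As for what "random guess" means operationally in the classification setting, here is a small sketch of the uniform-guess baseline I have in mind, with scikit-learn's DummyClassifier, plus the majority-class baseline for comparison:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# An imbalanced two-class toy problem, just to show the two baselines differ.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)

# "uniform" guesses each class with equal probability (my "random guess");
# "most_frequent" always predicts the majority class.
for strategy in ("uniform", "most_frequent"):
    acc = cross_val_score(DummyClassifier(strategy=strategy), X, y, cv=5).mean()
    print(f"{strategy}: {acc:.3f}")
```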
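And on overfitting control: the check I have in mind is plain k-fold cross-validation, e.g. comparing a decision tree's training accuracy against its cross-validated accuracy, so that a synthetic dataset cannot make a simple model look like it works merely through memorization. A sketch:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)                       # typically ~1.0: memorization
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # honest out-of-sample estimate

print(f"train: {train_acc:.3f}, 5-fold CV: {cv_acc:.3f}")
```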
Datasets tried:
- vanilla hand-written digit recognition: Category I
- obfuscated hand-written digit recognition: Category I
- obfuscated hand-written digit recognition with binary label: Category I
EDIT 1: The original term "linear" is changed to "simple", which reflects what I want to ask more accurately;
EDIT 2: k-NN is added to the list of "simple" models.