Notes:
- The term "simple" in this context is defined in the "In the context of this question" section below;
- I believe this question is not really a dataset request, even though the title makes it sound like one...
Main body:
Datasets I have seen usually fall into one of the following three categories:
- Category I: "simple" models work (i.e., prediction consistently better than random guessing, even by a tiny margin such as 1%), and machine learning/deep learning models work better;
- Category II: "simple" models work, and machine learning/deep learning models perform more or less the same;
- Category III: "simple" models don't work (i.e., prediction no better than random guessing), and machine learning/deep learning models don't work either.
This pattern applies not only to tabular data but also to some simple image classification tasks (though you may argue that images are, underneath, tabular data as well, since we flatten the high-dimensional arrays to 1-D anyway). I tried cats-vs-dogs classification: logistic regression does better than random guessing, although it is much worse than a CNN (so it falls into Category I). Since I work mostly with tabular data, I have not tried more complicated image classification tasks.
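For concreteness, here is a minimal sketch of the flatten-then-fit baseline check I mean, using scikit-learn's bundled 8x8 digits as a stand-in (the cats-and-dogs data needs an external download, but the step is the same):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 8x8 digit images, already flattened to 64 "tabular" columns.
X, y = load_digits(return_X_y=True)  # X has shape (n_samples, 64)

# Logistic regression on raw pixels, compared against the 10% chance level
# for 10 balanced classes; beating chance is what puts a task in Category I.
acc = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean()
print(f"logistic regression accuracy: {acc:.3f} (chance = 0.100)")
```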
Let's say we restrict the discussion to supervised classification on tabular datasets, that is, datasets with a few (or a LOT, if you wish...) columns as X and one categorical column as y. Under this constraint, do you know of any publicly available or synthetically generated dataset on which a "simple" model does not work at all, but a more complex model does better (even just slightly) than random guessing?
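To make the ask concrete, here is a minimal sketch of the acceptance test I have in mind; `make_classification` below is only a placeholder for the dataset I am asking for, which should leave all the simple models at the chance line while the complex one clears it:

```python
import numpy as np
from sklearn.datasets import make_classification  # placeholder for the dataset I want
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Placeholder data; the question asks for a dataset where every "simple"
# model stays at chance below, but the complex model beats it.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Accuracy of a uniform random guess over the observed classes.
chance = 1.0 / len(np.unique(y))

models = {
    "logistic (simple)": LogisticRegression(max_iter=1000),
    "tree (simple)": DecisionTreeClassifier(random_state=0),
    "knn (simple)": KNeighborsClassifier(),
    "mlp (complex)": MLPClassifier(max_iter=2000, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f} (chance = {chance:.3f})")
```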
In the context of this question:
- "simple" means linear models (e.g., OLS, Ridge, Lasso, Logistic, etc) plus decision tree and k-NN. These models are considered "simple" mostly because they are computationally inexpensive;
- "linear" means linear with respect to X and no feature engineering, such as $x_2 = x_1^2$, is allowed. So that one can not add non-linearity to the model by manually creating new features from old olds--I believe this makes my question easier; otherwise feature engineering allows OLS to go much further;
- I confine the question to classification tasks only, because the meaning of "random guess" is not very clear for a regression problem (do we guess by mean? median? mode? frequency? and how about continuity and differentiability?). Please let me know if I am wrong (see the baseline sketch after this list);
- Overfitting-control techniques such as cross-validation are allowed, just in case a synthetic dataset tries to trick a simple model into severe overfitting, which is not the intent of this question (see the cross-validation sketch after this list);
- SVM, gradient boosting, etc. are not considered "simple", but if you have, or can generate, a synthetic dataset that defeats these models yet cannot defeat a neural network, I would be even more interested to see what such a dataset looks like.
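To illustrate the feature-engineering loophole mentioned in the "linear" bullet: on a concentric-circles dataset, plain logistic regression sits at the 50% chance level, but hand-adding the squared features makes it nearly perfect. (This is not an answer to the question itself, since a decision tree or k-NN handles circles easily; it only shows why engineered features are banned.) A sketch:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)

# Plain logistic regression: the classes are not linearly separable,
# so accuracy stays near the 50% chance level.
raw = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Manually engineered squared features make the boundary
# x1^2 + x2^2 = r^2 linear in the new feature space.
X_eng = np.hstack([X, X**2])
eng = cross_val_score(LogisticRegression(), X_eng, y, cv=5).mean()

print(f"raw features: {raw:.3f}, with squared features: {eng:.3f}")
```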
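As for what "random guess" means operationally in the classification setting, here is a small sketch of the uniform-guess baseline I have in mind, with scikit-learn's DummyClassifier, plus the majority-class baseline for comparison:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# An imbalanced two-class toy problem, just to show the two baselines differ.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)

# "uniform" guesses each class with equal probability (my "random guess");
# "most_frequent" always predicts the majority class.
for strategy in ("uniform", "most_frequent"):
    acc = cross_val_score(DummyClassifier(strategy=strategy), X, y, cv=5).mean()
    print(f"{strategy}: {acc:.3f}")
```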
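And on overfitting control: the check I have in mind is plain k-fold cross-validation, e.g. comparing a decision tree's training accuracy against its cross-validated accuracy, so that a synthetic dataset cannot make a simple model look like it works merely through memorization. A sketch:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)                       # typically ~1.0: memorization
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # honest out-of-sample estimate

print(f"train: {train_acc:.3f}, 5-fold CV: {cv_acc:.3f}")
```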
Datasets tried:
- vanilla hand-written digit recognition: Category I
- obfuscated hand-written digit recognition: Category I
- obfuscated hand-written digit recognition with binary label: Category I
EDIT 1: The original term "linear" is changed to "simple", which reflects what I want to ask more accurately;
EDIT 2: k-NN is added to the list of "simple" models.