Imagine we have 10 input features/predictors, a large sample size, and the following two scenarios:
Scenario 1: the label/dependent variable is binomial (a binary classification problem).
Scenario 2: the dependent variable is continuous (regression).
Now, the criterion for choosing the best model is NOT about finding the fit with the fewest parameters among models of comparable performance, so the LRT, AIC, etc. aren't applicable here.
Instead, let's say what we're interested in is:
- Is there any evidence for pairwise or higher-order/complex interactions between features?
- In other words, do we gain anything from going more complex, and if so, what fraction of the performance is due to the added complexity?
To address this, for each scenario one would fit a simple multiple logistic/linear regression (main effects only) vs. a hierarchical model or even something like a neural net (interpretability is not important here; a black box is fine).
What is a suitable approach to compare these and conclude that there is a difference, however small, with some level of certainty about it (for example, a p-value and an effect size due to the inclusion of complex interactions)? A rough sketch of the comparison I have in mind is below.
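To make this concrete, here is a minimal sketch for scenario 1 (assuming scikit-learn; the synthetic data, the MLP as the "complex" model, AUC as the metric, and all hyperparameters are just illustrative placeholders, not my actual setup):

```python
# Compare a main-effects-only logistic regression with a model that can
# represent interactions/non-linearities, on the same CV folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: 10 features, large-ish sample.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Simple model: main effects only (no interactions).
simple = LogisticRegression(max_iter=1000)
# Complex model: can pick up interactions if they exist.
complex_model = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                              random_state=0)

# cv=10 uses the same deterministic stratified folds for both models,
# so the per-fold scores are paired.
x_scores = cross_val_score(simple, X, y, cv=10, scoring="roc_auc")
y_scores = cross_val_score(complex_model, X, y, cv=10, scoring="roc_auc")
print("simple:", x_scores.mean(), "complex:", y_scores.mean())
```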
UPDATE 1:
Assume we perform cross-validation, etc., to avoid overfitting.
Also, consider the context: people believe that the univariate effects of the features dominate the prediction, and I'm claiming that this is NOT true, or at least not that simple.
The actual data set is much bigger in terms of the number of features.
All I'm saying is that I want to compare a simple linear model with a complex non-linear model and show that there is a difference, and hence that there are interactions, etc. Finding those interactions is another question, for later.
If the simple model achieves a prediction performance of, say, $X$ (on some performance metric) and the more complex model achieves $Y$, is it reasonable to claim that a fraction $(Y-X)/Y$ of the performance comes from the complex interactions between features, since the more complex model includes the simpler one (like a logistic regression vs. a neural net)? If so, what is a correct way of quantifying that with uncertainty? A sketch of one possibility is below.
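One way I could imagine putting uncertainty on $(Y-X)/Y$ is to fit both models once on a training split and then bootstrap the held-out test set (again assuming scikit-learn; same placeholder data and models as above, and the choice of AUC and of bootstrapping the test cases are my assumptions):

```python
# Bootstrap the test set to get a confidence interval for the
# relative performance gain (Y - X) / Y of the complex model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

simple = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
complex_model = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                              random_state=0).fit(X_tr, y_tr)

p_simple = simple.predict_proba(X_te)[:, 1]
p_complex = complex_model.predict_proba(X_te)[:, 1]

rng = np.random.default_rng(0)
gains = []
for _ in range(2000):
    idx = rng.integers(0, len(y_te), len(y_te))  # resample test cases
    if len(np.unique(y_te[idx])) < 2:            # AUC needs both classes
        continue
    x_perf = roc_auc_score(y_te[idx], p_simple[idx])
    y_perf = roc_auc_score(y_te[idx], p_complex[idx])
    gains.append((y_perf - x_perf) / y_perf)

lo, hi = np.percentile(gains, [2.5, 97.5])
print(f"relative gain: {np.mean(gains):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```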
UPDATE 2:
Label Permutation: Based on @Kodiologist's answer that the complex model will always be better: what if I frame this as a permutation test (permuting the labels)? The permuted data would give a good null distribution, because the complex model is still the better one in that setting, so I can compare the difference in performance on the actual/real data with the distribution of differences generated by the label permutations. A rough sketch follows.
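Here is roughly what I mean (same scikit-learn placeholders as above; the number of permutations, CV folds, and the one-sided p-value formula are my assumptions, and refitting both models per permutation is slow):

```python
# Permutation test: permute y, refit both models each time, and use the
# resulting performance gaps as a null distribution for the observed gap.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def perf_gap(X, y):
    """Mean CV AUC of the complex model minus the simple model."""
    simple = LogisticRegression(max_iter=1000)
    complex_model = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                                  random_state=0)
    x_perf = cross_val_score(simple, X, y, cv=5, scoring="roc_auc").mean()
    y_perf = cross_val_score(complex_model, X, y, cv=5, scoring="roc_auc").mean()
    return y_perf - x_perf

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
observed = perf_gap(X, y)

rng = np.random.default_rng(0)
null_gaps = [perf_gap(X, rng.permutation(y)) for _ in range(200)]

# One-sided p-value: how often a permuted-label gap is at least as large
# as the observed gap (with +1 smoothing).
p = (1 + sum(g >= observed for g in null_gaps)) / (1 + len(null_gaps))
print(f"observed gap = {observed:.3f}, permutation p = {p:.3f}")
```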