1

I need to perform a regression on a data set with a huge noise-to-signal ratio. I am not even sure there is any "signal" in the data at all (it may be pure "noise"), meaning that the distribution of the targets might not depend on my features at all.

So in this case I need to work with very simple models, but I am afraid that even linear regression is too complex for my data. For example, with 20 features I do not believe that I can reliably estimate 20+1 parameters from my data.

To simplify my model further, I search for a linear function that depends on only one of my 20 features. However, I am not sure that this is the best way to go.

So, my question is: is there a class of even simpler (less flexible, less expressive) models that is suitable for fitting very, very noisy data?

My intuition goes as follows. If we have a logical statement like "A and B and C", we can simplify it to "A and B" and then to "A". At this point we might think that we cannot say less than just "A", but we are wrong! We can say even less, namely "A or K", or even less: "A or K or L or M".

Roman
  • What do you mean by searching a "linear function that only depends on one of my 20 features"? Does it mean that you are running linear regressions with one covariate at a time? – philbo_baggins Dec 27 '21 at 15:13
  • @philbo_baggins, it means that I loop over my features and for each feature I search for the best linear function. In the end I choose the one feature that gives the best result, so I end up with a function with 2 parameters (instead of 21 parameters). – Roman Dec 27 '21 at 16:47
  • The simple answer to your question about the "simplest" possible model is yes: and that class demonstrably consists of the constants. What could be simpler? Moreover, almost by definition of "too noisy," these are the only models that would not be over-fit. – whuber Jan 05 '22 at 20:59

4 Answers

4

Use ridge regression. The ridge parameter controls the complexity of the model class: ridge regression is equivalent to fitting a regression model with a constraint on the norm of the weight vector, where the constraint depends on the value of the ridge parameter. If you reduce the ridge parameter slightly, the maximum norm allowed for the weights rises slightly, so the model class can realise any linear function it could previously realise, plus some more linear functions that require a slightly greater norm. Ridge regression therefore provides a nested set of model classes of increasing complexity, indexed by the ridge parameter, which is why it is effective in avoiding over-fitting.
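To make the nested-classes picture concrete, here is a minimal sketch using scikit-learn's `Ridge` on simulated pure-noise data (the library choice and the data are just for illustration): as the ridge parameter shrinks, the norm of the fitted weight vector grows.

```python
# Sketch: the ridge parameter (alpha in scikit-learn) indexes model classes of
# increasing complexity; smaller alpha permits a larger weight-vector norm.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                      # pure-noise target, as in the question

for alpha in [1000.0, 100.0, 10.0, 1.0, 0.1]:
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha = {alpha:>6}: ||w|| = {np.linalg.norm(w):.4f}")
# The printed norms grow as alpha shrinks: each smaller alpha admits every
# linear function the larger alpha did, plus some with a larger weight norm.
```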

I explain the connection between regularisation (as used in ridge regression) and constrained optimisation in my answer to another question.

Dikran Marsupial
  • Could you elaborate how the ridge parameter controls the "complexity"? If by "complexity" you mean the number of nonzero coefficient estimates, then it would seem you want to recommend a Lasso or elastic net, because Ridge regression does not necessarily force any estimate to zero. – whuber Dec 30 '21 at 17:27
  • @whuber, thanks for the suggestion, I have added an explanation based on the norm of the weight vector. – Dikran Marsupial Dec 30 '21 at 17:28
  • I still don't follow what you might mean by "nested set of models:" typically they all wind up being exactly the same model, only with different parameter estimates. – whuber Dec 30 '21 at 17:30
  • @whuber I'm not sure what the correct statistical terminology would be. In computational learning theory I think they would use "hypothesis" and "hypothesis class" for classification tasks. – Dikran Marsupial Dec 30 '21 at 17:33
  • @whuber essentially I am using complexity of a model class to be the size of the set of functions that can be realised by the model class (I've edited the answer to use "model class" rather than simply "model", but I am not sure that is the correct terminology). If one model class can realise a larger number of functions then it must be more complex. L1 regularisation has the advantage of giving a structurally less complex solution, but it still creates a nested set of model classes of increasing complexity even before attributes are pruned. – Dikran Marsupial Dec 30 '21 at 17:51
  • I would be interested in an example of a model that is in the ordinary least squares regression class but not in the ridge regression class (with nonzero ridge parameter, of course). That would help me understand your distinction. – whuber Dec 30 '21 at 18:30
  • @whuber in this case the ridge regression is not a single class but a family of classes indexed by the ridge parameter. For a given value of the ridge parameter there is a class of models that can be implemented whilst observing the corresponding constraint on the norm of the weight vector. When we fit the model to the data, we are finding the model from that class that minimises the unregularised loss. This is a rather different way of looking at complexity from the usual statistical approach, and it probably doesn't help I don't know the statistical terminology that well. – Dikran Marsupial Dec 30 '21 at 20:16
  • I'm using "model" here to mean the model architecture and its parameter values. – Dikran Marsupial Dec 30 '21 at 20:20
  • ordinary least squares regression would be the outermost of the nested set of model classes, and therefore the most complex, as any linear function that can be implemented by a ridge regression model can be implemented by OLS. – Dikran Marsupial Dec 30 '21 at 20:27
  • I don't see that these distinctions are meaningful in most applications, because the ridge parameter must be determined from the data, too. Thus there is *no* limitation on the model at all. – whuber Dec 30 '21 at 20:28
  • The idea of Structural Risk Minimisation is that we should match the complexity of the model class to that required for the difficulty of the learning task, which we do in the case of ridge regression by tuning the ridge parameter, just as we do for the Support Vector Machine by tuning the C hyper-parameter. In practice the procedure amounts to the same thing, but it explains why the ridge parameter is providing capacity/complexity control, and that model complexity is not as simple as just counting parameters. – Dikran Marsupial Dec 30 '21 at 20:32
  • The excitement in ML about the SVM in the late-nineties and early-noughties is somewhat ironic, given that ridge regression is also SRM, and the kernel trick can be applied to ridge regression as easily as it can to the maximal margin classifier. Kernel ridge regression (a.k.a. Least-Squares Support Vector Machine) is one of my favourite regression and classification tools. The two approaches have many similarities (and AFAICS a fair bit of the nice SVM theory is invalidated by tuning the kernel parameters anyway). – Dikran Marsupial Dec 30 '21 at 20:35
  • How would you choose the ridge parameter? Cross validation? Or something else? – frelk Jan 05 '22 at 01:14
  • @frelk Yes, I normally use virtual leave-one-out cross-validation (Allen's PRESS statistic), mostly because it can be evaluated very cheaply in *canonical form* (take an eigen-decomposition and each value of the ridge parameter can be evaluated analytically in only $\mathcal{O}(n)$ operations). – Dikran Marsupial Jan 05 '22 at 09:41
2

You can try to apply feature selection. Linear models are already the simplest functional form you can choose, but techniques like the LASSO, the non-negative garrote and other penalized/restricted estimation methods try to reduce the dimension, keeping only the most "worthy" covariates in the model. Penalized least squares has the following objective function,

$$g(\beta;\lambda) = \|Y-X\beta\|_2^2 + \lambda\|\beta\|_k^k$$ $$\hat{\beta}_{PLS} = \arg\min_{\beta\in \mathbb{R}^p} g(\beta;\lambda)$$

such that the model becomes the LASSO when $k=1$ or ridge regression when $k=2$. There are multiple extensions that apply less common penalties or use a different $\lambda$ for each variable, and there is also a multitude of methods for choosing $\lambda$.

If your data matrix is close to singular, then ridge ($L^2$ penalty) is a good option. If you are interested in feature selection, then the LASSO ($L^1$ penalty) is the way to go. The elastic net uses a convex combination of the two penalties, $\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2$, and gives better estimates in some cases.
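For a minimal sketch of the three penalties just mentioned (scikit-learn on simulated data with one weak signal feature among 20; purely illustrative), compare how many coefficients each penalty leaves non-zero:

```python
# Sketch: ridge shrinks all coefficients but keeps them non-zero, while the
# L1-based penalties (LASSO, elastic net) set most of them exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=n)    # weak signal, heavy noise

models = {
    "ridge (L2)":  Ridge(alpha=10.0),
    "LASSO (L1)":  Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    coef = model.fit(X, y).coef_
    print(f"{name:12s}: {int(np.sum(np.abs(coef) > 1e-8))} non-zero coefficients")
```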

Another option similar to the LASSO is the Dantzig selector; this article by Terence Tao gives a pretty good explanation:

https://terrytao.wordpress.com/2008/03/22/the-dantzig-selector-statistical-estimation-when-p-is-much-larger-than-n/

EDIT

As I said in the comments, there is a duality between penalties and constraints on the parameter space of the form $\|\beta\|_k \leq C$. Here is an image illustrating the $L^p$ balls for various values of $p$:

[Image: unit $L^p$ balls for various values of $p$]

Shrinkage estimation is a huge research area and there are lots of results showing that penalized estimators often have smaller risk (expected loss) than the usual procedures, especially in high-$p$, low-$n$ situations. Just out of curiosity, take a look at the James-Stein estimator.

If you want a model with only one covariate, as you mentioned, you may raise $\lambda$ incrementally until only one $\beta_j\neq 0$ remains.
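As a minimal sketch of that idea (scikit-learn's `lasso_path` on simulated data; note that scikit-learn calls the penalty parameter `alpha` rather than $\lambda$), follow the regularisation path from large to small penalty and stop at the largest value that leaves exactly one non-zero coefficient:

```python
# Sketch: raise the LASSO penalty until exactly one coefficient survives.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=200)

alphas, coefs, _ = lasso_path(X, y, n_alphas=200)    # coefs: (n_features, n_alphas)
n_nonzero = (np.abs(coefs) > 1e-8).sum(axis=0)

single = np.where(n_nonzero == 1)[0]                 # alphas come in decreasing order
if single.size:
    i = single[0]                                    # largest penalty with one survivor
    j = int(np.argmax(np.abs(coefs[:, i])))
    print(f"alpha = {alphas[i]:.4f} keeps only feature {j}")
```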

Another valid methodology is to apply some kind of screening based on correlations, t-tests or other association measures that rank the variables by relevance before estimation.
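A minimal sketch of such screening (plain correlation ranking on simulated data; purely illustrative):

```python
# Sketch: rank the features by the absolute correlation of each with the target.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=200)

corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(-np.abs(corr))                  # most relevant feature first
print("features ranked by |correlation|:", ranking[:5], "...")
```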

BelwarDissengulp
  • +1 but for high noise situations where none of the features may be relevant I'd just use ridge regression as it is likely to be more stable, especially if the size of the training data were small. – Dikran Marsupial Dec 30 '21 at 17:38
1

No overfitting

I search for a linear function that depends on only one of my 20 features

There is not much overfitting when you search for only a linear function

$$y_i = a + b x_{i}$$

Overfitting is not a problem in this case because you do not have hyperparameters that you can tune.

Your problem is more straightforward and only about picking the best feature.

The question of overfitting, and the bias-variance tradeoff, is not very relevant here, because there is no bias you can introduce to make the model less flexible and less variable (except through techniques such as shrinkage or Bayesian modelling, which could constrain even a one-dimensional model).


Overfitting would be more worrisome if you were fitting more complex functions of your features (e.g. polynomial models). In that case the hyperparameter would be the order of the polynomial, which you can change.

Or, overfitting could occur if you allowed the use of multiple features, in which case the hyperparameter is the number of features that you include in the model (but your case seems to be restricted to a single feature).


So in your case,

You can simply perform the fitting repeatedly with different features $x_i$ and select whichever is the best-performing feature (for this you need to have some measure of performance, and ideally also have an idea about the distribution in order to estimate the significance, or use some part of the data to estimate the significance).

Whether this is a good approach is a different question, but if this is the constraint of your question (find a linear function of one single feature), then there is not much better you can do than simply selecting the feature that fits your data best.
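A minimal sketch of that selection loop (scikit-learn on simulated data, with cross-validated mean squared error as the performance measure and the mean-only model as a baseline; purely illustrative):

```python
# Sketch: fit a one-feature linear model for each feature, pick the one with the
# lowest cross-validated MSE, and compare against the mean-only baseline.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 20))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=200)

mse = [-cross_val_score(LinearRegression(), X[:, [j]], y,
                        scoring="neg_mean_squared_error", cv=5).mean()
       for j in range(X.shape[1])]

best = int(np.argmin(mse))
baseline = np.var(y)                                 # roughly the mean-only MSE
print(f"best single feature: {best}, CV MSE = {mse[best]:.3f}, "
      f"mean-only MSE ~ {baseline:.3f}")
# If the best single-feature MSE is not clearly below the mean-only baseline,
# the constant model is probably the safer choice.
```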

Sextus Empiricus
  • I have so much noise that even a linear function of one feature could be an overfit for me. I am not even sure that there is any dependency of the target on the features. I thought that maybe an ensemble of simple models is less prone to overfitting... – Roman Jan 03 '22 at 15:05
  • @Roman What are you using it for? You cannot make the model much simpler. If you step back from one feature to zero features, then you are left with only an average. You could possibly use some shrinkage estimator for the case of a single feature, but what is the point? Are you making some predictions, and how costly is it if you make a mistake? – Sextus Empiricus Jan 03 '22 at 15:24
  • In addition, you might have a lot of noise, but you could possibly still predict an average very well. The amount of noise does not necessarily mean that your estimate of a trend will be very wrong. You can still detect a significant effect without over-fitting; it is just that the effect is small relative to the noise. – Sextus Empiricus Jan 03 '22 at 15:27
0

I will assume that you want to avoid overfitting so that your model is as predictive as possible on new data, rather than being overfit to the training data. Rather than constraining the class of models you are willing to consider at the outset, you may want to fit multiple models and then choose the one that "performs the best" on out-of-sample data. You will need some loss function, such as mean squared error, to assess what "performs the best" out of sample. You can estimate how well your model performs on new data by splitting the data into training and test sets, by using cross-validation, or by related methods. Candidate models might include those already mentioned, such as ordinary regression, LASSO regression, ridge regression, elastic net etc., but you could also try others. Software in R, such as the glmnet package, makes it relatively easy to do a lot of the above.
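A minimal sketch of such a comparison (scikit-learn used here in place of the R glmnet package mentioned above; simulated data, held-out test MSE as the loss):

```python
# Sketch: fit several candidate models on a training split and compare their
# mean squared error on a held-out test split.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 20))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "OLS":         LinearRegression(),
    "LASSO":       LassoCV(cv=5),
    "ridge":       RidgeCV(alphas=np.logspace(-3, 3, 25)),
    "elastic net": ElasticNetCV(cv=5),
}
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:12s} test MSE: {mean_squared_error(y_te, pred):.3f}")
# As the comment below notes, picking the single best model this way can itself
# over-fit the held-out data; model averaging is one alternative.
```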

frelk
  • I don't think this is a good idea as it is easy to over-fit the validation data used to choose which model to use. If you are going to use multiple models, I'd use model averaging (ensemble methods) rather than selecting the best model, especially in high noise-signal situations. That way you can also use the "out-of-sample" data for training the models. – Dikran Marsupial Dec 30 '21 at 17:56