
When fitting a polynomial to data, whether for prediction with linear regression or classification with logistic regression, I know how to find the best degree of the polynomial when the regularization coefficient is fixed. I also know how to find the best regularization coefficient when the degree of the polynomial is fixed.

What I want to know is how to find the best model when neither of these parameters is known.

  • Should I first find the best degree without regularization, and then the regularization parameter?
  • Should I, for every degree, train with every possible regularization parameter value (assuming it comes from a discrete set), and then pick the degree/regularization combination that performs best on the validation set?

Or is there a better way to find these hyperparameters?


2 Answers


I would treat this as a standard cross-validation task where we optimise both the degree of the polynomial $k$ and the associated regularisation parameter $\lambda$ at the same time. Hyper-parameter optimisation procedures do this routinely.
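To make this concrete, here is a minimal sketch of such a joint search (assuming scikit-learn and synthetic data of my own making; none of this comes from the question itself):

```python
# Minimal sketch: jointly cross-validate the polynomial degree k and the
# ridge penalty lambda (called alpha in scikit-learn) over a single grid.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)

pipe = Pipeline([
    ("poly", PolynomialFeatures()),   # expands x into [1, x, x^2, ..., x^k]
    ("scale", StandardScaler()),      # keeps the penalty comparable across degrees
    ("ridge", Ridge()),               # L2-regularised linear regression
])
param_grid = {
    "poly__degree": list(range(1, 11)),
    "ridge__alpha": np.logspace(-4, 2, 13),
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)            # the jointly chosen (k, lambda)
```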

An immediate example is optimising an SVM, where the degree of the polynomial kernel $k$ is optimised alongside the regularisation parameter $\lambda$. There is a rather large literature on the matter; Wainer & Cawley (2017), Empirical evaluation of resampling procedures for optimising SVM hyperparameters, is a relatively concise recent work that I found very readable. Chapelle et al. (2002), Choosing Multiple Parameters for Support Vector Machines, offers a more formal treatment if you want to explore this further. (Note that the regularisation is sometimes parameterised by $C$, the inverse of $\lambda$.)
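For illustration, the SVM analogue might look like the following sketch (again assuming scikit-learn, where the penalty is exposed as $C$):

```python
# Sketch of the SVM case: the polynomial-kernel degree and the (inverse)
# regularisation parameter C are tuned together in one grid search.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
param_grid = {
    "degree": [2, 3, 4, 5],          # degree of the polynomial kernel
    "C": [0.01, 0.1, 1, 10, 100],    # inverse regularisation strength
}
search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```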

Regarding the parameter search routine: aside from standard grid search, it is probably worth looking into the Bayesian optimisation approach. CV.SE has a great thread on the matter, Optimization when Cost Function Slow to Evaluate, where the main mechanics of Bayesian optimisation are presented. In particular, for the case here we would effectively fit a two-dimensional Gaussian process over the parameters $\lambda$ and $k$.
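As a sketch of what that could look like in code (assuming the scikit-optimize package; any Bayesian optimisation library that handles mixed real/integer search spaces would do):

```python
# Sketch: Bayesian optimisation with a GP surrogate over (log10 lambda, k),
# using cross-validated MSE as the objective to minimise.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real, Integer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)

def cv_error(params):
    log_lambda, k = params
    model = make_pipeline(PolynomialFeatures(degree=int(k)),
                          Ridge(alpha=10.0 ** log_lambda))
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

space = [Real(-4, 2, name="log_lambda"), Integer(1, 10, name="k")]
result = gp_minimize(cv_error, space, n_calls=40, random_state=0)
print(result.x)   # [best log10(lambda), best k]
```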

Two final points:

  1. $k$ is discrete. A quick-and-dirty solution is to just "round/floor/ceil" the associated estimate. That works, but it can occasionally be misleading. There is some very recent work on the subject (e.g. Garrido-Merchán & Hernández-Lobato (2020), Dealing with Categorical and Integer-valued Variables in Bayesian Optimization with Gaussian Processes; Luong et al. (2019), Bayesian Optimization with Discrete Variables).
  2. The final estimate from Bayesian optimisation, or from any other hyper-parameter grid/random search procedure, will probably not be the MLE of that linear model. That is not the end of the world, but if we want to use follow-up statistical procedures that assume an MLE, it would be reasonable to make one final optimisation step with $k$ fixed, optimising for $\lambda$ only; a sketch follows this list.
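A sketch of that final step, under the same assumptions as above (`best_k` is a placeholder for whatever degree the outer search returned):

```python
# Sketch: freeze the selected degree and refine lambda alone on a fine grid.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)

best_k = 4  # hypothetical: the degree returned by the outer search
model = make_pipeline(
    PolynomialFeatures(degree=best_k),
    RidgeCV(alphas=np.logspace(-4, 2, 100)),  # one-dimensional search over lambda
)
model.fit(X, y)
print(model.named_steps["ridgecv"].alpha_)    # the refined lambda
```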

Since you are using regularization for feature selection, I guess I would find the best regularization parameter for each order and then select the model with the smallest validation error. However, I think that if you start with a high-order polynomial and choose the best regularization parameter for that specific order, you can already get some insight.
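A minimal sketch of that per-order procedure (assuming scikit-learn and made-up data; swap in LogisticRegressionCV for the classification case):

```python
# Sketch: for each polynomial order, tune the regularisation parameter by
# cross-validation, then keep the (order, lambda) pair with the best score.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)

results = {}
for k in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree=k), Ridge())
    search = GridSearchCV(model, {"ridge__alpha": np.logspace(-4, 2, 13)},
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    results[k] = (search.best_score_, search.best_params_["ridge__alpha"])

best_k = max(results, key=lambda k: results[k][0])  # highest CV score wins
print(best_k, results[best_k])
```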
