
In multiple regression problems, the decision variable, the coefficient vector $\beta$, can be regularized by its squared L2 (Euclidean) norm, shown below as the second term of the least-squares objective. This type of regularization (ridge regression) reduces overfitting: it introduces bias in the coefficients in exchange for lower variance in the fitted model, and it also mitigates multicollinearity.

$$\hat{\beta} = \arg\min_\beta \enspace \frac{1}{N}\|y - X \beta \|_2^2 + \lambda \| \beta \|_{2}^2$$
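For concreteness, the ridge objective above has a closed-form minimizer, $(X^\top X / N + \lambda I)^{-1} X^\top y / N$. A minimal sketch (synthetic data and variable names are my own, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=N)

def ridge(X, y, lam):
    # Minimizer of (1/N) * ||y - X b||^2 + lam * ||b||^2,
    # obtained by setting the gradient to zero:
    # (X'X/N + lam * I) b = X'y / N
    N, p = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(p), X.T @ y / N)

b_ols   = ridge(X, y, 0.0)  # lam = 0 reduces to ordinary least squares
b_ridge = ridge(X, y, 1.0)  # positive lam shrinks the coefficients
```

With `lam = 0` this reproduces the OLS solution; increasing `lam` shrinks $\|\hat\beta\|_2$ toward zero, which is the bias-for-variance trade mentioned above.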

What if, instead of norm regularization, we were to use entropy regularization of the coefficients, given that entropy by itself is usually $H(X) = -\sum_{i=1}^N p(x_i) \ln{p(x_i)}$? Here the betas play the role of the probabilities, so they are additionally constrained to be non-negative and sum to 1.

$$\hat{\beta} = \arg\min_\beta \enspace \frac{1}{N}\|y - X \beta \|_2^2 \pm \lambda \sum_{j=1}^p \beta_j \ln \beta_j$$

I use $\pm$ for the last term because I don't know yet whether applications would perceive maximization or minimization of entropy to be useful.
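The entropy-regularized problem above no longer has a closed form because of the simplex constraint, but it can be handed to a generic constrained solver. A sketch of one way to do this (the data, function names, and choice of SLSQP are my own assumptions, not from any reference):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, p = 200, 4
X = rng.normal(size=(N, p))
y = X @ np.array([0.7, 0.1, 0.1, 0.1]) + 0.05 * rng.normal(size=N)

def entropy_regularized_fit(lam, sign=+1.0, eps=1e-12):
    """Minimize (1/N)||y - X b||^2 + sign * lam * sum(b_j ln b_j)
    over the simplex {b : b >= 0, sum(b) = 1}.

    sign=+1 penalizes sum(b ln b), i.e. rewards HIGH entropy
    (pushes b toward uniform weights 1/p); sign=-1 rewards
    LOW entropy (concentrated weights)."""
    def obj(b):
        resid = y - X @ b
        # eps guards log(0) at the boundary of the simplex
        return resid @ resid / N + sign * lam * np.sum(b * np.log(b + eps))
    cons = ({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},)
    bnds = [(0.0, 1.0)] * p
    b0 = np.full(p, 1.0 / p)  # start at the uniform (max-entropy) point
    res = minimize(obj, b0, method='SLSQP', bounds=bnds, constraints=cons)
    return res.x

b_fit  = entropy_regularized_fit(0.0)   # pure data fit on the simplex
b_unif = entropy_regularized_fit(10.0)  # strong entropy reward: near-uniform
```

Note the sign convention: since $\sum_j \beta_j \ln \beta_j = -H(\beta)$, adding the term with $+\lambda$ maximizes entropy (uniform, diversified weights), while $-\lambda$ minimizes it (sparse, concentrated weights), which is one way the $\pm$ ambiguity plays out numerically.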

  • What are the effects and numerical properties of entropy regularization as opposed to norm regularization?
  • Are there any research models that actually apply entropy regularization to non-classification problems? (I have found it used for classification but not regression)
  • What happens when both types of regularization are used, for a total of 3 terms (instead of 2) in the objective function?
develarist
  • As you've written it, this doesn't make sense because $\beta$ is not a distribution; you can't compute its entropy. – shimao Aug 26 '20 at 13:08
  • That's the initial reaction, but there are several papers that treat the coefficients/weights the way probabilities would be treated, as long as the application also calls on them to be non-negative and sum to 1 like probabilities. I could link some articles I found by searching "entropy regularization" if you like, though some need Shibboleth access. An equally-weighted $\beta$ filled with $1/p$ entries, for example, would correspond to a uniform distribution. – develarist Aug 26 '20 at 13:15

0 Answers