
I was watching Andrew Ng's lecture on the difference between parameters and hyperparameters, https://www.youtube.com/watch?v=VTE2KlfoO3Q&ab_channel=Deeplearning.ai, and a question came to me.

Is there really that much of a distinction between a hyperparameter and a parameter?

For example, weights are often regarded as parameters rather than hyperparameters. But a recent paper found that random search over the weights can obtain good results and beat state-of-the-art optimization methods: https://arxiv.org/abs/1803.07055. Is this not the same method used for hyperparameter tuning?
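To make the comparison concrete, here is a minimal sketch (my own toy example, not the method in the paper, which uses augmented random search for RL policies): the loop below random-searches the weights of a linear model, and it is the same sample-evaluate-keep-the-best loop one would write to random-search hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem (made up for illustration).
X = rng.normal(size=(100, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=100)

def loss(w):
    """Mean squared error of a linear model with weights w."""
    return np.mean((X @ w - y) ** 2)

# Random search over the *weights*: sample candidates, evaluate,
# keep the best. Structurally this is the same loop one would
# write to random-search hyperparameters.
best_w, best_loss = None, np.inf
for _ in range(10_000):
    w = rng.normal(size=5)
    l = loss(w)
    if l < best_loss:
        best_w, best_loss = w, l

print(f"best loss found by random search: {best_loss:.4f}")
```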

At the same time, there are papers that tune the learning rate, the optimizer, and other so-called "hyperparameters" associated with a model: https://arxiv.org/abs/1606.04474.

Then there are methods that directly learn the hyperparameters through gradient-based methods: https://arxiv.org/abs/1903.03088.

Another source of inspiration is adaptive control (a huge field, spanning five decades now), where the so-called "hyperparameters" associated with the controller are always learned.

Olórin

1 Answer


That's a great question - I'm not sure of the best way to answer it, but in a statistical framework, I believe the differences are a bit more clearly cut. I'll be curious to see how others answer this from a purer ML/DL perspective.

I think one way in which they differ is that parameters (at least from a statistical standpoint) are something on which you can make inferences, whereas a hyper-parameter is an element of the algorithm that is tuned to optimize it.

For a concrete example, say you are fitting a linear regression model with a LASSO-type penalty. The $\beta$ weights/coefficients are parameters, since one can interpret their estimated values and determine relevance or directionality (i.e., check which coefficients are not 0 in a LASSO procedure, or which "protect against" vs. "increase" risk). Using the same LASSO example, the $\alpha$ weight on the penalty function can be considered a hyper-parameter, since the actual value of $\alpha$ does not provide any insight into the model in a post-hoc analysis.
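As a minimal sketch of this distinction (using scikit-learn's `Lasso`, where the penalty weight happens to be called `alpha`; the dataset and the grid of candidate values are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data, purely illustrative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# alpha is the hyper-parameter: we tune it by cross-validation, and
# its chosen value carries no substantive interpretation on its own.
search = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

# The coefficients are the parameters: we inspect their estimated
# values to see which predictors survive the penalty, and in which
# direction they act.
beta = search.best_estimator_.coef_
print("chosen alpha:", search.best_params_["alpha"])
print("indices of non-zero coefficients:", np.flatnonzero(beta))
```

The cross-validated $\alpha$ is simply whichever grid value minimized held-out error, whereas the estimated $\beta$ values are what we would actually interpret.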

This is a bit of a "statistical" perspective on the difference between a parameter and a hyper-parameter, though it's only one way to differentiate them. With non-parametric algorithms, decision trees, and neural networks, I think there are more gray areas.

  • Even in the "statistical" perspective, [the distinction may not always be clear](https://en.wikipedia.org/wiki/Hyperprior), except in a relative sense. – GeoMatt22 Oct 02 '20 at 04:59
  • Bayesian priors and hyperpriors are a whole other monster I didn't even want to touch on haha. It's a very gray area, but excellent point. +1 – Samir Rachid Zaim Oct 02 '20 at 14:33