
In regularization, we add the squares of the thetas, multiplied by $\lambda$, to the cost function (excluding $\theta_0$), i.e. a penalty term $\lambda \sum_{j=1}^{n} \theta_j^2$. When $\lambda$ is large, the $\theta$ values are driven close to zero, which effectively neglects their associated features. My question is about what happens when we apply gradient descent to find the $\theta$ values that best fit the data: wouldn't the penalty also shrink all of the $\theta$ values, including the ones we don't want to reduce? Since every $\theta$ is added at the end of the cost function, this would seem to produce a nearly flat line and cause underfitting.
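To make the setup concrete, here is a minimal sketch of batch gradient descent with an L2 penalty (ridge-style regularization) that excludes $\theta_0$ from the penalty. The data, the learning rate `alpha`, and the `lam` values are made-up illustrations, not from any particular course:

```python
import numpy as np

# Illustrative data: y is roughly linear in x (data and hyperparameters
# below are assumptions chosen for demonstration only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # bias column + one feature
y = 3.0 + 2.0 * X[:, 1] + rng.normal(0, 1, 50)

def cost(theta, X, y, lam):
    """Squared-error cost with an L2 penalty that skips theta_0."""
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def gradient_descent(X, y, lam, alpha=0.01, iters=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]  # penalty gradient excludes theta_0
        theta -= alpha * grad
    return theta

for lam in [0.0, 1.0, 1000.0]:
    theta = gradient_descent(X, y, lam)
    print(f"lambda={lam:>7}: theta={np.round(theta, 3)}, cost={cost(theta, X, y, lam):.3f}")
```

Running the sketch with increasing `lam` pulls the slope $\theta_1$ toward zero while $\theta_0$ is untouched by the penalty, which is exactly the shrinkage (and, for very large $\lambda$, the underfitting) the question is asking about.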

  • Does this answer your question? [Why does shrinkage work?](https://stats.stackexchange.com/questions/179864/why-does-shrinkage-work) – Arya McCarthy Apr 03 '21 at 15:12

0 Answers