If I take a very basic example where my feature matrix X is
$$ \begin{matrix} 1 & 100 & 0.25\\ 1 & 110 & 0.5\\ 1 & 120 & 0.75\\ 1 & 130 & 1\\ 1 & 140 & 1.25\\ \end{matrix} $$
and the expected output vector Y is
$$ \begin{matrix} 201.75\\ 222.5\\ 243.25\\ 264\\ 284.75\\ \end{matrix} $$
Then $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ clearly solves to $[\theta_0, \theta_1, \theta_2] = [1, 2, 3]$. But if I apply feature scaling, I am essentially changing the values of my feature matrix X, so I will get very different values for $[\theta_0, \theta_1, \theta_2]$, and plugging those back into $y = \tilde{\theta}_0 + \tilde{\theta}_1 \tilde{x}_1 + \tilde{\theta}_2 \tilde{x}_2$ does not yield the output vector Y.
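To make the comparison concrete, here is a minimal sketch of what I am describing (using scikit-learn's `LinearRegression` and `StandardScaler` purely for illustration; they are not part of my original setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Feature matrix without the leading column of 1s
# (LinearRegression fits the intercept theta_0 itself)
X = np.array([[100, 0.25],
              [110, 0.50],
              [120, 0.75],
              [130, 1.00],
              [140, 1.25]])
y = np.array([201.75, 222.5, 243.25, 264.0, 284.75])

# Fit on the raw features
raw = LinearRegression().fit(X, y)
print("raw:   ", raw.intercept_, raw.coef_)

# Fit on standardized (zero-mean, unit-variance) features
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)
print("scaled:", scaled.intercept_, scaled.coef_)

# The two fits report very different [theta_0, theta_1, theta_2],
# which is exactly the difference I am asking about.
```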
Now, I know that feature scaling works and that I must be thinking about this the wrong way, so I need someone to point out what is wrong with my very basic understanding above.
And if you are inclined to give the standard "Google it" response, please note that I have already gone through the links below without finding the answer:
How and why do normalization and feature scaling work?
Is it necessary to scale the target value in addition to scaling features for regression analysis?
https://www.internalpointers.com/post/optimize-gradient-descent-algorithm
http://www.johnwittenauer.net/machine-learning-exercises-in-python-part-2/