
Based on the Coursera course on Machine Learning, I implemented batch gradient descent in Python. The progression of $J(\theta)$ decreases as expected (which suggests that my implementation is correct), but the final $\theta$ given by my implementation yields the blue line below, with a cost of ~12, while a more reasonable fit, given by the green line below, has a cost of ~72.

How can this be?


Here is the data I used: https://justpaste.it/ulce

And the cost function: $J(\theta) = \frac{1}{2m}\sum_{i = 1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$

implemented in Python as:

import numpy as np

def costFunction(x, y, theta):
    # J(theta): 1/(2m) times the sum of squared prediction errors
    total = 0  # renamed from `sum` to avoid shadowing the Python builtin
    for i in range(len(x)):
        total += (np.dot(x[i, :], theta) - y[i])**2
    return total / (2 * len(x))
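For reference, the same computation can be vectorized (a sketch, assuming x is the m-by-n design matrix and theta a length-n vector, as above):

def costFunctionVectorized(x, y, theta):
    residuals = x.dot(theta) - y                    # h_theta(x) - y for every sample at once
    return residuals.dot(residuals) / (2 * len(x))  # 1/(2m) * sum of squared residuals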

[Scatter plot of the data with the low-cost blue line and the more reasonable green line]


The data can be accessed here:

library(gsheet)
data <- read.csv(text = gsheet2text(
    'https://docs.google.com/spreadsheets/d/12AwTiqx_IJzhFqp3baMk3LxvYr2gUCA73RUCI5tTXhw/edit?usp=sharing',
    format = 'csv'))
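A Python equivalent (a sketch, assuming the sheet remains publicly shared; the /export?format=csv endpoint is Google Sheets' CSV export, not something from the original post):

import pandas as pd

# Read the sheet's CSV export directly into a DataFrame
url = ('https://docs.google.com/spreadsheets/d/'
       '12AwTiqx_IJzhFqp3baMk3LxvYr2gUCA73RUCI5tTXhw/export?format=csv')
data = pd.read_csv(url)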

UPDATE:

My implementation works well when I don't preprocess the data, so my process of "extrapolating" $\theta$ back to the original scale (when I do preprocess) must be poor. This is what I did:

I transform every sample of the feature (there is only one feature) using min-max scaling:

$X_{\alpha} = \frac{X - X_{min}}{X_{max} - X_{min}}$
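In code, this scaling is a one-liner (a minimal sketch; the endpoints below are the $X_{min}$ and $X_{max}$ reported further down, the middle value is made up):

import numpy as np

x = np.array([5.0269, 10.0, 22.203])            # hypothetical feature values
x_scaled = (x - x.min()) / (x.max() - x.min())  # min-max scale into [0, 1]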

These are then the relevant values for the extrapolation:

$\theta = [5.670, 2.301] = [\theta_{0}, \theta_{1}]$

$X_{min} = 5.0269$

$X_{max} = 22.203$

When I go to plot, I simply plot $(X, h_{\theta}(X_{\alpha}))$:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 30, num=1000)
x1 = (x - 5.0269) / (22.203 - 5.0269)  # apply the same min-max scaling used in training

y = 5.67002243 + 2.301 * x1  # h_theta on the scaled feature
plt.plot(x, y)
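For reference, substituting the scaling formula shows which line this plots on the original scale of $X$:

$h_{\theta}(X_{\alpha}) = \theta_{0} + \theta_{1}\frac{X - X_{min}}{X_{max} - X_{min}} = \left(\theta_{0} - \frac{\theta_{1}X_{min}}{X_{max} - X_{min}}\right) + \frac{\theta_{1}}{X_{max} - X_{min}}X \approx 4.997 + 0.134\,X$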
– Muno

2 Answers


It can if your cost function code has bugs!

Looking at the picture and assuming those are all the points, I am fairly confident that the blue line is indeed not the best-fit line (assuming your cost function $c(\theta, X)$ is some sensible convex function of some sensible notion of the difference between the line and the points).

That your code says the blue line has lower cost than the green line suggests one of the following:

  1. Your graph is wrong?
  2. There are other points you're not graphing? (e.g. some far outlier in the top left corner that gives the green line huge cost?)
  3. Your cost function has bugs.
– Matthew Gunn

I am guessing you are not including the bias term!

Basically, you are restricting your line to pass through the origin: without a bias term, $h_{\theta}(x) = \theta_{1}x$, which is zero at $x = 0$. The blue line is the best you can get while satisfying this condition, but the green line does not satisfy it.

Try adding an all-ones column to your data and running again, as in the sketch below.
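A minimal sketch of that in NumPy (the data values and the name X_bias are made up for illustration):

import numpy as np

X = np.array([[5.0269], [22.203]])              # hypothetical single-feature data
X_bias = np.column_stack((np.ones(len(X)), X))  # prepend a column of ones for the intercept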

(BTW, another possibility is that you have a regularization term; please double-check that your $\lambda$ is 0.0.)
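(For reference, the regularized cost in that course's notation is $\frac{1}{2m}\sum_{i = 1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^{2} + \frac{\lambda}{2m}\sum_{j = 1}^{n}\theta_{j}^{2}$, which reduces to the plain cost above only when $\lambda = 0$.)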

– Haitao Du
  • Presuming the graph is correct and complete, I have a hard time seeing how the blue line is the best fitting line to pass through the origin (if it even does pass through the origin, which it doesn't appear to on the graph, of course presuming correct graphing). – Mark L. Stone May 24 '16 at 15:00
  • @hxd1011 I do include the bias term. I also do preprocess the data before finding appropriate values for $\theta$, but my understanding is that we need not preprocess random features (to test our model) when we graph them. – Muno May 24 '16 at 15:21