I am enrolled in a machine learning course where we have a lab to implement linear regression. I am attempting to do it in R to get a better understanding of both the material and of R (I don't intend to submit this as a lab, since the course doesn't use R), but I am coming up against a wall.
My understanding of the process is as follows:
Generate a model based on the hypothesis $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots$
Take the error rate of your model using the squared error cost function, then iterate: create a new hypothesis and get its error rate. Continue through $n$ iterations based on the formula $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$.
Take all the error rates you have recorded (the cost history) and use gradient descent to automatically find the optimal values for your hypothesis.
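
Writing out the update rule that (I think) the loop below is meant to implement, based on the cost function above, in vectorised form:

$$\theta := \theta - \alpha\,\nabla J(\theta), \qquad \nabla J(\theta) = \frac{1}{m}X^T\left(X\theta - y\right)$$

which element-wise is $\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$.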
I am using the code from R-Bloggers, where gradient descent is implemented as below based on the vectors x and y:
# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# gradient descent
for (i in 1:num_iters) {
  error <- (X %*% theta - y)             # residual h_theta(x) - y for every observation
  delta <- (t(X) %*% error) / length(y)  # gradient of the cost with respect to theta
  theta <- theta - alpha * delta         # step downhill by the learning rate alpha
  cost_history[i] <- cost(X, y, theta)   # record the cost at this iteration
  theta_history[[i]] <- theta            # record the coefficients at this iteration
}
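
To have something end-to-end to poke at, here is the fuller script I am running. Everything outside the loop (the simulated x and y, alpha, num_iters, the cost() function and the history containers) is my own guess at what the R-Bloggers post defines, so treat those names and values as my assumptions rather than the original code:

# simulated data: y is roughly linear in x (my own toy example, not from the post)
set.seed(42)
x <- runif(100, -5, 5)
y <- 2 + 3 * x + rnorm(100)

# squared error cost function, matching the J(theta) formula above
cost <- function(X, y, theta) {
  sum((X %*% theta - y)^2) / (2 * length(y))
}

# gradient descent settings (values assumed)
alpha <- 0.01      # learning rate
num_iters <- 1000  # number of iterations

# containers to record the cost and coefficients at every iteration
cost_history <- double(num_iters)
theta_history <- list()

# start both coefficients at zero
theta <- matrix(c(0, 0), nrow = 2)

# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# gradient descent
for (i in 1:num_iters) {
  error <- (X %*% theta - y)
  delta <- (t(X) %*% error) / length(y)
  theta <- theta - alpha * delta
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

# sanity check: compare the final theta against R's built-in least squares fit
print(theta)
print(coef(lm(y ~ x)))

The print() comparison at the end is just my sanity check that the loop converges to the same coefficients as lm().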
I was wondering if people could help me tease out the logic:

Why is the number 1 bound onto the matrix X? Is this so that X has two columns, so that it can be used in X %*% theta - y?
What is the formula for delta actually calculating, and why is the transpose of X being used?
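
What leads me to that guess is checking the dimensions on a small made-up example (again my own toy values, not from the post):

x <- c(1, 2, 3)
y <- c(2, 4, 6)
theta <- matrix(c(0, 0), nrow = 2)

X <- cbind(1, matrix(x))          # 3 x 2: a column of 1's next to the x values
dim(X %*% theta)                  # 3 x 1, so (X %*% theta - y) is one residual per observation
dim(t(X) %*% (X %*% theta - y))   # 2 x 1, one number per coefficient in theta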
Conceptually I think I understand the overall process, but I need to relate it back to the R code, as I want to grasp the concept before proceeding to multiple linear regression.