
I am running the example found here.

The training data for the model can be found in this CSV, where the first column is the response variable and the second column is a space separated list of predictors.

After running the example with the modified parameters numIterations = 1000 and stepSize = 1, the result it gives is:

intercept = 0.0
weights   = [0.5047290722337779, 0.1920796185138009, 0.3055517342539938, 0.10461255324752054, 0.4918056810542394, -0.21803087595354512, -0.059099512695448504, 0.0]
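For reference, this is roughly how the example is being invoked (a minimal sketch: `sc` is the usual spark-shell SparkContext, `data.csv` is a placeholder path, and the parsing mirrors the CSV layout described above):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Each line: "<response>,<space-separated predictors>"
val parsedData = sc.textFile("data.csv").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Modified parameters from the linked example.
val numIterations = 1000
val stepSize = 1.0
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

println(s"intercept = ${model.intercept}")
println(s"weights   = ${model.weights}")
```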

The trained model is very poor. Furthermore, it differs from the exact solution of the least-squares problem given by $(X^T X)^{-1} X^T y$. The actual solution using this formula is:

intercept = 2.46493292212375
weights   = [0.679528141237974, 0.263053065732544, -0.141464833536172, 0.210146557221827, 0.305200597125096, -0.288492772453545, -0.021305038802947, 0.266955762119923]
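That exact solution can be checked directly with a small normal-equations computation, for example using Breeze (a sketch; `x` is assumed to already carry a leading column of ones for the intercept):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Closed-form OLS via the normal equations.  x is the n x (p + 1) design
// matrix with a leading column of ones (for the intercept); y has length n.
def normalEquations(x: DenseMatrix[Double], y: DenseVector[Double]): DenseVector[Double] =
  (x.t * x) \ (x.t * y)   // solves (X^T X) w = X^T y without forming an explicit inverse
```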

Nothing I've seen in the documentation of LinearRegressionWithSGD says the model is not an OLS model, so I assume it is. The documentation here says the model minimizes the MSE.

The fitted model has a zero intercept and the weight on the last feature is exactly zero, which makes me think Spark is internally normalizing the data, training the model on the normalized data, and returning that model without transforming it back to the original scale. Does anyone have any experience with this?

I have tried comparing this result to those provided by `lm` in R and by the linear regression in Spark's ML library. Both of them return the exact solution given above (a sketch of the spark.ml comparison follows the R output below).

> lm(y ~ x, data)

Call:
lm(formula = y ~ x, data = data)

Coefficients:
(Intercept)          xV2          xV3          xV4          xV5          xV6          xV7          xV8          xV9  
    2.46493      0.67953      0.26305     -0.14146      0.21015      0.30520     -0.28849     -0.02131      0.26696

If I force the R fit to also omit the intercept term, the result is still different:

Call:
lm(formula = y ~ x - 1, data = data)

Coefficients:
    xV2      xV3      xV4      xV5      xV6      xV7      xV8      xV9  
 0.5999   0.1859   0.2808   0.1108   0.4003  -0.5932  -0.6133   0.9169 
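For the spark.ml comparison mentioned above, this is roughly the kind of call involved (a sketch rather than the exact code: the DataFrame `df` with `label` and `features` columns is assumed to be built from the same CSV, and regularization is set to zero so the fit is plain least squares):

```scala
import org.apache.spark.ml.regression.LinearRegression

// df is assumed to be a DataFrame with "label" and "features" columns
// built from the same CSV used above.
val lr = new LinearRegression()
  .setMaxIter(1000)
  .setRegParam(0.0)          // no regularization, i.e. ordinary least squares
  .setFitIntercept(true)

val lrModel = lr.fit(df)
println(s"intercept = ${lrModel.intercept}")
println(s"weights   = ${lrModel.coefficients}")
```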
Jon Claus
    The solution provided by $(X^T X)^{-1} X^T y$ is not always numerically stable. I believe most statistical software packages use a Cholesky decomposition. I recommend running a regression in R or SPSS and comparing those results with what Spark provides. – Jon Feb 01 '17 at 18:33
  • Running `lm` on the raw data in R produces the latter result. I'll edit my post to include more details. – Jon Claus Feb 01 '17 at 19:23
  • Could you include the results from R as comparison? – Jon Feb 01 '17 at 19:27
  • Can you please update the R output to include the model described by hxd1011, `lm(formula = y ~ x - 1, data = data)`? Otherwise, your code compares two different models. – Jon Feb 01 '17 at 19:57
  • I've added that as well. – Jon Claus Feb 01 '17 at 20:00
    @Jon FYI R's `lm` uses QR decomposition by default. – Sycorax Feb 01 '17 at 21:40
  • I'm aware that they use different methods, but they should achieve the same result. The loss function that SGD minimizes is convex, so any local minimum it converges to is also a global minimum. This means that, at the very least, the SGD result should be as good as the QR decomposition result (goodness being measured by MSE), but it is not. – Jon Claus Feb 01 '17 at 21:53
  • @Sycorax, yeah I know. I couldn't edit my comment; hxd1011 also pointed that out in his answer. – Jon Feb 01 '17 at 22:18
  • Did you find any solution for this issue? I have the same problem. – sahil desai Oct 18 '19 at 09:11

1 Answer

  1. In your Spark results the intercept is $0$. Are you forcing the model to pass through the origin, i.e. not adding a column of all $1$s to your data? Try `lm(formula = y ~ x - 1, data = data)` in R to see what coefficients you get. (A sketch of how to request an intercept from `LinearRegressionWithSGD` follows this list.)

  2. One is using QR decomposition (R) and the other is using stochastic gradient descent (Spark), so we may expect some differences.
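If sticking with the RDD-based MLlib API, the static `train()` helpers do not add an intercept term; as far as I know you have to configure the algorithm object yourself. A sketch, reusing the parameters from the question:

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val alg = new LinearRegressionWithSGD()
alg.setIntercept(true)          // fit an intercept instead of forcing the fit through the origin
alg.optimizer
  .setNumIterations(1000)
  .setStepSize(1.0)

// parsedData is the RDD[LabeledPoint] from the question.
val model = alg.run(parsedData)
```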

Haitao Du
  • Removing the intercept in R still yields different results. Although they are using different methods, both algorithms should converge to the same solution. – Jon Claus Feb 01 '17 at 19:32
  • @Jon Claus I do not agree that different algorithms will converge to the same solution, and we cannot be sure that the SGD has converged. – Haitao Du Feb 01 '17 at 19:44
  • Changing the number of iterations for the SGD to 10000 does not noticeably change any of the coefficients. While that does not prove that the SGD has converged, it does suggest it. – Jon Claus Feb 01 '17 at 19:56
  • @Jon Claus After reviewing the Spark code and documentation, I would have to agree with hxd1011. If you want coefficients that match QR methods, you may want to compare results from http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression and not SGD. – Jon Feb 01 '17 at 20:01
  • I'm aware of the Elastic Net implementation in the ML library. I was hoping to avoid that and use MLlib, as I've been told operating on RDDs is faster. SGD converges to a local minimum of the error function, but isn't it also guaranteed that the local minimum is the global minimum? – Jon Claus Feb 01 '17 at 20:07
    @JonClaus OLS is (strictly) convex but if your SGD step size is too large, you can plausibly just bounce around the minimum rather than actually achieve it. – Sycorax Feb 01 '17 at 21:24
  • @JonClaus What is the condition number of X? If your problem is underdetermined, iterative numerical methods like SGD (or even trying to calculate the exact solution via the inverse) could yield a different solution from QR. Otherwise, it would only make sense that the two have similar answers. – Mustafa Eisa Feb 02 '17 at 09:00