I am running the example found here.
The training data for the model can be found in this CSV, where the first column is the response variable and the second column is a space-separated list of predictors.
After running the example with the parameters modified to numIterations = 1000 and stepSize = 1, the result it gives is:
intercept = 0.0
weights = [0.5047290722337779, 0.1920796185138009, 0.3055517342539938, 0.10461255324752054, 0.4918056810542394, -0.21803087595354512, -0.059099512695448504, 0.0]
The trained model is very poor. Furthermore, it differs from the exact solution to the least-squares problem given by (X^T X)^(-1) X^T y. The actual solution using this formula is:
intercept = 2.46493292212375
weights = [0.679528141237974, 0.263053065732544, -0.141464833536172, 0.210146557221827, 0.305200597125096, -0.288492772453545, -0.021305038802947, 0.266955762119923]
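For reference, the closed-form solution is easy to check directly with NumPy. This is a sketch on synthetic data standing in for the CSV (the generating coefficients below are illustrative assumptions, not the actual data):

```python
import numpy as np

# Synthetic stand-in for the CSV: y is the response, X holds the 8 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
true_w = np.array([0.68, 0.26, -0.14, 0.21, 0.31, -0.29, -0.02, 0.27])
y = 2.46 + X @ true_w + rng.normal(scale=0.01, size=100)

# Prepend a column of ones so the intercept is estimated along with the weights.
Xb = np.hstack([np.ones((len(y), 1)), X])

# Normal equations: solve (X^T X) beta = X^T y rather than forming the inverse.
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
intercept, weights = beta[0], beta[1:]
```

With an intercept column included, this recovers both the offset and the weights, which is what lm and the exact formula agree on above.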
Nothing I've seen in the documentation of LinearRegressionWithSGD says the model is not an OLS model, so I assume it is. The documentation here says the model minimizes the MSE.
The returned model has a zero intercept, and the last feature's weight is 0.0, as if it had been dropped. This makes me suspect Spark is internally normalizing the data, training the model on the normalized data, and returning that model as-is. Does anyone have any experience with this?
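To illustrate what a missing intercept term does on its own, here is a minimal NumPy sketch. Plain full-batch gradient descent on the MSE stands in for Spark's SGD, and the data are synthetic, so the exact numbers are assumptions; the point is that with no intercept in the model, the optimizer converges to the no-intercept least-squares solution and the offset in the data has nowhere to go:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.5 + X @ np.array([0.7, 0.3, -0.1])  # data generated WITH an intercept

# Full-batch gradient descent on the MSE with no intercept term in the model,
# mimicking an optimizer that never fits a bias.
w = np.zeros(3)
step = 0.1
for _ in range(1000):
    grad = (2.0 / len(y)) * X.T @ (X @ w - y)
    w -= step * grad

# The fit converges, but only to the no-intercept OLS solution: the 2.5
# offset shows up as a constant bias in the residuals.
mean_residual = np.mean(y - X @ w)
```

The weights it lands on differ from the true ones, and the residuals carry the missing mean, which resembles the behaviour I am seeing.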
I have tried comparing this result with those provided by lm in R and by the linear regression in Spark's ML library. Both return the second set of coefficients above.
> lm(y ~ x, data)
Call:
lm(formula = y ~ x, data = data)
Coefficients:
(Intercept) xV2 xV3 xV4 xV5 xV6 xV7 xV8 xV9
2.46493 0.67953 0.26305 -0.14146 0.21015 0.30520 -0.28849 -0.02131 0.26696
If I force the fitting in R to also lack an intercept term, the result is still different:
Call:
lm(formula = y ~ x - 1, data = data)
Coefficients:
xV2 xV3 xV4 xV5 xV6 xV7 xV8 xV9
0.5999 0.1859 0.2808 0.1108 0.4003 -0.5932 -0.6133 0.9169