
How does Spark (or something similar) estimate a logistic regression model, or any statistical model that is fit by an iterative optimization algorithm, when the data are stored in a distributed environment such as HDFS?

I have read that each iteration of the optimizer runs as a MapReduce job. How exactly does this work?

Are the solutions approximations? Would I get the same result if I estimated the model on one machine using all the data?
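My rough mental model is sketched below as a toy, single-process simulation. It assumes plain full-batch gradient descent (Spark's actual implementation may use a more sophisticated optimizer such as L-BFGS); the partition layout and data are made up for illustration. Each loop iteration would correspond to one MapReduce job: workers compute partial gradients over their local partitions (map), the partial gradients are summed (reduce), and the driver updates the weights.

```python
# Toy sketch of per-iteration map/reduce gradient descent for logistic
# regression. Assumption: plain full-batch gradient descent; the data and
# partitioning are invented for illustration.
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_gradient(partition, w):
    """'Map' step: each worker computes the gradient over its local rows."""
    g = [0.0] * len(w)
    for x, y in partition:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j in range(len(w)):
            g[j] += (p - y) * x[j]
    return g

def add(g1, g2):
    """'Reduce' step: partial gradients are summed across partitions."""
    return [a + b for a, b in zip(g1, g2)]

# Two made-up "partitions", standing in for HDFS blocks on separate workers.
# Each row is (features, label); the first feature is an intercept term.
partitions = [
    [([1.0, 0.5], 1), ([1.0, -1.0], 0)],
    [([1.0, 2.0], 1), ([1.0, -0.5], 0)],
]

w = [0.0, 0.0]
lr = 0.5
for _ in range(100):  # each loop body corresponds to one MapReduce job
    grad = reduce(add, (partial_gradient(p, w) for p in partitions))
    w = [wi - lr * gi for wi, gi in zip(w, grad)]

print(w)
```

If this picture is right, summing the partial gradients gives exactly the gradient over the full dataset, so each iteration takes the same step a single machine holding all the data would take (up to floating-point summation order), which would answer my third question. But I would like confirmation that this is actually what happens.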

I could not find any useful resources online that address these questions.

Glen
  • Scala implementation is here: https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala – Analyst Jun 18 '18 at 20:19

0 Answers