How does Spark (or a similar framework) estimate a logistic regression model, or any statistical model fitted by an iterative optimization algorithm, when the data are stored in a distributed environment such as HDFS?
I have read/heard that each iteration of the optimizer runs as a MapReduce job. How exactly does that work?
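To make the question concrete, here is a minimal sketch of what I *imagine* one such iteration looks like in PySpark, using plain gradient descent on synthetic data. The data, dimensions, step size, and iteration count are all assumptions on my part, and this is my mental model rather than what Spark's MLlib actually does internally:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="LogRegGradientSketch")

D = 5       # number of features (assumption)
N = 10000   # number of examples (assumption)
rng = np.random.default_rng(0)

# Synthetic (features, label) pairs with labels in {-1, +1}, held as an RDD;
# in practice these would be loaded from HDFS instead of generated locally.
data = [(rng.normal(size=D), rng.choice([-1.0, 1.0])) for _ in range(N)]
points = sc.parallelize(data).cache()

def point_gradient(point, w):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) for one example."""
    x, y = point
    return (1.0 / (1.0 + np.exp(-y * x.dot(w))) - 1.0) * y * x

w = np.zeros(D)
for _ in range(100):  # fixed iteration count (assumption)
    # Map: each worker computes per-example gradients over its partition.
    # Reduce: the partial sums are combined into one full gradient on the driver.
    grad = points.map(lambda p: point_gradient(p, w)).reduce(lambda a, b: a + b)
    w -= 0.1 * grad / N  # fixed step size (assumption)
```

Is this map-then-reduce pattern, with the updated weight vector broadcast back out each iteration, essentially what happens?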
Is the resulting solution an approximation? Would I get the same estimates if I fitted the model on a single machine using all of the data?
I could not find any useful online resources that address these questions.