
I am presented with a data set on which I am supposed to perform linear regression using SGD. My first instinct would be to train on each data point, one at a time, until I reach the last one. Only then would I get my estimate $\hat{y}$.

I understand there are some drawbacks to this:

  1. It will take a long time to finish, since you have to train on every data point.
  2. Convergence of parameters is still not guaranteed.

Thus, the idea of batching comes to mind. For example, I have a set of 100 data points. I have decided to group them into 25 batches (of 4 data points each).

My questions are:

  1. How does this batching work? Do I randomly pick one data point from each batch to train on? Meaning, at the end of the first run, I will have one estimate.
  2. Is it possible that I will have 4 different estimates after having 4 different runs? Should I choose whichever gives the smallest error?
cgo
  • Does this answer your question? [How could stochastic gradient descent save time compared to standard gradient descent?](https://stats.stackexchange.com/questions/232056/how-could-stochastic-gradient-descent-save-time-compared-to-standard-gradient-de) – Haitao Du Nov 11 '21 at 09:17
  • This is not the question. The question is how batching would work. How do you select the items in a batch? Is it a random selection? And what to do if you have different estimates as a result of training: which one would you choose as the final model for inference? – cgo Nov 11 '21 at 13:30

1 Answer


First, let’s clarify terminology: stochastic gradient descent means making an update one sample at a time; if you update on small batches, it’s mini-batch gradient descent; and if you compute each update on all of the data at once, it’s plain (batch) gradient descent.
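To make the distinction concrete, here are the three update rules written side by side (a sketch; $\theta$ denotes the parameters, $\eta$ the learning rate, $L_i$ the loss on data point $i$, and $B$ a batch of indices; none of this notation comes from the question):

$$\theta \leftarrow \theta - \eta\, \nabla_\theta \frac{1}{n}\sum_{i=1}^{n} L_i(\theta) \qquad \text{(gradient descent: all $n$ points per update)}$$

$$\theta \leftarrow \theta - \eta\, \nabla_\theta L_i(\theta) \qquad \text{(SGD: a single point $i$ per update)}$$

$$\theta \leftarrow \theta - \eta\, \nabla_\theta \frac{1}{|B|}\sum_{i \in B} L_i(\theta) \qquad \text{(mini-batch: the points in batch $B$ per update)}$$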

  1. How does this batching work? Do I randomly pick one data point from each batch to train on? Meaning, at the end of the first run, I will have one estimate.

You make one update using all the data points in a batch, in the same way that in full gradient descent you would make one update using all of your data.
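As a rough sketch of the mechanics for the 100-point / 25-batch example, assuming squared-error loss, a fixed learning rate, and synthetic data (all of these are my choices for illustration, not something from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 points from y = 3x + 1 plus noise.
X = rng.uniform(0, 2, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0      # parameters (slope and intercept); starting values are arbitrary
eta = 0.1            # learning rate (my choice for this sketch)
batch_size = 4       # 100 points -> 25 batches of 4, as in the question

for epoch in range(200):
    # Reshuffle once per epoch, then walk through the data in chunks of 4.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]

        # One parameter update uses *all* points in the batch:
        # the gradient of the mean squared error over those 4 points.
        residual = (w * xb + b) - yb
        grad_w = 2.0 * np.mean(residual * xb)
        grad_b = 2.0 * np.mean(residual)
        w -= eta * grad_w
        b -= eta * grad_b

print(w, b)  # should end up close to 3 and 1
```

Note that every point in a batch contributes to its gradient; nothing within a batch is discarded.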

  2. Is it possible that I will have 4 different estimates after having 4 different runs? Should I choose whichever gives the smallest error?

You are splitting the data into batches randomly, so yes, the results may differ between trainings. If you train long enough, they should converge to similar estimates. Moreover, with gradient descent you usually initialize the parameters randomly, so for that reason alone you could get different results, even without batches.
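As a quick illustration of that run-to-run variability, here is a sketch using scikit-learn's `SGDRegressor` with its default settings (which include a small amount of regularization); the data and seeds are made up for this example:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))                     # 100 data points
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=100)   # true slope 3, intercept 1

# Two trainings that differ only in the random seed, which controls
# how the data is shuffled between epochs.
for seed in (1, 2):
    model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=seed)
    model.fit(X, y)
    print(seed, model.coef_, model.intercept_)

# The two runs print slightly different coefficients, but both end up
# close to the true slope (3.0) and intercept (1.0).
```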

Tim