I am presented with a data set on which I am supposed to perform linear regression using SGD. My first instinct would be to train on each data point, one at a time, until I reach the last one; only then would I have my final parameter estimates, and hence my prediction $\hat{y}$ (sketched in code after the list below).
I understand there are some drawbacks to this:
- It will take a long time to finish, since I have to train on every data point.
- Convergence of the parameters is still not guaranteed.
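To make this concrete, here is a minimal sketch of the per-point SGD loop I have in mind (plain NumPy; the data, learning rate, and epoch count are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points, 3 features, known true weights plus noise.
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = np.zeros(3)   # parameter estimates, updated one point at a time
b = 0.0
lr = 0.01         # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):   # visit every point once per pass
        err = (X[i] @ w + b) - y[i]     # residual for this single point
        w -= lr * err * X[i]            # gradient of squared error w.r.t. w
        b -= lr * err                   # gradient w.r.t. the intercept
```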
Thus, the idea of batching comes to mind. For example, suppose I have a set of 100 data points and decide to group them into 25 batches of 4 data points each.
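In code, my current understanding of that grouping would look something like the sketch below (again with made-up data; I am assuming every point in a batch is used and the gradient is averaged over the batch, which is part of what I want to confirm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

w, b, lr = np.zeros(3), 0.0, 0.01
batch_size = 4                          # 100 points / 25 batches

for epoch in range(50):
    order = rng.permutation(len(X))     # reshuffle before each pass
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = (X[idx] @ w + b) - y[idx]          # residuals for the 4 points
        w -= lr * (X[idx].T @ err) / batch_size  # gradient averaged over batch
        b -= lr * err.mean()
```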
My questions are:
- How does this batching work? Do I randomly pick one data point from each batch to train on? That is, at the end of the first run I would have one estimate.
- Is it possible that I will end up with 4 different estimates after 4 different runs? If so, should I choose whichever one gives the smallest error?