Is the difference between SGD and GD really in the summation?

Question

Model: $\min _{\mathbf{w}} \frac{\lambda}{2}\|\mathbf{w}\|^{2}+\frac{1}{m} \sum_{(\mathbf{x}, y) \in S} \ell(\mathbf{w} ;(\phi(\mathbf{x}), y))$

\begin{aligned} &\text { INPUT: } S, \lambda, T \\ &\text { INITIALIZE: Set } \boldsymbol{\alpha}_{1}=0 \\ &\text { FOR } t=1,2, \ldots, T \\ &\text { Choose } i_{t} \in\{0, \ldots,|S|\} \text { uniformly at random. } \\ &\text { For all } j \neq i_{t}, \text { set } \alpha_{t+1}[j]=\alpha_{t}[j] \\ &\text { If } y_{i_{t}} \frac{1}{\lambda t} \sum_{j} \alpha_{t}[j] y_{i_{t}} K\left(\mathbf{x}_{i_{t}}, \mathbf{x}_{j}\right)<1, \text { then: } \\ &\quad \text { Set } \alpha_{t+1}\left[i_{t}\right]=\alpha_{t}\left[i_{t}\right]+1 \\ &\quad \text { Else: } \\ &\quad \text { Set } \alpha_{t+1}\left[i_{t}\right]=\alpha_{t}\left[i_{t}\right] \\ &\text { OUTPUT: } \boldsymbol{\alpha}_{T+1} \end{aligned}

I think the algorithm above is batch gradient descent.

So stochastic gradient descent is?:

\begin{aligned} &\text { INPUT: } S, \lambda, T \\ &\text { INITIALIZE: Set } \boldsymbol{\alpha}_{1}=0 \\ &\text { FOR } t=1,2, \ldots, T \\ &\text { Choose } i_{t} \in\{0, \ldots,|S|\} \text { uniformly at random. } \\ &\text { For all } j \neq i_{t}, \text { set } \alpha_{t+1}[j]=\alpha_{t}[j] \\ &\text { If } y_{i_{t}} \frac{1}{\lambda t} \alpha_{t}[i_{t}] y_{i_{t}} K\left(\mathbf{x}_{i_{t}}, \mathbf{x}_{i_{t}}\right)<1, \text { then: } \\ &\quad \text { Set } \alpha_{t+1}\left[i_{t}\right]=\alpha_{t}\left[i_{t}\right]+1 \\ &\quad \text { Else: } \\ &\quad \text { Set } \alpha_{t+1}\left[i_{t}\right]=\alpha_{t}\left[i_{t}\right] \\ &\text { OUTPUT: } \boldsymbol{\alpha}_{T+1} \end{aligned}

https://www.cs.huji.ac.il/w~shais/papers/ShalevSiSrCo10.pdf

If batch gradient descent is $\frac{d J(\boldsymbol{\alpha})}{d \alpha_{i}}=\lambda^{-1} y_{i} \sum_{j} \alpha_{j} y_{j} \mathbf{x}_{j}^{\mathrm{T}} \mathbf{x}_{i}+\sum_{i=1}^n \log \frac{\alpha_{i}}{1-\alpha_{i}}$

Then Stochastic Gradient Descent is minus the summation?

$\frac{d J(\boldsymbol{\alpha})}{d \alpha_{i}}=\lambda^{-1} \alpha_{i} \mathbf{x}_{i}^{\mathrm{T}} \mathbf{x}_{i}+\log \frac{\alpha_{i}}{1-\alpha_{i}}$?

Is the difference between SGD and GD really in the summation?

0 Answers0