I've spent a few days reading some of the new papers about Neural SDEs. For example, here is one from Tzen and Raginsky and here is one that came out simultaneously by Peluchetti and Favaro. There are others which I plan to read next. This work all seems to be inspired by the recent popularity of Neural ODEs and also ResNets. The basic idea, which is arrived at via a different route in each paper, is that if we consider the input data as arriving at time $t=0$ and the output as produced at time $t=1$, then, under certain assumptions on the distribution of the network weights and activations, the evolution of the data from one layer to the next inside the network is akin to a stochastic process. The more layers you have, the smaller the $\Delta t$ between layers. In the limit as the number of layers goes to infinity, the network approaches a true stochastic differential equation.
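To make that limit concrete (this is my own summary of the standard construction, not notation taken from either paper): a residual-style layer update

$$x_{k+1} = x_k + \mu(x_k)\,\Delta t + \sigma(x_k)\,\sqrt{\Delta t}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),$$

is exactly an Euler–Maruyama step, so as the number of layers grows and $\Delta t \to 0$, the hidden state converges (under suitable regularity conditions) to a solution of

$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t.$$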
I am still working through the math, which is my main objective. However, what I find missing from these papers is an answer to: why is this important? The question is not whether this is interesting; it is certainly interesting from a purely mathematical perspective. But what is the practical importance? What is the impact of this technology?
I was at first excited about this because I thought it proposed a way to use a neural network to learn the parameters of an SDE by fitting it to real time-series data where we don't know the form of the underlying data-generating process. However, I noticed that the experiment in Peluchetti and Favaro is simply the MNIST data set, while the experiment in Tzen and Raginsky is in fact a simulated SDE. The latter fits better with my intuition; a sketch of what I had in mind follows.
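For concreteness, here is a minimal sketch of the kind of fitting I had in mind, not something taken from either paper: learn neural drift and diffusion functions by maximizing the Gaussian transition likelihood implied by a single Euler–Maruyama step. All names here are mine, and the data is a toy simulated Ornstein–Uhlenbeck path standing in for real time-series observations.

```python
import torch
import torch.nn as nn

# Sketch (my own, not from either paper): fit neural drift and diffusion
# functions to one observed path by maximizing the transition likelihood
# implied by a single Euler-Maruyama step:
#   x_{k+1} | x_k ~ N(x_k + mu(x_k) * dt, sigma(x_k)^2 * dt)

class DriftDiffusion(nn.Module):
    def __init__(self, dim=1, hidden=32):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.sigma = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, dim), nn.Softplus())

    def forward(self, x):
        return self.mu(x), self.sigma(x)

def euler_maruyama_nll(model, xs, dt):
    # Negative log-likelihood of the consecutive transitions, up to a constant.
    mu, sigma = model(xs[:-1])
    mean = xs[:-1] + mu * dt
    var = sigma.pow(2) * dt + 1e-8  # small floor for numerical stability
    resid = xs[1:] - mean
    return 0.5 * (resid.pow(2) / var + var.log()).sum()

# Toy data: a simulated Ornstein-Uhlenbeck path standing in for real data.
dt, n = 0.01, 1000
xs = torch.zeros(n, 1)
for k in range(n - 1):
    xs[k + 1] = xs[k] - 0.5 * xs[k] * dt + 0.2 * dt ** 0.5 * torch.randn(1)

model = DriftDiffusion()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = euler_maruyama_nll(model, xs, dt)
    loss.backward()
    opt.step()
```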
So, again, my question is: what is the general importance of Neural SDEs? And a secondary question: am I correct in thinking that this technology proposes a new way to fit a model to data that we suppose is generated by a stochastic process?
**Update**
Well, I am still interested to hear what the community has to say, but I have kept reading and found a great new paper which proposes to train Neural SDEs via GANs (generative adversarial networks). The literature review in this paper is also insightful, confirming what I suspected: each set of authors of the founding papers views the problem slightly differently. For example, the Tzen and Raginsky paper describes fitting the network only from a single initial point to a single terminal value, whereas in this new paper the model can be fit to an entire stochastic process via GANs, and the authors give four empirical examples. This seems similar to how we fit a Gaussian process to empirical data, in that we require the stochastic process to pass through all the points in our data set.
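To record my current understanding of the GAN setup (a rough sketch under my own assumptions, not the paper's actual architecture): the generator unrolls a neural SDE to produce sample paths, and the discriminator scores whole paths. The GRU discriminator below is just my stand-in for whatever path summary the paper actually uses.

```python
import torch
import torch.nn as nn

class GeneratorSDE(nn.Module):
    # Generator: a neural SDE whose drift and diffusion are small MLPs;
    # a fake path is produced by unrolling Euler-Maruyama steps.
    def __init__(self, dim=1, hidden=32):
        super().__init__()
        self.dim = dim
        self.mu = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.sigma = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, dim), nn.Softplus())

    def forward(self, batch, n_steps, dt):
        x = torch.randn(batch, self.dim)  # random initial condition
        path = [x]
        for _ in range(n_steps - 1):
            noise = dt ** 0.5 * torch.randn_like(x)
            x = x + self.mu(x) * dt + self.sigma(x) * noise
            path.append(x)
        return torch.stack(path, dim=1)  # (batch, n_steps, dim)

class PathDiscriminator(nn.Module):
    # Discriminator: scores an entire sample path; a GRU is my stand-in
    # for the paper's actual path-level discriminator.
    def __init__(self, dim=1, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, paths):
        _, h = self.rnn(paths)
        return self.head(h[-1])  # one score per path

# Training would then alternate standard GAN updates: show the discriminator
# batches of real and generated paths, and train the generator to make its
# paths indistinguishable from the real ones.
```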