TL;DR: For time series and density modeling, neural ODEs offer some benefits that we don't know how to get otherwise. For plain supervised learning there are potential computational benefits, but for practical purposes neural ODEs probably aren't worth using in that setting yet.
To answer your first question:
> Is there something NeuralODEs do that "conventional" Neural Networks cannot?
Neural ODEs differ in two ways from standard nets:
- They represent a different set of functions, which can be good or bad depending on what you're modeling.
- We have to approximate their exact solution, which gives more freedom in how to compute the answer, but adds complexity.
I'd say the clearest setting where neural ODEs help is building continuous-time time series models, which can easily handle data arriving at irregular intervals. However, ODEs can only model deterministic dynamics, so I'm more excited by the generalization of these time-series models to stochastic differential equations.
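For example, a continuous-time latent state can simply be integrated from one observation time to the next, whatever the gaps are. Here's a minimal sketch with SciPy; the linear dynamics matrix and timestamps are placeholders for illustration, where a neural ODE would use a learned network for the dynamics:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy continuous-time dynamics; in a neural ODE this would be a small network.
A = np.array([[0.0, 1.0], [-0.5, -0.1]])

def dynamics(t, h):
    return A @ h

h = np.array([1.0, 0.0])                 # initial latent state
obs_times = [0.0, 0.3, 0.35, 1.7, 2.0]   # irregularly spaced timestamps

predictions = []
for t_prev, t_next in zip(obs_times[:-1], obs_times[1:]):
    # Integrate the latent state across whatever gap separates the observations.
    sol = solve_ivp(dynamics, (t_prev, t_next), h, rtol=1e-6)
    h = sol.y[:, -1]                     # latent state at the next observation time
    predictions.append(h.copy())         # a readout network would map h to outputs
```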
If you're modeling data sampled at regular time intervals (like video or audio), I think there's not much advantage, and standard approaches will probably be simpler and faster.
Another setting where they have an advantage is in building normalizing flows for density modeling. The bottleneck in normalizing flows is tracking the change in density (the log-determinant of each layer's Jacobian), which costs O(D^3) for unrestricted layers. That's why discrete-time normalizing flow models like Glow or Real-NVP have to restrict the architectures of their layers, for example only updating half the units as a function of the other half. In continuous time, it's easier to track the change in density, even for unrestricted architectures. That's what the FFJORD paper is about. Since then, Residual Flows were developed, which are discrete-time flows that can also handle unrestricted architectures, with some caveats.
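Concretely, if the state follows dz/dt = f(z, t), then the log-density along the trajectory satisfies d log p(z(t))/dt = -Tr(∂f/∂z), so you only need a trace rather than a log-determinant. Here's a minimal sketch on a toy linear field with SciPy (FFJORD additionally estimates the trace stochastically for neural-network dynamics; the matrix A below is just a placeholder):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy dynamics dz/dt = A z; in a real continuous normalizing flow, f is a neural net.
A = np.array([[-0.5, 1.0], [-1.0, -0.5]])

def f(t, z):
    return A @ z

def trace_df_dz(t, z):
    # Trace of the Jacobian of f; exact and trivial for a linear field.
    return np.trace(A)

def augmented(t, state):
    # Integrate the state and the change in log-density together.
    z = state[:-1]
    return np.concatenate([f(t, z), [-trace_df_dz(t, z)]])

z0 = np.array([1.0, 0.5])
sol = solve_ivp(augmented, (0.0, 1.0), np.concatenate([z0, [0.0]]), rtol=1e-6)
z1, delta_logp = sol.y[:-1, -1], sol.y[-1, -1]
# log p(z1) = log p(z0) + delta_logp, with no O(D^3) determinant required.
```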
For standard deep learning, there are two potential big advantages:
- Constant memory cost at training time. Before neural ODEs there was already some work showing we could reduce the memory cost of computing reverse-mode gradients of neural networks if we could 'run them backwards' from the output, but this required restricting the architecture of the network. The nice thing about neural ODEs is that you can simply run their dynamics backwards to reconstruct the original trajectory. In both cases, compounding numerical error could in principle be a problem, but we didn't find it to be a practical concern.
- Adaptive time cost. The idea is that since we're only approximating the exact answer, sometimes we might need only a few iterations of our approximate solver to get an acceptably good answer, and so could save time (both ideas are sketched in code below).
Both of these potential advantages are shared by Deep Equilibrium Models, which have already been scaled up to transformers. But in practice, both kinds of model have so far tended to be slower overall than standard nets, because we don't yet know how to regularize them to be easy to approximate.
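Here's a minimal sketch of both ideas on a toy linear system with SciPy's adaptive solver (the dynamics, horizon, and tolerances are placeholders, not a real neural ODE): running the dynamics backwards recovers the initial state without storing the trajectory, and loosening the solver tolerance reduces the number of function evaluations.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy damped-oscillator dynamics standing in for a neural ODE's vector field.
A = np.array([[0.0, 1.0], [-1.0, -0.2]])

def f(t, z):
    return A @ z

z0 = np.array([1.0, 0.0])

# Constant-memory idea: integrate forward, keep only the final state, then
# recover the initial state by running the same dynamics backwards in time.
fwd = solve_ivp(f, (0.0, 5.0), z0, rtol=1e-8, atol=1e-10)
zT = fwd.y[:, -1]
bwd = solve_ivp(f, (5.0, 0.0), zT, rtol=1e-8, atol=1e-10)
print(np.max(np.abs(bwd.y[:, -1] - z0)))   # small reconstruction error

# Adaptive-cost idea: a looser tolerance lets the adaptive solver take fewer
# steps (fewer function evaluations), trading accuracy for speed.
loose = solve_ivp(f, (0.0, 5.0), z0, rtol=1e-2)
tight = solve_ivp(f, (0.0, 5.0), z0, rtol=1e-8)
print(loose.nfev, tight.nfev)              # loose tolerance -> fewer evaluations
```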
To answer your second question:
Is there something "conventional" Neural Networks do that NeuralODEs
cannot do?
- Conventional nets can fit non-homeomorphic functions, for example functions whose output has a smaller dimension than their input, or that change the topology of the input space. There was a nice paper from Oxford pointing out these issues, and showing that you can fix them by adding extra dimensions (a small sketch of this trick is at the end of this answer).
Of course, you could handle this by composing ODE nets with standard network layers.
- Conventional nets can be evaluated exactly with a fixed amount of computation, and are typically faster to train. Plus, with standard nets you don't have to choose an error tolerance for a solver.
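To make the augmentation fix from the first bullet concrete, here's a small sketch: in one dimension an ODE flow can never represent x ↦ -x (trajectories can't cross, so the ordering of points is preserved), but padding the input with one extra zero dimension makes it easy. The dynamics below are a hand-picked rotation rather than a trained network:

```python
import numpy as np
from scipy.integrate import solve_ivp

def dynamics(t, z):
    # Rotate in the (original, augmented) plane; after time pi this maps
    # x -> -x in the original coordinate, which a 1-D ODE flow cannot do.
    return np.array([-z[1], z[0]])

def augmented_flow(x):
    z0 = np.array([x, 0.0])                       # append an extra zero dimension
    sol = solve_ivp(dynamics, (0.0, np.pi), z0, rtol=1e-8)
    return sol.y[0, -1]                           # read out the original coordinate

print(augmented_flow(1.0), augmented_flow(-2.0))  # approximately -1.0 and 2.0
```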