The universal approximation theorem says that a neural network can approximate any continuous function, under some regularity conditions, with arbitrarily small approximation error, provided the hidden layer has enough nodes.
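A minimal sketch of this idea, assuming a single hidden ReLU layer: fix random hidden weights and fit only the output layer by least squares, so that with enough hidden nodes a smooth target like sin(x) is matched closely on a grid. All names and constants here are illustrative choices, not anything from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a compact interval.
x = np.linspace(-3, 3, 100)
y = np.sin(x)

# One hidden ReLU layer with fixed random weights and biases;
# only the linear output layer is fitted.
n_hidden = 200
w = rng.uniform(-2, 2, size=n_hidden)   # hidden weights (random, not trained)
b = rng.uniform(-3, 3, size=n_hidden)   # hidden biases (random, not trained)
H = np.maximum(w * x[:, None] + b, 0)   # activations, shape (100, n_hidden)

# Fit output weights by least squares; with n_hidden > n grid points,
# the approximation error on the grid is essentially zero.
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
max_err = np.max(np.abs(H @ coef - y))
print(max_err)
```

This is only the "enough nodes" half of the story: the theorem guarantees such an approximation exists, not that gradient training finds it.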
Suppose a response y depends on a p-dimensional vector x only through s of the entries (s ≪ p). Ideally, a neural network trained on (x, y) would detect this sparse structure and effectively ignore the p − s irrelevant coordinates.
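This setting can be sketched as follows; the particular link function, noise level, and values of n, p, and s are arbitrary illustrations, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 1000, 50, 3  # sample size, ambient dimension, sparsity level

X = rng.normal(size=(n, p))
active = rng.choice(p, size=s, replace=False)  # the s relevant coordinates

# y depends on x only through the s active entries (via a nonlinear link),
# plus a little observation noise; the other p - s coordinates are pure noise.
y = (np.sin(X[:, active[0]])
     + X[:, active[1]] * X[:, active[2]]
     + 0.1 * rng.normal(size=n))
```

The question is whether a network trained on (X, y) concentrates its first-layer weights on the `active` columns, or spreads them across all p inputs.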
However, in practice this does not usually happen. Why can a neural network not learn the sparse structure automatically? Is it because of the non-convexity of the optimization problem, or because the sample size is too small to learn this specific relationship?