Suppose I am trying to tune the weights W of a neural network for a non-smooth problem, using an expensive numerical approximation of the gradients. I have been stuck, unable to get a good solution in a reasonable time.
So I decide to sidestep the problem as follows:
I generate 10K weight vectors W' that cover most of the domain of values W can occupy, and for each of them I calculate the loss L using a very non-trivial set of rules involving the output of the neural network.
I then treat W' as an input space X and fit a second neural network that should approximate the calculated L. First of all, I wanted to understand whether it is even possible to map weights to loss while ignoring the true input and output spaces. It turned out that for simple problems I can actually do that.
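Concretely, what I have in mind is roughly the following (a minimal sketch in PyTorch; `true_loss`, `dim`, and the surrogate architecture are placeholders, not my actual code):

```python
import torch
import torch.nn as nn

dim = 50            # dimensionality of the weight vector W (placeholder value)
n_samples = 10_000  # the 10K random weight vectors

def true_loss(w: torch.Tensor) -> torch.Tensor:
    # Placeholder for the expensive, non-smooth loss computed from the
    # original network's output; the real rule set would go here.
    return torch.remainder(w.abs().sum(dim=-1), 1.0)

# Sample W' so that it covers the domain of W (here: uniform in [-1, 1]^dim).
X = torch.empty(n_samples, dim).uniform_(-1.0, 1.0)
Y = true_loss(X).unsqueeze(-1)  # the calculated losses L

# Surrogate network that maps a weight vector directly to its loss.
surrogate = nn.Sequential(
    nn.Linear(dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for epoch in range(200):
    opt.zero_grad()
    pred = surrogate(X)
    fit_loss = nn.functional.mse_loss(pred, Y)
    fit_loss.backward()
    opt.step()
```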
For the difficult problem I mentioned at the beginning, however, the learned mapping amounts to little more than averaging the output, with no significant correlation between the true losses Y and the predicted losses Y'.
Can I make the intuitive claim that, when I cannot map weights to loss through a composition of simple functions (which is what a neural network is), the original weight-tuning problem is too noisy or even unsolvable in principle? Or is that statement too vague, or simply wrong?
UPDATE:
Referring to an older question here:
If I am confident that there is a continuous relation between X and the surrogate network's output (where X is the large randomized set of my weight vectors), how can I proceed to apply a method that assumes differentiability to solve argmax(output) over X?
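What I have in mind, roughly, is something like the sketch below: treat the trained surrogate as a differentiable proxy and run gradient ascent on its input. This assumes the `surrogate` and `dim` from the sketch above, and it only finds an argmax of the surrogate's prediction, which is just a proxy for the true output; the starting point and the clamping to the sampled domain are my assumptions, and in practice one would likely use several random restarts.

```python
# Freeze the surrogate and optimize its input (a weight vector) by gradient ascent.
for p in surrogate.parameters():
    p.requires_grad_(False)

w = torch.empty(1, dim).uniform_(-1.0, 1.0).requires_grad_(True)
opt_w = torch.optim.Adam([w], lr=1e-2)

for step in range(1000):
    opt_w.zero_grad()
    objective = -surrogate(w).sum()  # negate: Adam minimizes, we want argmax
    objective.backward()
    opt_w.step()
    with torch.no_grad():
        w.clamp_(-1.0, 1.0)          # stay inside the domain that was sampled

best_w = w.detach()
```

Is this a reasonable way to proceed, or is there a standard method for this kind of surrogate-based optimization that I should use instead?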