Background
I have a collection of functions with trainable parameters that I am implementing as Keras model classes, which gives me immediate access to a variety of objective functions, optimizers, and training-related methods (e.g. an early-stopping callback).
These functions take a single input variable, produce a single output variable, and have no more than a dozen parameters. The number of explicitly-written operators ('+', '-', '*', '/', 'exp', 'log', 'arctan') is also around a dozen, although I caution that this measure of model complexity is unreliable (equivalent expressions can have more or fewer explicitly-written operators). The point is that these are not enormously complex models like those used in deep learning. The following example illustrates this description.
Example
Verhulst growth model: $$P(t) = \frac{K}{1+ \left( \frac{K-P_0}{P_0} \right) \exp \{-rt\}}$$ where $P(t)$ is the population size at time $t$, $K$ is the carrying capacity, $P_0$ is the initial population size, and $r$ is the "unimpeded" exponential rate constant.
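To make this concrete, here is a stripped-down sketch of the kind of model class I mean (the parameter initial values, synthetic data, and training call are illustrative only, not my actual implementation):

```python
import numpy as np
import tensorflow as tf

class VerhulstModel(tf.keras.Model):
    """Verhulst growth curve with trainable parameters K, P0, r."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # The "ones" initializer is a placeholder; choosing good starting
        # values is exactly the problem described below.
        self.K = self.add_weight(name="K", shape=(), initializer="ones")
        self.P0 = self.add_weight(name="P0", shape=(), initializer="ones")
        self.r = self.add_weight(name="r", shape=(), initializer="ones")

    def call(self, t):
        # P(t) = K / (1 + ((K - P0) / P0) * exp(-r t))
        return self.K / (
            1.0 + ((self.K - self.P0) / self.P0) * tf.exp(-self.r * t)
        )

# Illustrative usage on synthetic data (K=100, P0=5, r=0.8):
t = np.linspace(0.0, 10.0, 50).astype("float32")
P = (100.0 / (1.0 + (95.0 / 5.0) * np.exp(-0.8 * t))).astype("float32")

model = VerhulstModel()
model.compile(optimizer="adam", loss="mse")
model.fit(t, P, epochs=5, verbose=0)  # callbacks, early stopping, etc. plug in here
```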
Problem Statement
I started off with random initialization of parameters by sampling from either a standard normal distribution or a uniform distribution over $[0,1]$. But I have encountered the following issues, which I have only partially addressed:
- Non-convexity of the loss function (often mean-squared error) over the parameters of many of these models, combined with sampling the parameter space near the boundaries of convex regions, has produced parameter estimates that simply started in the wrong "valley".
- If an initial parameter value is quite far from its optimal value, even within the same convex region, convergence can take an extremely long time. (The sketch after this list illustrates both failure modes.)
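The following sketch (building on the model class above; all numeric choices are illustrative) refits the model from ten standard-normal draws and records each run's final loss. Draws that make $K$ or $P_0$ non-positive, or that land in the wrong valley, finish with much worse (or NaN) losses, while distant-but-valid draws converge slowly:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

# Synthetic data from known parameters (K=100, P0=5, r=0.8).
t = np.linspace(0.0, 10.0, 50).astype("float32")
P = (100.0 / (1.0 + (95.0 / 5.0) * np.exp(-0.8 * t))).astype("float32")

final_losses = []
for _ in range(10):
    model = VerhulstModel()  # the class from the sketch above
    # Standard-normal initialization, as described in the post.
    for w in model.weights:
        w.assign(float(rng.standard_normal()))
    model.compile(optimizer=tf.keras.optimizers.Adam(0.05), loss="mse")
    history = model.fit(t, P, epochs=300, verbose=0)
    final_losses.append(history.history["loss"][-1])

# A wide spread (and occasional NaNs from K or P0 near zero) shows how
# strongly the outcome depends on the starting point.
print(sorted(final_losses, key=lambda x: (np.isnan(x), x)))
```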
I have found it useful to study 3D surface plots and contour plots of the loss function over pairs of parameters, along with Hessian-based tests of convexity. For sufficiently small datasets and simple models, it is possible to copy-paste the data into tools like the Desmos calculator and manually tune parameters, but this does not scale. I have room to grow on this subject, and a source that accelerates my learning could make a tangible difference in my productivity when building the training methods for my models.
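For reference, here is a minimal sketch of both diagnostics on synthetic data (grid ranges, step size, and the evaluation point are illustrative choices): a contour plot of the MSE over $(K, r)$ with $P_0$ held fixed, and a finite-difference Hessian whose eigenvalues indicate local convexity:

```python
import numpy as np
import matplotlib.pyplot as plt

def verhulst(t, K, P0, r):
    return K / (1.0 + ((K - P0) / P0) * np.exp(-r * t))

# Synthetic data (same illustrative parameters as above).
t = np.linspace(0.0, 10.0, 50)
P = verhulst(t, 100.0, 5.0, 0.8)

# Contour plot of MSE over a (K, r) grid with P0 fixed at its true value.
P0_fixed = 5.0
KK, RR = np.meshgrid(np.linspace(50.0, 150.0, 200), np.linspace(0.1, 2.0, 200))
mse = np.mean(
    (verhulst(t[None, None, :], KK[..., None], P0_fixed, RR[..., None]) - P) ** 2,
    axis=-1,
)
plt.contour(KK, RR, np.log(mse), levels=30)  # log scale reveals the valleys
plt.xlabel("K"); plt.ylabel("r"); plt.title("log MSE over (K, r), P0 fixed")
plt.show()

# Finite-difference Hessian at a point: all-positive eigenvalues suggest the
# loss is locally convex there in (K, r).
def loss(theta):
    K, r = theta
    return np.mean((verhulst(t, K, P0_fixed, r) - P) ** 2)

def hessian_fd(f, x, h=1e-3):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

print(np.linalg.eigvalsh(hessian_fd(loss, np.array([100.0, 0.8]))))
```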
Question
Does there exist a guide for designing self-starting estimators (i.e. routines that compute good initial parameter values) for such parametric functions?