Formulating the model as a set of neural nets
The problem can be solved by searching for parameters that minimize the error between the true and approximated curves. One way to think about your model is as a set of radial basis function (RBF) networks, where the parameters governing the radial basis functions are shared across networks. The neural net perspective doesn't change anything, mathematically. But it might be helpful for implementing things in common neural net software packages, or as a point of reference for finding related work. RBF networks are a standard method, although parameter sharing is the unique twist here.
Another way to think about the problem is as a dictionary learning problem, where a set of signals is to be reconstructed as linear combinations of basis signals (called atoms, which together form a dictionary). In your problem, the atoms are parameterized as radial basis functions. Given a set of signals, the goal is to learn both the dictionary and reconstruction weights. This approach is from the signal processing literature. But, most dictionary learning methods don't have the same parametric form for the basis functions as in your problem, and are geared toward sparse reconstruction from overcomplete dictionaries. So, I'll stick to the neural net perspective here.
Network architecture
The architecture is as follows. There are $N=365$ RBF networks (one to approximate each curve). Each network has a single, real-valued input unit, representing time. The input unit connects to $K=4$ hidden units with RBF activation functions. Given input $t$, the output of the $k$th RBF unit is:
$$r_k(t) = \exp \left[ -\frac{(t-c_k)^2}{2 \sigma_k^2} \right]$$
where $c_k$ is the center and $\sigma_k$ is the width. These parameters are shared across all $N$ networks. Note that this implies that the RBF unit outputs for a given $t$ are the same for all networks, so it's not necessary to recompute them for each network if each curve is sampled at the same time points. The hidden units connect to a single output unit with a linear activation function. The output unit for each network gives the approximated value of the corresponding curve. For the $j$th network, this is (as you have already written):
$$\hat{f}_j(t) = \sum_{k=1}^K w_{jk} r_k(t)$$
where $w_{jk}$ is the weight from the $k$th RBF unit to the output unit.
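To make the shared structure concrete, here's a minimal NumPy sketch of the forward pass (the array names `t`, `centers`, `widths`, and `weights` are my own). The key point is that the RBF activations are computed once and reused by all $N$ networks:

```python
import numpy as np

def rbf_activations(t, centers, widths):
    # r_k(t_i) for every time point and RBF unit; shape (n, K)
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2.0 * widths[None, :] ** 2))

def approximate_curves(t, centers, widths, weights):
    # f_hat_j(t_i) for every network and time point; weights has shape (N, K)
    r = rbf_activations(t, centers, widths)  # shared across all N networks
    return weights @ r.T                     # shape (N, n)
```

With $N = 365$ curves sampled at $n$ common time points, `curves` and the output of `approximate_curves` would both have shape `(365, n)`.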
Initialization
A good initialization is important for learning the parameters. First, you'll need an initial choice of the RBF centers and widths. See Schwenker et al. (2001) for some possible strategies. Then, considering the RBF parameters as fixed, it's possible to solve for the optimal hidden-to-output weights in closed form, since the outputs are just linear combinations of the RBF unit activations. First compute the RBF unit activations as above, for all timepoints. To find the weights for each network, perform linear regression, treating the RBF activations as inputs and the observed values of the corresponding curve as targets.
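As a sketch of this step (reusing `rbf_activations` from above; the evenly spaced centers and spacing-based widths are just one simple choice, not the only strategy):

```python
def init_rbf_params(t, K=4):
    # One simple strategy: centers evenly spaced over the time range,
    # widths set to the spacing between adjacent centers
    centers = np.linspace(t.min(), t.max(), K)
    widths = np.full(K, centers[1] - centers[0])
    return centers, widths

def init_weights(t, curves, centers, widths):
    # curves: observed values f_j(t_i), shape (N, n)
    # With the RBF parameters fixed, each w_j is an ordinary least-squares fit
    r = rbf_activations(t, centers, widths)           # design matrix, shape (n, K)
    w, *_ = np.linalg.lstsq(r, curves.T, rcond=None)  # shape (K, N)
    return w.T                                        # shape (N, K)
```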
Training
After initialization, the final step is to jointly optimize the weights and RBF parameters to minimize the error between the true and approximated curves. I'll assume the squared error here, but you could use another loss function if needed.
Suppose the curves are measured at time points $\{t_1, \dots, t_n\}$ (this assumes that all curves are measured at the same time points, but this isn't strictly necessary). Let $f_j(t)$ denote the true/observed value of curve $j$ at time $t$. Let $c = [c_1, \dots, c_K]$ and $\sigma = [\sigma_1, \dots, \sigma_K]$ denote the RBF centers and widths. Let $w_j = [w_{j1}, \dots, w_{jK}]$ denote the hidden-to-output weights for the $j$th network.
The loss function for the $j$th network is:
$$L_j(c, \sigma, w_j) =
\frac{1}{n} \sum_{i=1}^n \Big( f_j(t_i) - \hat{f}_j(t_i) \Big)^2$$
The overall loss is the sum of the loss for each network:
$$L(c, \sigma, w_1, \dots, w_N) = \sum_{j=1}^N L_j(c, \sigma, w_j)$$
The parameters are learned to minimize the overall loss:
$$\min_{c, \sigma, w_1, \dots, w_N} \ L(c, \sigma, w_1, \dots, w_N)$$
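Here is a direct NumPy translation of this objective, reusing `approximate_curves` from the forward-pass sketch. All parameters are packed into a single vector so a standard solver can be applied to it later (the packing order is an arbitrary choice of mine):

```python
def pack(centers, widths, weights):
    return np.concatenate([centers, widths, weights.ravel()])

def unpack(params, K, N):
    centers, widths = params[:K], params[K:2 * K]
    weights = params[2 * K:].reshape(N, K)
    return centers, widths, weights

def overall_loss(params, t, curves, K):
    # Sum over the N networks of the mean squared error L_j
    N = curves.shape[0]
    centers, widths, weights = unpack(params, K, N)
    preds = approximate_curves(t, centers, widths, weights)
    return np.sum(np.mean((curves - preds) ** 2, axis=1))
```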
Solving the optimization problem
The above optimization problem can be tackled using a standard optimization solver. Since there are $K(N + 2) = 1468$ parameters, it would make sense to use an algorithm like L-BFGS or the conjugate gradient method. Alternatively, if there are a huge number of timepoints, you could use stochastic gradient descent, possibly with modifications to speed up convergence (e.g. momentum or Adam). That's the typical approach with large networks and datasets, but it requires more tweaking; with only a moderate number of timepoints it would probably be overkill for this problem, and possibly slower.
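For example, with SciPy's L-BFGS implementation (a sketch building on the helpers above; note that without an explicit `jac` argument SciPy falls back to finite-difference gradients, which would be slow with ~1500 parameters, so in practice you'd supply the gradient, as discussed next):

```python
from scipy.optimize import minimize

def fit(t, curves, K=4):
    # Initialize RBF parameters and weights, then jointly refine with L-BFGS
    centers0, widths0 = init_rbf_params(t, K)
    w0 = init_weights(t, curves, centers0, widths0)
    x0 = pack(centers0, widths0, w0)
    res = minimize(overall_loss, x0, args=(t, curves, K), method="L-BFGS-B")
    return unpack(res.x, K, curves.shape[0])
```

Calling `fit(t, curves)` with `curves` of shape `(365, n)` returns the refined centers, widths, and per-curve weights.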
All of these methods require the gradient of $L$ with respect to the parameters. This should be straightforward to derive using the chain rule. Alternatively, the gradient can be computed using automatic differentiation (e.g. as built into software packages like TensorFlow and PyTorch).
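As a rough sketch of the autodiff route, assuming PyTorch (the tensor names mirror the NumPy code above, and the iteration count is arbitrary):

```python
import torch

def fit_autodiff(t, curves, centers0, widths0, w0, max_iter=200):
    c = torch.tensor(centers0, requires_grad=True)
    s = torch.tensor(widths0, requires_grad=True)
    w = torch.tensor(w0, requires_grad=True)
    tt, y = torch.as_tensor(t), torch.as_tensor(curves)
    opt = torch.optim.LBFGS([c, s, w], max_iter=max_iter)

    def closure():
        opt.zero_grad()
        r = torch.exp(-((tt[:, None] - c[None, :]) ** 2) / (2.0 * s[None, :] ** 2))
        loss = ((y - w @ r.T) ** 2).mean(dim=1).sum()  # same objective as above
        loss.backward()  # gradient of L via automatic differentiation
        return loss

    opt.step(closure)
    return c.detach().numpy(), s.detach().numpy(), w.detach().numpy()
```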
Finally, note that, because the RBF parameters are adjustable, the problem is not convex, so we'll only converge to a local minimum. This is why a good initialization is important. If necessary, the optimization can be run multiple times from different initial parameters, keeping the best solution.
References
Schwenker, F., Kestler, H. A., and Palm, G. (2001). Three learning phases for radial-basis-function networks. Neural Networks, 14(4-5), 439-458.