Formulating the model as a set of neural nets
The problem can be solved by searching for parameters that minimize the error between the true and approximated curves. One way to think about your model is as a set of radial basis function (RBF) networks, where the parameters governing the radial basis functions are shared across networks. The neural net perspective doesn't change anything, mathematically. But it might be helpful for implementing things in common neural net software packages, or as a point of reference for finding related work. RBF networks are a standard method, although parameter sharing is the unique twist here.
Another way to think about the problem is as a dictionary learning problem, where a set of signals is to be reconstructed as linear combinations of basis signals (called atoms, which together form a dictionary). In your problem, the atoms are parameterized as radial basis functions. Given a set of signals, the goal is to learn both the dictionary and reconstruction weights. This approach is from the signal processing literature. But, most dictionary learning methods don't have the same parametric form for the basis functions as in your problem, and are geared toward sparse reconstruction from overcomplete dictionaries. So, I'll stick to the neural net perspective here.
Network architecture
The architecture is as follows. There are $N=365$ RBF networks (one to approximate each curve). Each network has a single, real-valued input unit, representing time. The input unit connects to $K=4$ hidden units with RBF activation functions. Given input $t$, the output of the $k$th RBF unit is:
$$r_k(t) = \exp \left[ -\frac{(t-c_k)^2}{2 \sigma_k^2} \right]$$
where $c_k$ is the center and $\sigma_k$ is the width. These parameters are shared across all $N$ networks. Note that this implies that the RBF unit outputs for a given $t$ are the same for all networks, so it's not necessary to recompute them for each network if each curve is sampled at the same time points. The hidden units connect to a single output unit with a linear activation function. The output unit for each network gives the approximated value of the corresponding curve. For the $j$th network, this is (as you have already written):
$$\hat{f}_j(t) = \sum_{k=1}^K w_{jk} r_k(t)$$
where $w_{jk}$ is the weight from the $k$th RBF unit to the output unit.
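To make the shared structure concrete, here's a minimal NumPy sketch of the forward pass (the array names `t`, `centers`, `widths`, and `weights` are my own). The key point is that the RBF activations are computed once and reused by all $N$ networks:

```python
import numpy as np

def rbf_activations(t, centers, widths):
    # r_k(t_i) for every time point and RBF unit; shape (n, K)
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2.0 * widths[None, :] ** 2))

def approximate_curves(t, centers, widths, weights):
    # f_hat_j(t_i) for every network and time point; weights has shape (N, K)
    r = rbf_activations(t, centers, widths)  # shared across all N networks
    return weights @ r.T                     # shape (N, n)
```

With $N = 365$ curves sampled at $n$ common time points, `curves` and the output of `approximate_curves` would both have shape `(365, n)`.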
Initialization
A good initialization is important for learning the parameters. First, you'll need an initial choice of the RBF centers and widths. See Schwenker et al. (2001) for some possible strategies. Then, considering the RBF parameters as fixed, it's possible to solve for the optimal hidden-to-output weights in closed form, since the outputs are just linear combinations of the RBF unit activations. First compute the RBF unit activations as above, for all timepoints. To find the weights for each network, perform linear regression, treating the RBF activations as inputs and the observed values of the corresponding curve as targets.
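As a sketch of this step (reusing `rbf_activations` from above; the evenly spaced centers and spacing-based widths are just one simple choice, not the only strategy):

```python
def init_rbf_params(t, K=4):
    # One simple strategy: centers evenly spaced over the time range,
    # widths set to the spacing between adjacent centers
    centers = np.linspace(t.min(), t.max(), K)
    widths = np.full(K, centers[1] - centers[0])
    return centers, widths

def init_weights(t, curves, centers, widths):
    # curves: observed values f_j(t_i), shape (N, n)
    # With the RBF parameters fixed, each w_j is an ordinary least-squares fit
    r = rbf_activations(t, centers, widths)           # design matrix, shape (n, K)
    w, *_ = np.linalg.lstsq(r, curves.T, rcond=None)  # shape (K, N)
    return w.T                                        # shape (N, K)
```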
Training
After initialization, the final step is to jointly optimize the weights and RBF parameters to minimize the error between the true and approximated curves. I'll assume the squared error here, but you could use another loss function if needed.
Suppose the curves are measured at time points $\{t_1, \dots, t_n\}$ (this assumes that all curves are measured at the same time points, but this isn't strictly necessary). Let $f_j(t)$ denote the true/observed value of curve $j$ at time $t$. Let $c = [c_1, \dots, c_K]$ and $\sigma = [\sigma_1, \dots, \sigma_K]$ denote the RBF centers and widths. Let $w_j = [w_{j1}, \dots, w_{jK}]$ denote the hidden-to-output weights for the $j$th network.
The loss function for the $j$th network is:
$$L_j(c, \sigma, w_j) =
\frac{1}{n} \sum_{i=1}^n \Big( f_j(t_i) - \hat{f}_j(t_i) \Big)^2$$
The overall loss is the sum of the loss for each network:
$$L(c, \sigma, w_1, \dots, w_N) = \sum_{j=1}^N L_j(c, \sigma, w_j)$$
The parameters are learned to minimize the overall loss:
$$\min_{c, \sigma, w_1, \dots, w_N} \ L(c, \sigma, w_1, \dots, w_N)$$
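Here is a direct NumPy translation of this objective, reusing `approximate_curves` from the forward-pass sketch. All parameters are packed into a single vector so a standard solver can be applied to it later (the packing order is an arbitrary choice of mine):

```python
def pack(centers, widths, weights):
    return np.concatenate([centers, widths, weights.ravel()])

def unpack(params, K, N):
    centers, widths = params[:K], params[K:2 * K]
    weights = params[2 * K:].reshape(N, K)
    return centers, widths, weights

def overall_loss(params, t, curves, K):
    # Sum over the N networks of the mean squared error L_j
    N = curves.shape[0]
    centers, widths, weights = unpack(params, K, N)
    preds = approximate_curves(t, centers, widths, weights)
    return np.sum(np.mean((curves - preds) ** 2, axis=1))
```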
Solving the optimization problem
The above optimization problem can be tackled using a standard optimization solver. Since there are $K(N + 2) = 1468$ parameters, it would make sense to use an algorithm like L-BFGS or the conjugate gradient method. Alternatively, if there are a huge number of timepoints, you could use stochastic gradient descent, possibly with modifications to speed up convergence (e.g. momentum or Adam). That's the typical approach with large networks and datasets, but it requires more tweaking; with only a moderate number of timepoints it would probably be overkill for this problem, and possibly slower.
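For example, with SciPy's L-BFGS implementation (a sketch building on the helpers above; note that without an explicit `jac` argument SciPy falls back to finite-difference gradients, which would be slow with ~1500 parameters, so in practice you'd supply the gradient, as discussed next):

```python
from scipy.optimize import minimize

def fit(t, curves, K=4):
    # Initialize RBF parameters and weights, then jointly refine with L-BFGS
    centers0, widths0 = init_rbf_params(t, K)
    w0 = init_weights(t, curves, centers0, widths0)
    x0 = pack(centers0, widths0, w0)
    res = minimize(overall_loss, x0, args=(t, curves, K), method="L-BFGS-B")
    return unpack(res.x, K, curves.shape[0])
```

Calling `fit(t, curves)` with `curves` of shape `(365, n)` returns the refined centers, widths, and per-curve weights.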
All of these methods require the gradient of $L$ with respect to the parameters. This should be straightforward to derive using the chain rule. Alternatively, the gradient can be computed using automatic differentiation (e.g. as built into software packages like TensorFlow and PyTorch).
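As a rough sketch of the autodiff route, assuming PyTorch (the tensor names mirror the NumPy code above, and the iteration count is arbitrary):

```python
import torch

def fit_autodiff(t, curves, centers0, widths0, w0, max_iter=200):
    c = torch.tensor(centers0, requires_grad=True)
    s = torch.tensor(widths0, requires_grad=True)
    w = torch.tensor(w0, requires_grad=True)
    tt, y = torch.as_tensor(t), torch.as_tensor(curves)
    opt = torch.optim.LBFGS([c, s, w], max_iter=max_iter)

    def closure():
        opt.zero_grad()
        r = torch.exp(-((tt[:, None] - c[None, :]) ** 2) / (2.0 * s[None, :] ** 2))
        loss = ((y - w @ r.T) ** 2).mean(dim=1).sum()  # same objective as above
        loss.backward()  # gradient of L via automatic differentiation
        return loss

    opt.step(closure)
    return c.detach().numpy(), s.detach().numpy(), w.detach().numpy()
```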
Finally, note that, because the RBF parameters are adjustable, the problem is not convex, so we'll only converge to a local minimum. This is why a good initialization is important. If necessary, the optimization can be run multiple times from different initial parameters, keeping the best solution.
References
Schwenker, F., Kestler, H. A., and Palm, G. (2001). Three learning phases for radial-basis-function networks. Neural Networks, 14(4-5), 439-458.