I have been working on this problem for a while, and thanks to @StopClosingQuestionsFast's suggestions and some literature research, I have managed to come up with a solution that seems to work.
So, let's restate the problem:
$\tilde{y} = y + n_y$
$\tilde{s}_1 = s_1 + n_1$
$\tilde{s}_2 = s_2 + n_2$
$y = \beta_1s_1 + \beta_2s_2 + \beta_3f(\beta_4, \beta_5)$
I am trying to fit the observation $\tilde{y} \in \mathbb{R}^{n\times1}$, which carries noise $n_y\sim\mathcal{N}(0, C_y)$, with the model $y$, which depends on the parameter vector $\beta\in \mathbb{R}^{5\times1}$. The model also depends on two vectors $s_1$ and $s_2$ that are not known exactly; only the observations $\tilde{s}_1$ and $\tilde{s}_2$ with perturbations $n_1\sim\mathcal{N}(0, C_1)$ and $n_2\sim\mathcal{N}(0, C_2)$ are available. The goal is to estimate $\beta$; however, since $s_1$ and $s_2$ are not known, they must be estimated as well.
To solve the problem we need to maximize the likelihood, which can be written as follows:
$\begin{align*}
p(\tilde{y}, \tilde{s}_1, \tilde{s}_2 | \beta, s_1, s_2) = &\frac{1}{\sqrt{(2\pi)^n\det{C_y}}}\exp{(
-\frac{1}{2} (\tilde{y}-y)^\intercal C_y^{-1} (\tilde{y}-y)
)} \\ \times &
\frac{1}{\sqrt{(2\pi)^n\det{C_1}}}\exp{(
-\frac{1}{2} (\tilde{s}_1-s_1)^\intercal C_1^{-1} (\tilde{s}_1-s_1)
)} \\ \times &
\frac{1}{\sqrt{(2\pi)^n\det{C_2}}}\exp{(
-\frac{1}{2} (\tilde{s}_2-s_2)^\intercal C_2^{-1} (\tilde{s}_2-s_2)
)}
\end{align*}$
Substituting the expression for $y$, the negative log-likelihood can be written as
$\begin{align*}
-\log{(p(\tilde{y}, \tilde{s}_1, \tilde{s}_2 | \beta, s_1, s_2))} = &
\frac{1}{2}(\tilde{y}-\beta_1s_1 - \beta_2s_2 - \beta_3f(\beta_4, \beta_5))^\intercal C_y^{-1} (\tilde{y}-\beta_1s_1 - \beta_2s_2 - \beta_3f(\beta_4, \beta_5))
\\ + & \frac{1}{2}(\tilde{s}_1-s_1)^\intercal C_1^{-1} (\tilde{s}_1-s_1)
\\ + & \frac{1}{2}(\tilde{s}_2-s_2)^\intercal C_2^{-1} (\tilde{s}_2-s_2)
\\ + & const
\end{align*}$
To find the optimal values of $\beta$, $s_1$, and $s_2$, one needs to minimize the negative log-likelihood (and thus maximize the likelihood). In general, since $f$ depends non-linearly on $\beta_4$ and $\beta_5$, the optimization problem is non-linear and has to be treated numerically with appropriate algorithms. However, the problem can be substantially simplified if we first express $s_1$ and $s_2$ as functions of $\beta$ and rewrite the negative log-likelihood as a function of $\beta$ alone. This reduces the numerical overhead associated with minimizing $-\log{(p)}$.
To find the optimal $s_1$ and $s_2$ for a given $\beta$, we take the following partial derivatives and set them to zero:
$\frac{\partial}{\partial s_1}(-\log{(p)}) = -\beta_1C_y^{-1}(\tilde{y}-\beta_1s_1 - \beta_2s_2-\beta_3f) - C_1^{-1}(\tilde{s}_1 - s_1)=0$
$\frac{\partial}{\partial s_2}(-\log{(p)}) = -\beta_2C_y^{-1}(\tilde{y}-\beta_1s_1 - \beta_2s_2-\beta_3f) - C_2^{-1}(\tilde{s}_2 - s_2)=0$
I have omitted the dependence of $f$ on $\beta_4$ and $\beta_5$ for brevity. These equations can be rewritten as follows:
$(\beta_1^2C_y^{-1} + C_1^{-1})s_1 + \beta_1\beta_2C_y^{-1}s_2=\beta_1C_y^{-1}(\tilde{y}-\beta_3f) +C_1^{-1} \tilde{s}_1$
$\beta_1\beta_2C_y^{-1}s_1 + (\beta_2^2C_y^{-1} + C_2^{-1})s_2 = \beta_2C_y^{-1}(\tilde{y}-\beta_3f) +C_2^{-1} \tilde{s}_2 $
Now, by stacking the vectors, we can solve this system for $s_1$ and $s_2$:
$\begin{pmatrix}
s_1 \\
s_2 \\
\end{pmatrix}=\begin{pmatrix}
\beta_1^2C_y^{-1} + C_1^{-1} & \beta_1\beta_2C_y^{-1} \\
\beta_1\beta_2C_y^{-1} & \beta_2^2C_y^{-1} + C_2^{-1} \\
\end{pmatrix}^{-1}\begin{pmatrix}
\beta_1C_y^{-1} & C_1^{-1} & 0_{n\times n} \\
\beta_2C_y^{-1} & 0_{n\times n} & C_2^{-1} \\
\end{pmatrix}\begin{pmatrix}
\tilde{y} - \beta_3f \\
\tilde{s}_1 \\
\tilde{s}_2
\end{pmatrix}$
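For completeness, here is a minimal NumPy sketch of this block solve at a fixed $\beta$ (the function and variable names are illustrative and not part of the class below); using np.linalg.solve avoids forming the explicit inverse written above:

import numpy as np

def solve_s(beta, f, y_t, s1_t, s2_t, Cy_inv, C1_inv, C2_inv):
    """Solve the stacked linear system for s1, s2 at fixed beta.

    Cy_inv, C1_inv, C2_inv are the (n x n) inverse covariance matrices,
    y_t, s1_t, s2_t the observed vectors, f the evaluated non-linear term.
    """
    b1, b2, b3 = beta[:3]
    n = y_t.size
    # left-hand side block matrix of the normal equations derived above
    A = np.block([[b1**2 * Cy_inv + C1_inv, b1 * b2 * Cy_inv],
                  [b1 * b2 * Cy_inv,        b2**2 * Cy_inv + C2_inv]])
    # right-hand side: the second block matrix applied to the stacked data
    rhs = np.concatenate([b1 * Cy_inv @ (y_t - b3 * f) + C1_inv @ s1_t,
                          b2 * Cy_inv @ (y_t - b3 * f) + C2_inv @ s2_t])
    s = np.linalg.solve(A, rhs)   # no explicit matrix inverse needed
    return s[:n], s[n:]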
The solution above can be used to rewrite $-\log{(p)}$ as a function of $\beta$ alone, which can then be minimized with standard numerical methods. To get the initial values for the optimization, the linear portion can be solved with generalized least squares, i.e. assuming that $\tilde{s}_1$ and $\tilde{s}_2$ are noise-free. This, of course, works only if you have a reasonable estimate of the non-linear parameters. In my field I can usually say what they are within reason, so this is not a big issue.
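Concretely, with the non-linear parameters fixed at their starting values $\beta_4^{(0)}$ and $\beta_5^{(0)}$, the initial guess for the linear parameters is the standard generalized least squares estimate (this is what gls_lin computes in the code below):
$\begin{pmatrix}\beta_1 \\ \beta_2 \\ \beta_3\end{pmatrix} = (X^\intercal C_y^{-1}X)^{-1}X^\intercal C_y^{-1}\tilde{y}, \qquad X = \begin{pmatrix}\tilde{s}_1 & \tilde{s}_2 & f(\beta_4^{(0)}, \beta_5^{(0)})\end{pmatrix}$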
To compute the covariance matrix of the estimated $\beta$, one needs the inverse of the Hessian of the negative log-likelihood with respect to $\beta$, $s_1$, and $s_2$.
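In practice I approximate the Hessian from the Jacobian of the whitened residual vector returned by the optimizer (the Gauss-Newton approximation), which is what the tls method below does. Writing the whitened residuals as $r(\theta)$ with $\theta = (\beta, s_1, s_2)$,
$-\log{(p)} = \frac{1}{2}r(\theta)^\intercal r(\theta) + const, \qquad H \approx J^\intercal J, \qquad J = \frac{\partial r}{\partial \theta}, \qquad \mathrm{Cov}(\hat{\theta}) \approx (J^\intercal J)^{-1},$
and $\mathrm{Cov}(\hat{\beta})$ is the top-left $5\times5$ block of $\mathrm{Cov}(\hat{\theta})$.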
Below is a sample class written in Python that implements the fitting. The actual minimization of the negative log-likelihood uses the Levenberg-Marquardt method, as it seemed to be the most efficient in my testing.
In the example below I assume that the non-linear parameters $\beta_4$ and $\beta_5$ are approximately known (as they usually are in the problems I am dealing with) and use them as starting values; the goal is then to estimate the parameters together with their uncertainties.
import numpy as np
from matplotlib import pyplot as plt
from scipy import optimize
#%%
# non-linear partial TLS class
class NLTLS:
def __init__(self, x, y_t, s1_t, s2_t, f_func,
Cy, C1, C2,
beta_nl_0, m=5):
self.x = x
self.y_t = y_t
self.s1_t = s1_t
self.s2_t = s2_t
self.f_func = f_func
self.Cy = Cy
self.C1 = C1
self.C2 = C2
self.Cy_inv = np.linalg.pinv(Cy)
self.C1_inv = np.linalg.pinv(C1)
self.C2_inv = np.linalg.pinv(C2)
self.Ly = np.linalg.cholesky(self.Cy_inv)
self.L1 = np.linalg.cholesky(self.C1_inv)
self.L2 = np.linalg.cholesky(self.C2_inv)
self.beta_nl_0 = beta_nl_0
self.n = x.size
self.m = m
def get_y(self, beta, s1, s2):
b1, b2, b3, b4, b5 = beta
return b1*s1 + b2*s2 + b3*self.f_func([b4, b5], self.x)
def get_s(self, beta):
b1, b2, b3, b4, b5 = beta
f = self.f_func([b4, b5], self.x)
a11 = b1**2 * self.Cy_inv + self.C1_inv
a12 = b1 * b2 * self.Cy_inv
a22 = b2**2 * self.Cy_inv + self.C2_inv
A = np.linalg.pinv(np.block([[a11, a12],
[a12, a22]]))
b11 = b1 * self.Cy_inv
b12 = self.C1_inv
b13 = np.zeros((self.n,self.n))
b21 = b2 * self.Cy_inv
b23 = self.C2_inv
B = np.block([[b11, b12, b13],
[b21, b13, b23]])
z = np.block([self.y_t - b3*f, self.s1_t, self.s2_t])
s = A @ B @ z
        return s[:self.n], s[self.n:]
def chisq(self, beta, s1, s2):
b1, b2, b3, b4, b5 = beta
ry = self.y_t - b1*s1 - b2*s2 - b3*self.f_func([b4, b5], self.x)
r1 = self.s1_t - s1
r2 = self.s2_t - s2
return (ry @ self.Cy_inv @ ry +
r1 @ self.C1_inv @ r1 +
r2 @ self.C2_inv @ r2)
def resid_ols(self, beta):
b1, b2, b3, b4, b5 = beta
ry = self.y_t - b1*self.s1_t - b2*self.s2_t - b3*self.f_func([b4, b5], self.x)
return self.Ly.T @ ry
    def resid_tls_beta(self, beta): # residuals only as a function of beta
b1, b2, b3, b4, b5 = beta
s1, s2 = self.get_s(beta)
ry = self.y_t - b1*s1 - b2*s2 - b3*self.f_func([b4, b5], self.x)
r1 = self.s1_t - s1
r2 = self.s2_t - s2
return np.hstack((self.Ly.T @ ry,
self.L1.T @ r1,
self.L2.T @ r2))
def resid_tls(self, p):
b1, b2, b3, b4, b5 = p[:self.m]
s1 = p[self.m : self.n+self.m]
s2 = p[self.n+self.m:]
ry = self.y_t - b1*s1 - b2*s2 - b3*self.f_func([b4, b5], self.x)
r1 = self.s1_t - s1
r2 = self.s2_t - s2
return np.hstack((self.Ly.T @ ry,
self.L1.T @ r1,
self.L2.T @ r2))
def gls_lin(self, q):
b4, b5 = q
X = np.hstack((self.s1_t[:, None],
self.s2_t[:, None],
self.f_func([b4, b5], self.x)[:, None]))
b1,b2,b3 = np.linalg.pinv(X.T @ self.Cy_inv @ X) @ X.T @ self.Cy_inv @ self.y_t
return [b1,b2,b3,b4,b5]
def gls(self):
self.beta_ols_lin = self.gls_lin(self.beta_nl_0)
result = optimize.least_squares(self.resid_ols,
self.beta_ols_lin,
method='lm')
self.beta_ols = result['x']
self.y_ols = self.get_y(self.beta_ols, self.s1_t, self.s2_t)
J_w = result['jac']
H = J_w.T @ J_w
self.C_beta_ols = np.linalg.pinv(H)
# calculate y_ols uncertainty
J = np.linalg.pinv(self.Ly.T) @ J_w
self.Cy_ols = (J @ np.linalg.pinv(H) @ J.T)
def tls(self):
# get the starting guess for linear portion of beta
self.beta_ols_lin = self.gls_lin(self.beta_nl_0)
# optimize the values
result_beta = optimize.least_squares(self.resid_tls_beta,
self.beta_ols_lin,
method='lm')
# assign the results
self.beta_tls = result_beta['x']
self.s1_tls, self.s2_tls = self.get_s(self.beta_tls)
self.y_tls = self.get_y(self.beta_tls, self.s1_tls, self.s2_tls)
# estimate uncertainties for beta_tls - done separately for efficiency
q_all = np.hstack((self.beta_tls, self.s1_tls, self.s2_tls))
result_all = optimize.least_squares(self.resid_tls, q_all, method='lm')
J_w = result_all['jac']
C_all = np.linalg.pinv(J_w.T @ J_w)
self.C_beta_tls = C_all[: self.m, : self.m]
self.C1_tls = C_all[self.m : self.m+self.n,
self.m : self.m+self.n]
self.C2_tls = C_all[self.m+self.n :,
self.m+self.n :]
# calculate y_tls and its uncertainty
Ly12_inv = np.hstack((np.linalg.pinv(self.Ly.T),
np.linalg.pinv(self.L1.T),
np.linalg.pinv(self.L2.T)))
J = Ly12_inv @ J_w
self.Cy_tls = J @ C_all @ J.T
def pretty_print(self, beta_true):
dof = self.n-self.m
print('Printing TLS results')
print('1. MSE')
print('TLS\tGLS')
print('%0.3f' %(self.chisq(self.beta_tls, self.s1_tls, self.s2_tls)/dof),
'\t%0.3f' %(self.chisq(self.beta_ols, self.s1_t, self.s2_t)/dof))
print('')
print('2. Best fit parameters w/ 1-sigma errors')
        print('\ttrue\t TLS\t\t\t GLS')
for i in range(self.m):
print('b'+str(i+1), '\t%0.3f' %beta_true[i], '\t',
'%0.3f' %self.beta_tls[i], '+/-',
'%0.3f' %np.sqrt(self.C_beta_tls[i, i]), '\t',
'%0.3f' %self.beta_ols[i], '+/-',
'%0.3f' %np.sqrt(self.C_beta_ols[i, i]))
def pretty_plot(self, y_true, s1_true, s2_true, plot_ols=True):
plt.figure()
plt.clf()
plt.subplot(311)
plt.plot(self.x, y_true, 'b-', label='true')
plt.plot(self.x, self.y_t, 'k.-', label='data')
plt.plot(self.x, self.y_tls, 'r-', label='TLS')
if plot_ols:
plt.plot(self.x, self.y_ols, 'r:', label='OLS')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.subplot(312)
plt.plot(self.x, s1_true, 'b-', label='true')
plt.plot(self.x, self.s1_t, 'k.-', label='data')
plt.plot(self.x, self.s1_tls, 'r-', label='TLS')
plt.legend()
plt.xlabel('x')
plt.ylabel('s1')
plt.subplot(313)
plt.plot(self.x, s2_true, 'b-', label='true')
plt.plot(self.x, self.s2_t, 'k.-', label='data')
plt.plot(self.x, self.s2_tls, 'r-', label='TLS')
plt.legend()
plt.xlabel('x')
plt.ylabel('s2')
if __name__ == '__main__':
# Generate true x, s1, s2, f, y, beta
b1, b2, b3 = -3, 4.5, 1
b4, b5 = 2, 0.5
beta_true = [b1, b2, b3, b4, b5]
def f_func(beta, x):
return np.sin(beta[0]*x + beta[1])
n = 201 # number of points in x
x = np.linspace(0, 10, n)
s1_true = np.exp( - ((x-2)**2) / (np.sqrt(2)*1)**2)
s2_true = np.exp( - ((x-8)**2) / (np.sqrt(2)*0.5)**2)
f_true = f_func([b4, b5], x)
y_true = b1*s1_true + b2*s2_true + b3*f_true
# Generate covariance matrices for the data
np.random.seed(1)
m = 10000
N_y = np.random.randn(n, m)*1 + (x[:, None]-2)**2 * np.random.randn(1, m)
Cy_data = np.cov(N_y)/m
N_1 = np.random.randn(n, m)*1 + (x[:, None]-2)**3 * np.random.randn(1, m)/10
C1_data = np.cov(N_1)/m
N_2 = np.random.randn(n, m)*1 + (x[:, None]-8)**3 * np.random.randn(1, m)/10
C2_data = np.cov(N_2)/m
# generate data
def generate_data(a_list, C_list):
d_list = []
for a, C in zip(a_list, C_list):
L = np.linalg.cholesky(np.linalg.pinv(C))
noise = np.linalg.pinv(L.T) @ np.random.randn(a.size)
d = a + noise
d_list.append(d)
return d_list
y_data, s1_data, s2_data = generate_data([y_true, s1_true, s2_true],
[Cy_data, C1_data, C2_data])
# Solving the problem
nltls = NLTLS(x, y_data, s1_data, s2_data, f_func,
Cy_data, C1_data, C2_data,
[b4, b5])
nltls.gls()
nltls.tls()
nltls.pretty_print(beta_true)
nltls.pretty_plot(y_true, s1_true, s2_true)
To quickly check whether the parameters obtained from the fitting (and their estimated covariances) behave as expected under a normal distribution, I ran the fitting on 100 different datasets and checked whether the results fall within the standard confidence intervals corresponding to 1, 2, and 3 sigma (a bit clumsy, but it works):
# Check if predicted CI are correct
def is_within_123_sigma(z_true, z_est, Cz):
return np.abs(z_est - z_true) < np.array([1,2,3])[:, None]*np.sqrt(np.diag(Cz))
def test_ci(k=1):
is_within_ols = np.zeros((k, 3, 5), bool)
is_within_tls = np.zeros((k, 3, 5), bool)
for i in range(k):
print('Progress: ', i+1, '/', k)
y_data, s1_data, s2_data = generate_data([y_true, s1_true, s2_true], [Cy_data, C1_data, C2_data])
nltls = NLTLS(x, y_data, s1_data, s2_data, f_func,
Cy_data, C1_data, C2_data,
[b4, b5])
nltls.gls()
nltls.tls()
is_within_ols[i, :, :] = is_within_123_sigma(beta_true, nltls.beta_ols, nltls.C_beta_ols)
is_within_tls[i, :, :] = is_within_123_sigma(beta_true, nltls.beta_tls, nltls.C_beta_tls)
return is_within_ols, is_within_tls
h=100
is_within_ols, is_within_tls = test_ci(k=h)
def print_ci_results(method_str, is_within):
    print('% of trials where the true beta is within x-sigma for ' + method_str)
print('par\t1-sigma\t2-sigma\t3-sigma')
for i in range(5):
        bla = ['%.1f' % j for j in np.sum(is_within[:, :, i], axis=0) / h * 100]
print('b'+str(i+1), end='\t')
for j in bla: print(j, end='\t')
print('')
print_ci_results('GLS', is_within_ols)
print_ci_results('TLS', is_within_tls)
Below is the resulting table showing the percentage of trials in which the true value fell within the corresponding confidence interval. The coverage looks reasonably close to the nominal 68.3%, 95.4%, and 99.7%, especially at the 95% level:
par 1-sigma 2-sigma 3-sigma
b1 66.0 96.0 100.0
b2 78.0 96.0 99.0
b3 68.0 95.0 100.0
b4 70.0 98.0 99.0
b5 67.0 96.0 99.0
I have implemented a more general class that handles an arbitrary number of inputs $s$, multiple $y$'s, and a more complex covariance structure for $y$. I will share a git repository soon, once I clean up the code a bit.
I hope this helps anyone who encounters similar problems in the future.
Additional reading:
- A good paper with detailed derivations for partial linear TLS:
https://link.springer.com/article/10.1007/s00190-012-0552-9
- Good paper dealing with linear TLS with correlated errors:
https://arc.aiaa.org/doi/abs/10.2514/6.2019-1931
- Papers by Alireza Amiri-Simkooei and his collaborators were really helpful