How do I interpret the covariance matrix from a curve fit?

Question

I'm not too great at statistics, so apologies if this is a simplistic question. I am fitting a curve to some data, and sometimes my data best fits a negative exponential in the form $a * e^{(-b * x)} + c$, and sometimes the fit is closer to $a * e^{(-b * x^2)} + c$. However, sometimes both of those fail, and I would like to fall back to a linear fit. My question is, how can I determine which model fits a particular data set the best from the resulting variance-covariance matrix that is returned from the scipy.optimize.curve_fit() function? I believe the variance is on one of the diagonals of this matrix, but I'm not sure how to interpret that.

UPDATE: Based on a similar question, I'm hoping that the variance-covariance matrix can tell me which of the three models I am attempting best fits the data (I am trying to fit many datasets to one of these three models).

The resulting matrices look like this for the given example:

pcov_lin 
[[  2.02186921e-05  -2.02186920e-04]
 [ -2.02186920e-04   2.76322124e-03]]
pcov_exp
[[  9.05390292e+00  -7.76201283e-02  -9.20475334e+00]
 [ -7.76201283e-02   6.69727245e-04   7.90218415e-02]
 [ -9.20475334e+00   7.90218415e-02   9.36160310e+00]]
pcov_exp_2 
[[  1.38338049e-03  -7.39204594e-07  -7.81208814e-04]
 [ -7.39204594e-07   8.99295434e-09   1.92970700e-06]
 [ -7.81208814e-04   1.92970700e-06   9.14746758e-04]]

Here is an example of what I am doing:

import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import scipy.optimize

def exp_func(x, a, b, c):
    return a * np.exp(-b * x) + c

def exp_squared_func(x, a, b, c):
    return a * np.exp(-b * x*x*x) + c

def linear_func(x, a, b):
    return a*x + b

def main():
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], np.float)
    y = np.array([1, 1, 1, 1, 0.805621, 0.798992, 0.84231, 0.728796, 0.819471, 0.570414, 0.355124, 0.276447, 0.159058, 0.0762189, 0.0167807, 0.0118647, 0.000319948, 0.00118267, 0, 0, 0], np.float)

    p0 = [0.7746042467213462, 0.10347274384077858, -0.016253458007293588]
    popt_lin, pcov_lin      = scipy.optimize.curve_fit(linear_func, x, y)
    popt_exp, pcov_exp      = scipy.optimize.curve_fit(exp_func, x, y)
    popt_exp_2, pcov_exp_2  = scipy.optimize.curve_fit(exp_squared_func, x, y)

    plt.figure()
    plt.plot(x, y, 'ko', label="Original data")
    plt.plot(x, linear_func(x, *popt_lin), 'r-', label='linear')
    plt.plot(x, exp_func(x, *popt_exp), 'b-', label='exponential')
    plt.plot(x, exp_squared_func(x, *popt_exp_2), 'g-', label='exponential squared')
    plt.legend()
    plt.show()

if __name__ == '__main__':
    main()

It is great that you link to that CV question and, consequently, to the important comment thread (b/w rolando2, Frank Harrell, ...) questioning whether it is appropriate to pick the model post facto based on fit. Perhaps it is better to use prior knowledge of the system to pick the model. — Aman, Dec 01 '13 at 04:14
This other question on CV might be helpful: http://stats.stackexchange.com/questions/50830/can-i-convert-a-covariance-matrix-into-uncertainties-for-variables — Aman, Dec 01 '13 at 06:36
Could this be helpful to understand how to interpret co-variance matrix https://stats.stackexchange.com/questions/10795/how-to-interpret-an-inverse-covariance-or-precision-matrix/10801#10801 - I would say the value of the third models are smaller, indicating less deviation. — Areza, Nov 15 '18 at 15:43

score 7 · Answer 1 · answered Feb 07 '19 at 15:33

As a clarification, the variable pcov from scipy.optimize.curve_fit is the estimated covariance of the parameter estimate, that is loosely speaking, given the data and a model, how much information is there in the data to determine the value of a parameter in the given model. So it does not really tell you if the chosen model is good or not. See also this.

The problem what is a good model is indeed a hard problem. As argued by statisticians

All models are wrong, but some are useful

So the criteria to use in comparison of different models depends on what you want to achieve.

For instance, if you want a curve that is the "close as possible" to the data, you could select a model which gives the smallest residual. In your case it would be the model func and the estimated parameters popt that has the lowest value when computing

numpy.linalg.norm(y-func(x, *popt))

However, if you select a model with more parameters, the residual will automatically decrease, at the cost of higher model complexity. So then it comes back to what the goal is of the model.

How do I interpret the covariance matrix from a curve fit?

1 Answers1

Linked