
Cross-posted from Bioinformatics SE here. The program is used for regression of a bioassay standard curve, but the question I'm asking is statistical in nature.

I am attempting to write a tool for my lab that automatically generates a 5PL Standard Curve for ELISA data from an XLSX file template. I am using a pandas dataframe to hold the 96 wells as rows, plus a separate dataframe for the Standard Curve wells (not all wells need a known-concentration column). I am also using scipy and numpy for the analysis, and matplotlib.pyplot for plotting the curve (to eventually be output to the XLSX), though I am open to suggestions.
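For reference, the 5PL model being fit (matching the logistic5 function in the code below) is

$$y = D + \frac{A - D}{\left(1 + (x/C)^{B}\right)^{E}},$$

where $A$ is the minimum asymptote, $B$ the steepness, $C$ the inflection point, $D$ the maximum asymptote, and $E$ the asymmetry factor.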

So far, the program correctly takes the input, generates the aforementioned dataframes, and makes initial guesses for the 5PL's parameters. I was attempting to modify the 4PL code given here to work with my standard curve, using all of the actual standard replicates to optimize the curve, and using a 5PL instead of a 4PL. However, when I use scipy's least_squares function for parameter optimization, I receive the following warning and traceback:

C:\Users\Jake\Anaconda3\lib\site-packages\ipykernel_launcher.py:6: RuntimeWarning: overflow encountered in power

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-70b754f01d56> in <module>
      4 
      5 # Fit equation using least squares optimization
----> 6 plsq = least_squares(fun = residuals, jac = 'cs', x0 = p0, bounds = (0.0001, 100000), args=(y_meas, x))

~\Anaconda3\lib\site-packages\scipy\optimize\_lsq\least_squares.py in least_squares(fun, x0, jac, bounds, method, ftol, xtol, gtol, x_scale, loss, f_scale, diff_step, tr_solver, tr_options, jac_sparsity, max_nfev, verbose, args, kwargs)
    793 
    794     if not in_bounds(x0, lb, ub):
--> 795         raise ValueError("`x0` is infeasible.")
    796 
    797     x_scale = check_x_scale(x_scale, x0)

ValueError: `x0` is infeasible.

I was initially receiving these errors with the aforementioned article's suggested scipy.optimize.leastsq function, but I assumed the optimizer was driving the exponent parameters to 0 and hitting an overflow, so I switched to scipy.optimize.least_squares for its bounds argument. The code in question is below. Any help would be greatly appreciated! Thank you so much!

(I did find lmfit-py, though I was hoping to complete this project on my own if possible to get a better understanding of the regression. Still open to using it instead if that seems to be a better option.)
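For anyone trying to run this: std_df, std_df_rep_cols, and minBlk_ser are produced by my XLSX-parsing step (not shown). A hypothetical minimal stand-in, with made-up values, so the snippet below is self-contained:

import numpy as np
import pandas as pd

# Hypothetical stand-ins (values made up) for the objects my XLSX-parsing
# step produces. 'conc' holds the known standard concentrations; the 'rep'
# columns hold the measured responses for each replicate well.
std_df = pd.DataFrame({
    'conc': [1000.0, 250.0, 62.5, 15.6, 3.9, 0.98],
    'rep1': [2.10, 1.65, 0.95, 0.42, 0.15, 0.06],
    'rep2': [2.05, 1.70, 0.98, 0.40, 0.14, np.nan],  # one missing replicate
})
std_df_rep_cols = ['rep1', 'rep2']    # names of the replicate columns
minBlk_ser = pd.Series([0.04, 0.05])  # blank-well responses used for the min-asymptote guess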

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import least_squares

def logistic5(x, A, B, C, D, E):
    """5PL logistic equation: A = min asymptote, B = steepness,
    C = inflection point, D = max asymptote, E = asymmetry factor."""
    return D + ((A - D) / (np.power((1 + np.power((x / C), B)), E)))

def residuals(p, y, x):
    """Deviations of data from fitted 5PL Curve"""
    A, B, C, D, E = p
    err = y - logistic5(x, A, B, C, D, E)
    return err

def peval(x, p):
    """Evaluated value at x with current parameters"""
    A, B, C, D, E = p
    return logistic5(x, A, B, C, D, E)

x = []
y_meas = []
for ind, row in std_df.iterrows(): # Iterate through rows in std dataframe.
    for rep in std_df_rep_cols: # Iterate through replicate columns.
        if not pd.isnull(row[rep]): # If there is a value in the cell...
            x.append(row['conc']) # Add an iteration of the respective conc. to the x list.
            y_meas.append(row[rep]) # Add the cell's value to the y_meas list.

x = np.asarray(x)
y_meas = np.asarray(y_meas)

print(len(x))
print(len(y_meas))

print(x)
print(y_meas)

# Initial approximations of parameters
A = np.amax([np.amin(minBlk_ser)] + [0]) # Min asymptote
D = np.amax(y_meas) # Max asymptote
B = (D - A) / np.amax(x) # Steepness
C = np.amax(x) / 2 # Inflection Point, conc at which y = (D - A) / 2
E = 1 # Asymmetry factor
print(A, B, C, D, E)

p0 = [A, B, C, D, E] # List containing initial guesses as arg for least_squares
y_true = logistic5(x, A, B, C, D, E) # Curve evaluated at the initial parameter guesses (for plotting/comparison)

# Fit equation using least squares optimization
plsq = least_squares(fun = residuals, x0 = p0, bounds = (0.0001, 100000), args=(y_meas, x))
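
As pointed out in the comments below, passing the scalar pair (0.0001, 100000) applies the same bounds to all five parameters, so x0 is infeasible whenever any initial guess (here A can be 0) falls outside that window. A minimal sketch of per-parameter bounds; the specific limits are my assumptions, not tested values:

# Sketch only: per-parameter bounds as (lower, upper) arrays in the same
# A, B, C, D, E order as p0. The specific limits are assumptions; the key
# ideas are keeping B, C, and E strictly positive (avoiding the power
# overflows) and leaving A and D unbounded so x0 stays feasible.
lower = np.array([-np.inf, 1e-4, 1e-4, -np.inf, 1e-4])
upper = np.array([np.inf, np.inf, np.inf, np.inf, np.inf])
plsq = least_squares(fun=residuals, x0=p0, bounds=(lower, upper), args=(y_meas, x))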
  • The community will likely vote to close this post as being wholly about coding. Let me point out an anomaly at the end of your code: it is strange to see the values of the bounds explicitly specified within code that otherwise appears to adapt well to the data. This looks like it's at the root of the error message, and that's no surprise, because there is no basis to suppose that all solutions will universally lie within such bounds. – whuber Dec 04 '19 at 20:15
  • Thank you so much for the suggestion @whuber, and for the heads-up! If it's not allowed, I totally understand. That is a very good point... The bounds I had selected were poorly planned; I had entered them as known bounds for separate variables, though I see now that I am allowed to feed a tuple of array-likes containing lower and upper bounds for each of the parameters. Parameters B and E should be non-zero, as I believe zero values there are causing an overflow. The upper limit was chosen because it is an upper limit for the data being read in, and I had therefore assumed it was an upper limit for D as the max asymptote. – Jake Steele Dec 04 '19 at 20:34
  • Additionally, running the optimization without bounds using scipy.optimize.leastsq, as mentioned in the Duke article I linked, DOES give results (something my current code does not do), but the resulting graph is a poor fit and also gives rise to new runtime errors. (optimization gives RuntimeWarnings of "overflow encountered in power," "invalid value encountered in power," and "divide by zero encountered in true_divide." Then graph plotting gives the RuntimeWarning of "overflow encountered in power.") – Jake Steele Dec 04 '19 at 20:40
  • This kind of fitting can be tricky. See https://stats.stackexchange.com/questions/7308 for an example and some discussion. – whuber Dec 04 '19 at 21:03
  • If you are still interested in this problem I think I can show you a stable solution strategy. It would be nice to see a representative data set to test my approach on so I know I'm not wasting your time (and mine). – dave fournier Jan 11 '20 at 03:18
  • Hello, @davefournier. Sorry for the late response. If you are still interested, I would be happy to work through it with you! For a clearer picture and a sample XLSX, please see this GitHub repo: https://github.com/jake-steele/tidyPlate. Let me know if there's anything else I could provide. I'm still struggling to sort this out and have just been reading more on regression in the meantime. – Jake Steele Jan 27 '20 at 21:29
  • OK, you should understand that I know nothing about your field. What I know about is how to fit a 5PL to data. So I think that a data set is a bunch of (x, y) pairs and we want to find a 5PL which "fits" the data. If so, I need some examples of various difficulties. If that does not describe what you want, please enlighten me. – dave fournier Jan 28 '20 at 00:43
  • Oh, okay! You're exactly right. We typically generate the curve from response values (y-axis, in relative light units/RLUs, from instrument) from a series of serially-diluted standard at known concentrations (x-axis, in units of concentration, ~24 measurements), then use the curve to estimate unknown concentrations based on response values. I have added a small sampling of Std Curves as an XLSX to the [GitHub page](https://github.com/jake-steele/tidyPlate) above, including calculated 5PL values. Industry standard formula viewable [here](https://www.myassays.com/assay.aspx?id=789). Thank you! – Jake Steele Jan 29 '20 at 18:08
  • I fit your first model. I assume we are using least squares for this. My sum of squares for the 24 points of data set 1 is 0.0415983, with estimates a = -1.1027e-02, b = 2.0651e+00, c = 6.7676e+01, d = 1.6591e+00, m = 6.7785e-01. I put your parameter estimates for the first model into the least squares computation and got a sum of squares of 0.0459915. Your parameter estimates were a = 0, b = 1.924, c = 73.64, d = 1.665, m = 0.7571, which are different enough to make me wonder if we are talking about exactly the same model. Please advise. – dave fournier Feb 01 '20 at 19:50
  • Hello again, Dave! I used those numbers to interpolate unknown concentrations and got relatively similar results. Further, the 125 ng/mL std concentration had a large variation between replicates (coefficient of variation was ~12.5%), and I have no guarantee that the current method of curve generation is optimal. That said, your method appears to at least be a step in the right direction! How did you get there, and would you be open to me trying the method on other data sets to compare to my current method? I will update with the results! Thank you so much for your continued work on this! – Jake Steele Feb 03 '20 at 16:30
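
For reference, the interpolation of unknowns mentioned in the comments amounts to inverting the 5PL for $x$. A minimal sketch, assuming the same logistic5 parameterization as above (inverse_logistic5 is a hypothetical helper name):

def inverse_logistic5(y, A, B, C, D, E):
    """Back-calculate concentration x from response y by inverting the 5PL.
    Only defined for responses strictly between the asymptotes A and D."""
    return C * np.power(np.power((A - D) / (y - D), 1.0 / E) - 1.0, 1.0 / B)

# e.g., with plsq from least_squares: unknown_conc = inverse_logistic5(y_unknown, *plsq.x)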
