5

I hope you can help me get some confidence in my confidence interval... I am trying to get the confidence interval for a particular (threshold) point on a predicted curve.

I find the confidence interval (C.I.) looks funny, so I would like to know whether what I am doing is correct... Thank you for any advice you can give me!

#Here is the data to which I fit a nonlinear least squares regression (power function)
y <-c(0.5745373, 0.8255836, 0.7139635, 0.6318004, 0.8688738, 0.8341626, 0.9278573, 0.6548638,
0.6995722, 0.8410211, 1.0000000, 0.8512973, 0.7917790, 0.5722589, 0.6918069, 0.7117084,
0.5279689, 0.6121315, 0.5869292, 0.7332605, 0.5991816, 0.5470566, 0.5987166, 0.4440854,
0.3719892, 0.4394820, 0.4410862, 0.4966288, 0.4291826, 0.4221613, 0.3951223, 0.3595973,
0.4617549, 0.5106482, 0.5939970, 0.5340835, 0.6036920, 0.5181577, 0.3892170, 0.3667581)

x <- c(0.95149254, 0.57954545, 0.56763699, 0.57089552, 1.00000000, 0.66870629, 0.70833333,
0.53125000, 0.58776596, 0.78061224, 0.61551724, 0.63750000, 0.85000000, 0.52397260,
0.66870629, 0.50328947, 0.52192982, 0.40691489, 0.36016949, 0.37500000, 0.35915493, 
0.53968254, 0.36334197, 0.23486842, 0.19615385, 0.35961538, 0.30039267, 0.26424870,
0.54838710, 0.23363874, 0.26936620, 0.09514925, 0.15324519, 0.32465278, 0.33909574, 
0.31587838, 0.24401914, 0.28813559, 0.23181818, 0.29272959)

plot(x,y, xlim=c(0,1), ylim=c(0,1))
m1 <- nls(y~a*x^b, start=list(a=2,b=0.2))
summary(m1)
curve((0.86888*x^0.43042), add=TRUE)
points(0.2276312,0.4595148, col='red', pch=17) # this is the threshold value I  
                                               # previously calculated

now I would like to know the Confidence Interval of the threshold

#lower CI
0.86888-(2*0.04578) # intercept minus 2* S.E.
0.43042-(2*0.06333) # exponent minus 2*S.E.
curve((0.77732*x^0.30376), add=TRUE) # plot the lower C.I.
0.77732*0.2276312^0.30376 # Now I enter the x coordinate of threshold point in the 
# Confidence interval function, to get the y coordinate
points(0.2276312,0.4958526, col='red', pch="-", cex=1.5) # plotting the CI of threshold value

#Upper CI
0.86888+(2*0.04578) # intercept plus 2* S.E.
0.43042+(2*0.06333) # exponent plus 2*S.E.
curve((0.96044*x^0.55708), add=TRUE) # plot upper C.I.
0.96044*0.2276312^0.55708 # Now I enter the x coordinate of threshold point in the 
# Confidence interval function, to get the y coordinate
points(0.2276312,0.4211113, col='red', pch="-", cex=1.5) # plot the CI of threshold value
lines(c(0.2276312,0.2276312),c(0.4211113,0.4958526), col="red")

# then check whether correct
confint(m1) # more or less, but slightly different, don't understand why?

nls confints

To me, a C.I. that looks like this would make more sense...

curve((0.96044*x^0.30376), col="green",add=TRUE)
curve((0.77732*x^0.55708), col="green",add=TRUE)

NLS_confint2

but that would be mixing the intercept of upper and exponent of lower C.I. and vice versa

My question is then, are the black Confidence Intervals correct (and thus the thresholds red confidence interval?

COOLSerdash
  • 25,317
  • 8
  • 73
  • 123
EHR
  • 73
  • 1
  • 6
  • [Cross-posted](http://stackoverflow.com/q/17379401/1412059). – Roland Jun 29 '13 at 12:09
  • The confint function is from the MASS package. Section 8.4 of *Modern Applied Statistics with S* (the manual for MASS) discusses why you are getting different answers than confint. – Tom Jun 29 '13 at 14:05
  • Welcome to the site, @EHR. *Please don't cross-post*. You need to decide if you think of this question as primarily an R programming question (how to use R to do this) or a conceptual statistical question (which is the right CI & why), and then delete the other question. – gung - Reinstate Monica Jun 29 '13 at 15:34
  • @ Tom, thank you for the tip. For anyone else reading this, here is the text http://www.stat.ubc.ca/~d.lee/TA/2011/Texts/MASS.pdf – EHR Jun 29 '13 at 15:38
  • What exactly do you mean by a "threshold"? Is it the value of the curve at a given argument $x$? Or is it the value of $x$ at which the curve attains a given value $y$? – whuber Jul 01 '13 at 22:12
  • @whuber: The threshold is a point on the curve before which y increases more rapidly than x and after which x increases more rapidly than y. It is the value of the curve at x-value 0.2276312, and at y-value 0.4958526. I would like to know the C.I. around this threshold, or, better yet, of all points on the fitted curve (so I would like to know the C.I. around the fitted curve). – EHR Jul 02 '13 at 07:06
  • (1) My calculations, using *your* values of `m1`, find that your threshold occurs where $x=0.178$, not $0.228$. Specifically, this is where $dy/dx = dx/dy = 1$. If that's not what you mean by "threshold," please clarify. (2) The CI around a threshold *which has been estimated from the data* will be (much) wider than a CI around a point determined in advance of the estimation. And what exactly do you mean by a "CI around the fitted curve"? How would you interpret it? (I am trying to see whether you might be looking for *prediction* or *tolerance* intervals rather than confidence intervals.) – whuber Jul 02 '13 at 12:02
  • @ whuber: To calculate the threshold, I calculate where the derivative of the curve equals the derivative of the secant, which is the straight line formed by joining the lowest and highest points on the curve: 0.8688*x. The curve is estimated to be: 0.86888*x^0.43042. Therefore I calculated when 0.37398*x^-0.56958 equals 0.86888. According to this calculation (0.86888/(0.86888*0.43042))^((1/(0.43042-1))) = 0.2276315, so when x=0.228... – EHR Jul 02 '13 at 15:25
  • @ whuber (continued) Would like to understand how you did your threshold calculation. Could you write out your calculation a bit more specific. Not sure I understand it now. by CI around fitted curve, I want to know how much uncertainty there is around 0.86888*x^0.43042, so that I can get an estimate of how uncertain the threshold calculation is (or any other point on the fitted curve). I would interpret it as: how confident I am that the threshold is really at x=0.227. Thanks for your help, I hope you will reply! – EHR Jul 02 '13 at 15:26
  • OK, but that's not how you first characterized that point! At your location $dy/dx \lt 1 \lt dx/dy$. Moreover, because you are calculating your threshold in a way that is critically sensitive to just two (extreme) points in the dataset, a valid CI (or PI or TI) for the threshold is going to be large. Are you sure that's how the threshold needs to be determined? (I simply used the parameter estimates to find where $dy/dx=1$, which seemed to be what you intended when you said, basically, that the "increases" in $x$ and $y$ were equal.) – whuber Jul 02 '13 at 15:26
  • (To appreciate the sensitivity issue, imagine--hypothetically--that the rightmost point were determined to be invalid. Dropping it would not change the fitted curve much, but (due to the unusually low $y$ value of the new right endpoint) the secant would now have such a low slope that your computed threshold would be greater than $x=1$ and beyond the range of the data!) – whuber Jul 02 '13 at 15:30
  • @whuber I am not sure this is the way to do it, but I based it on the Dmax method, which calculates 'heart rate deflection point', by calculating the maximum perpendicular distance from a straight line formed by joining the lowest and highest points on the curve. (Cheng et al. 1992 'a new approach for the calculation of ventilatory and lactation thresholds'). Could you tell me a bit more about what is 'dy/dx=dx/dy=1' and how it yields 0.178. Sorry, am not a math wizard at all :( – EHR Jul 02 '13 at 15:33
  • @whuber, (in reaction to your sensitivity point)... I meant the line between the endpoints of the curve, not the end datapoints... so if you add this to the plot, that is the secant line: curve((0.86888*x), add=TRUE,). so loosing that rightmost datapoint would not change curve, nor the threshold much – EHR Jul 02 '13 at 15:36
  • OK: how are the endpoints of the curve determined? Are their $x$ values established beforehand or are they derived from the data? Perhaps, so that this comment thread doesn't become a long painful search to find out crucial details one by one, you could just tell us what these data are and what this threshold is supposed to mean. – whuber Jul 02 '13 at 15:48
  • the endpoints are derived from the data. How i find the threshold is by finding out where is the maximum distance from the curve to the secant line (to see this secant line, add this code to the plot "curve((0.86888*x), add=TRUE,)" ).Threshold should represent the point before which y increases more rapidly than x, and the point after which x increases more rapidly than y. I am interested in getting the C.I. around the fitted curve. (that would automatically tell me how certain I can be of that threshold, and all other points on the fitted or predicted curve) – EHR Jul 02 '13 at 16:11
  • @whuber Could you tell me a bit more about what is 'dy/dx=dx/dy=1' and how it yields 0.178? Please let me know if my previous comment is making sense to you! – EHR Jul 02 '13 at 17:23
  • 1
    $dy/dx$ refers to the slope of the curve. Calculus tells us the slope of $y=ax^b$ equals $(ba)x^{b-1}$. Obtain your threshold like this: `f – whuber Jul 02 '13 at 17:49
  • I understand that dy/dx refers to the slope, but why must it equal a value of 1 to find the threshold? Do you mean by threshold what I described as threshold (point before which y increases more rapidly than x, and the point after which x increases more rapidly than y)? – EHR Jul 02 '13 at 21:36
  • the x values are measurements, divided by the maximum value of x (same goes for y). So you are right that the maximum of x=1 is not random. By taking every value as a proportion of the maximum observed value does not matter in terms of the values of the fitted curve (which is what I am interested in; how fast does y decline when x has declined by a certain percentage). – EHR Jul 02 '13 at 21:39
  • Re the edit (and introduction of graphs): you seem to want *prediction limits* rather than confidence limits. Please [search our site](http://stats.stackexchange.com/questions/tagged/prediction-interval?sort=votes) for a discussion of the distinctions. – whuber Jul 17 '13 at 13:56
  • yes, you are right i think. I want to predict the interval in which the threshold will fall. As the threshold is a calculated, single point on the predicted curve, it seems I have to talk about prediction limits... – EHR Jul 17 '13 at 16:43

1 Answers1

5

Analysis

According to comments to the question, the "threshold" is determined this way:

I calculate where the derivative of the curve equals the derivative of the secant, which is the straight line formed by joining the lowest and highest points on the curve.

The "lowest" and "highest" points are at the extremes of the $x$ values and

are derived from the data.

Proposed solution

Because the secant is determined by the extremes of $x$, how they are determined is critical.

I therefore suggest estimating the bias in the estimation of the threshold, $t$, as well as a confidence interval for it, using a semi-parametric bootstrap. In this method the model is:

  • The x values, $x_i$, are drawn independently and randomly from a distribution. (It looks like they tend to lie between $0$ and $1$.) This is the "parametric" part of the bootstrap. We will need to explore how the results depend on assumptions about this distribution.

  • The y values, $y_i = a x_i^b$, are computed using the fitted parameters $\hat{a}$ and $\hat{b}$ from the original data in place of the unknown values $a$ and $b$, respectively.

  • To these y values are added "errors" drawn independently and with replacement from the residuals of the original fit. This is the "nonparametric" part of the bootstrap: we have assumed that the vertical deviations of the original data from the correct curve are independent and identically distributed and we have taken the residuals from the original fit to be a representative sample of that distribution.

  • The model is fit to the $(x_i, y_i)$ pairs and a threshold for them is computed as described previously.

The distribution of these thresholds is the bootstrap distribution. Any deviation between its central location and our original estimate of the threshold, $\hat{t}$, gives us an additive bias correction factor. The spread of this distribution lets us erect approximate confidence intervals around the (bias-corrected) estimate.

The bootstrap distribution is estimated by performing a large (but necessarily finite) number of simulations from the model. Usually a thousand or so iterations are enough to give good information.

Example scenarios

Here are results for three distributions of $X$: the original one (in which the $x_i$ never vary), a uniform distribution, and a Beta distribution that approximates the distribution of the original data. $10^4$ iterations were used.

Histograms

Each plot marks the estimated threshold $\hat{t}$ with a vertical red line. Because that line is centered on the "Original X" histogram, we deduce there is little bias and can take the extreme values of this histogram as approximate confidence limits for $t$: roughly from $0.412$ to $0.433$ (95% confidence).

The second and third plots show how exquisitely sensitive $\hat{t}$ is to the distribution of the $x_i$ and how biased an estimator it might be: notice how much wider the range of threshold is on their horizontal axes. In the second plot, where it is likely the smallest $x_i$ is close to $0$ and the largest $x_i$ might be substantially less than $1$, the reference slope ("secant") is relatively high, pushing the estimated threshold to the left where the curve has greater slopes. The bias is large, approximately $+0.12$, merely because the largest $x_i$ in the dataset is $1$ and the smallest, $0.095$, is fairly far from zero. In the third plot the bias has disappeared (!) but the range of estimated thresholds is still wide. A symmetric 95% confidence interval for $t$ is $[0.33, 0.52]$ (based on $10^4$ iterations). I suspect the lack of bias may be a bit of luck because the $x_i$ are centered fairly close to the original $\hat{t}$. This can be confirmed by drawing the $x_i$ from other distributions, such as a Beta(4,1) distribution (centered near $2/3$): the bias is now $-0.21$.

This is about as far as we can reasonably take the analysis without knowing more about how the $x_i$ are determined. The results are interesting insofar as they reveal how important it is to understand how the data arise: there is a huge range of answers available here, all of which are consistent with the numbers, but only some of which could be appropriate for what actually occurred during this study.


Code

R code to produce these results follows. (Change n.iter to obtain other than $10^3$ iterations.)

y <-c(0.5745373, 0.8255836, 0.7139635, 0.6318004, 0.8688738, 0.8341626, 0.9278573, 0.6548638,
      0.6995722, 0.8410211, 1.0000000, 0.8512973, 0.7917790, 0.5722589, 0.6918069, 0.7117084,
      0.5279689, 0.6121315, 0.5869292, 0.7332605, 0.5991816, 0.5470566, 0.5987166, 0.4440854,
      0.3719892, 0.4394820, 0.4410862, 0.4966288, 0.4291826, 0.4221613, 0.3951223, 0.3595973,
      0.4617549, 0.5106482, 0.5939970, 0.5340835, 0.6036920, 0.5181577, 0.3892170, 0.3667581)

x <- c(0.95149254, 0.57954545, 0.56763699, 0.57089552, 1.00000000, 0.66870629, 0.70833333,
       0.53125000, 0.58776596, 0.78061224, 0.61551724, 0.63750000, 0.85000000, 0.52397260,
       0.66870629, 0.50328947, 0.52192982, 0.40691489, 0.36016949, 0.37500000, 0.35915493, 
       0.53968254, 0.36334197, 0.23486842, 0.19615385, 0.35961538, 0.30039267, 0.26424870,
       0.54838710, 0.23363874, 0.26936620, 0.09514925, 0.15324519, 0.32465278, 0.33909574, 
       0.31587838, 0.24401914, 0.28813559, 0.23181818, 0.29272959)
#
# Fit a curve y = a*x^b to the data using nonlinear least squares,
# compute the "threshold," and return its value.
# (Called by `sim`.)
#
fit <- function(x, y, a=1, b=0.5) {
  m <- nls(y ~ a * x^b, start=list(a=a, b=b))
  coeff <- m$m$getPars()
  f <- function(x) coeff[1] * x^coeff[2]
  fp <- function(x) coeff[1] * coeff[2] * x^(coeff[2]-1)
  x.min <- min(x)
  x.max <- max(x)
  secant <- (f(x.max) - f(x.min)) / (x.max - x.min);
  threshold <- uniroot(function(x) fp(x) - secant, c(0,10^6))$root #$
  return(threshold)
}
#
# Perform a bootstrap simulation with the specified model:
# `n` is the number of iterations
# `f` is the original estimated curve (as a vectorized function)
# `e` is the array of residuals to the initial fit (it defines an empirical
#     distribution of residuals)
# `alpha` and `beta` are parameters for a Beta distribution of the x[i]
# `x` is the array of x[i] to use when not simulating the x[i].
#
sim <- function(n = 1000, f, e, alpha=1, beta=1, x=NULL) {
  replicate(n, {
    if (is.null(x)) {
      x.0 <- rbeta(length(e), alpha, beta)     
    } else {
      x.0 <- x
    }
    y.0 <- f(x.0) + sample(e, length(e), replace=TRUE)
    fit(x.0, y.0)
  })
}
#
# Obtain the initial fit.
#
m1 <- nls(y ~ a * x^b, start=list(a=1, b=0.5))
#
# Compute the residuals `e` and the curve function `f`.
#
e <- residuals(m1)
coeff <- m1$m$getPars()
f <- function(x) coeff[1] * x^coeff[2]
#
# Prepare to bootstrap and plot several scenarios.
# Each scenario should take 2-3 seconds for n.iter=1000.
#
par(mfrow=c(1,3))
n.iter <- 10^3
b <- seq(0.15, 0.65, 0.025)
set.seed(17)
#
# Scenario 1: keep the x values fixed.
#
threshold <- sim(n.iter, f, e, x=x)
hist(threshold, main="Original X", freq=FALSE)
abline(v = fit(x, y), col="Red")
quantile(threshold, c(0.025, 0.975)) # A symmetric CI
#
# Scenario 2: vary the x values uniformly in [0,1].
#
threshold <- sim(n.iter, f, e, alpha=1, beta=1)
hist(threshold, main="X ~ Uniform(0,1)", freq=FALSE, breaks=b)
abline(v = fit(x, y), col="Red")
quantile(threshold, c(0.025,0.5, 0.975))
#
# Scenario 3: vary the x values according to Beta(4,4).
#
threshold <- sim(n.iter, f, e, alpha=4, beta=4)
hist(threshold, main="X ~ Beta(4,4)", freq=FALSE, breaks=b)
abline(v = fit(x, y), col="Red")
quantile(threshold, c(0.025, 0.975))
whuber
  • 281,159
  • 54
  • 637
  • 1,101
  • Thank you @whuber, for taking the time to explain all this! I had started with the bootstrapping, but am a real beginner bootstrapper... Still do not fully understand though how you get the threshold from the secant line. I think our secant lines differ?. Would you be able to plot the curve and secant line with the bootstrapped CI's? Then we can compare. I would do it if I knew how to use/plot the CI's you produced... – EHR Jul 04 '13 at 07:25
  • This is also interesting! http://quantitativeconservationbiology.wordpress.com/2013/07/02/confidence-interval-for-a-model-fitted-with-nls-in-r/ – EHR Jul 04 '13 at 12:23
  • The threshold is computed by `fit` by finding the x value at which the slope of the fitted curve (calculated by `fp`) equals the slope of the secant (stored in `secant`). I can't compare this to your method because you haven't indicated how you "previously calculated" the threshold for your data and your fit. – whuber Jul 04 '13 at 14:37
  • Ok, but then it should be the same, because that is also what I do when calculating the threshold (see comment above). To calculate the threshold, I calculate where the derivative of the curve equals the derivative of the secant. The curve is 0.86888*x^0.43042. Therefore I calculated when 0.37398*x^-0.56958 (derivative of curve) equals 0.86888 (derivative of secant). Then, x^-0.56958=0.86888/0.37398, so x^-0.56958=2.32333, making x=2.32333^(1/-0.56958) = 0.2276. Not sure then why it is not the same as your result... – EHR Jul 05 '13 at 06:27
  • You are not computing the secant in the way you stated in your comments! You wrote it is "the straight line formed by joining the lowest and highest points on the curve." In response to a question about the domain of "the curve," you wrote "the endpoints are derived from the data." Well, your data go from `min(x)`, which equals 0.095, to `max(x)`, equal to 1. Those points are (0.095, 0.316) and (1, 0.867). The slope of the secant therefore equals (0.867 - 0.316) / (1 - 0.095) = 0.61. If you meant something else, just change the calculation of `secant` in the `fit` function. – whuber Jul 05 '13 at 13:01
  • I apologize for creating the mix-up, you are right I should have said the lowest and highest points OF the curve. I have now changed the secant line to secant – EHR Jul 06 '13 at 08:00
  • Yes: leave out the `breaks` option in `hist`. But `coeff[1]` is not the slope of the secant. What do you mean by points "of" the curve? Exactly what do you maintain is the domain of your curve? (The formula applies to all non-negative real values, no matter what the coefficients might be.) – whuber Jul 07 '13 at 12:29
  • Yes, coeff[1] is the derivative of the slope of the secant, which is what I set equal to the derivative of the curve...By points 'of the curve' I mean the end points of the curve (on the x axis); so the domain of the curve is set by these end points of the curve, and that also goes for the secant line. Maybe easier to plot so you can see! – EHR Jul 08 '13 at 08:45
  • plot(x,y, xlim=c(0,1), ylim=c(0,1)); m1 – EHR Jul 08 '13 at 08:48
  • It is now clear you are assuming the domain of "the curve" $f$ *begins at* $x=0$. The "secant" you refer to passes through the points $(0,0)$ and $(1, a)$, and *not* through $(x_\text{min}, f(x_\text{min}))$, $(x_\text{max}, f(x_\text{max}))$. – whuber Jul 08 '13 at 12:48
  • Thank you for all your help whuber! Was wondering, would you know how I can fix when I get the following error message "sim(n.iter, f, e, x=x) Error in numericDeriv(form[[3L]], names(ind), env) : Missing value or an infinity produced when evaluating the model", after running the scenario 1 code "threshold – EHR Jul 08 '13 at 18:12
  • Dear whuber, I was wondering whether you could give me a little bit more explanation of how the bootstrap works exactly in my example. Also I would like to acknowledge your help in my manuscript. If you don't mind, can you send your email address to ereuchlin@gmail.com so I can get your name, position, institute? – EHR Jul 17 '13 at 09:06
  • (1) What do you need the explanation of? After all, the code provides all the details. (2) The numeric derivative is probably computed by `uniroot` (or possibly by `nls`). Its occurrence is due to your particular data, so I cannot readily reproduce it. One approach to debugging is to select an alternative root-finding method, preferably one that does not compute derivatives. If the problem lies with `nls`, try fitting the model `log(y) ~ log(x)` instead. – whuber Jul 17 '13 at 13:15
  • @ whuber;thanks for your help. 3 last questions, hope you have the time and energy to help me with! Would you be able to answer – EHR Jul 27 '13 at 13:59
  • 1) why a semi-parametric vs. a parametric bootstrap? 2) I was also wondering whether choosing the third scenario (Scenario 3: vary the x values according to Beta(4,4)) is preferred over the other two, i.e. what are the benefits of selecting this one over the others. My guess is that using the distribution of my original data will result in a confidence interval that is more representative of reality. Also the confidence intervals appear somewhat larger and thus a little bit more precautionary... 3) what does Beta(4,4) indicate? – EHR Jul 27 '13 at 14:00