
Up to now, I have the following lmer() model, fit with the lme4 package in R:

fit <- lmer(log(log(Amplification)) ~ poly(Voltage, 3) +
            (poly(Voltage, 3) | Serial_number), data = APD)

while the data looks like this:

Serial_number Lot Wafer Amplification Voltage Dark_current
24 912009913 9 912 2.26756 189.957 -0.168324 
25 912009913 9 912 2.66278 199.970 -0.185998
(...)
74 912009913 9 912 1903.74000 377.999 2.941360
97 912009897 9 912 2.33190 189.974 -0.200348
98 912009897 9 912 2.75746 199.984 -0.222305
(...)

The data consist of many electronic devices, all measured at one of four similar measurement stations. The stations are nominally the same, but each device was measured only once, at a single station; e.g. device 1 was measured at station 3 and device 2 at station 1. The measurement itself increased the voltage step by step and recorded the actual amplification of the current at each step, so each device went through one measurement procedure. Additionally, all devices are of the same type (ideally they would be identical, but they differ individually due to manufacturing variability). A little complication arises from the fact that the measurement procedure is unknown to me, since an external lab performed it. Furthermore, the procedure changed over the years, in ways that are also unknown to me.

My goal: I want to fit each device (Amplification against Voltage), and I need the fit to be as reliable/authentic as possible. If it matters, I am only interested in the region of about Voltage = {300, 450}, because I need the voltage corresponding to Amplification = 150. This is a region where two major phenomena happen: the devices become very sensitive, and at an even higher voltage they become extremely sensitive, i.e. Amplification has a vertical asymptote (maybe it can be understood as a so-called infinite resonance). Higher voltages are never tested because they would destroy the device. I need the most reliable fit to obtain the most authentic operating voltage for each device. As mentioned, the motivation is to operate the devices at the same amplification, and the amplification changes with the voltage, so the operating voltage is very important.

Minimal data set and script: https://files.fm/u/5yy22kkm

Some plots and further information: A data set of about 150 devices (in total I have about 1500 devices): Data1

A data set of only one single device: Single subset/Serial_number

Other details: the variables "Wafer" and "Lot" are grouping variables for the devices. That means that about 20 devices come from the same Wafer and e.g. 200 devices come from the same Lot.

A plot of the data points for all devices, after taking $\log(\log(\cdot))$ of Amplification: Data2

I removed many data points (all those with Amplification < 100 or > 200) and got this for all devices:

Adjusted data range
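In R the subsetting step is just a one-liner (a sketch, assuming my APD data frame from above):

APD_sub <- subset(APD, Amplification >= 100 & Amplification <= 200)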

Here is an inverse regression to calculate the voltage corresponding to a given Amplification = 150 (a minimal sketch of this step follows the plot):

Single device
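For reference, a minimal sketch of this inverse step (the helper name and the search interval are placeholders, not my actual script): numerically solve the fitted curve of one device for the voltage at which the prediction equals log(log(150)).

library(lme4)

# Hypothetical helper: invert the fitted curve of one device numerically.
# `fit` is the lmer() model from above, `sn` one Serial_number level.
voltage_at <- function(fit, sn, target_A = 150, interval = c(300, 450)) {
  f <- function(v)
    predict(fit, newdata = data.frame(Voltage = v, Serial_number = sn)) -
      log(log(target_A))
  uniroot(f, interval)$root  # voltage where the prediction hits target_A
}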

This inverse regression fits the measurement data well, and the same holds for more devices. The remaining problem: limiting the data range has (of course) a major influence, such that the calculated voltage varies over a range of nearly 1.0 V:

Single device complete data range

Now I need to know: what is really the most reliable voltage? The one from the whole data range, which results in an (A = 150, V(A = 150)) data pair that lies beside the measurement data? Or shall I limit the data range so that the calculated/reconstructed data pair moves closer and closer to the measurement data? Which is more authentic?

Ben
  • I am fairly certain that you have to provide a lot more information for someone to be able to help you with your question. I'd suggest you have a look [here](https://stackoverflow.com/a/5963610/4308815) at how to produce a minimal reproducible example. – Stefan Oct 26 '17 at 14:57
  • The residuals show a pronounced dependency structure. Based on the plot, I would say that your model does not adequately describe the data. – Roland Oct 27 '17 at 06:16
  • @Stefan Sure, I added it. Is the stuff sufficient? – Ben Oct 27 '17 at 06:37
  • @Roland I also think so. The problem probably arises at high x values and, correspondingly, high y values. Does anybody have an idea how I can find the right model, or rather the right formula? – Ben Oct 27 '17 at 06:38
  • @Ben: For a couple of serial numbers, check the scatter plot of response versus voltage. Do the functional relationships look like a polynomial of degree 3 (big doubts here!)? Or do you need more flexibility, e.g. by fitting natural cubic splines? – Michael M Oct 27 '17 at 07:01
  • @MichaelM I added a few plots of the whole data (natural and logarithmically transformed) and a single subset. Based on the logarithmic plot I thought I needed polynomials? The biggest problem probably occurs where the data points rise. That part is also the region I'm most interested in; in the end I only care about the region Voltage = {300, 450}. – Ben Oct 27 '17 at 07:09
  • Thanks, I will do this! If I understand your hint/question correctly: yes, there's a very strong and sudden change in the relationship between voltage and amplification, though technically only at high voltages. This behaviour is the reason why these devices exist (just FYI). – Ben Oct 27 '17 at 12:42
  • I just noticed that you regress `Amplification` on `Voltage`, and then you have to turn the whole thing around on its head to get `Voltage=f(Amplification=150)`. Why don't you directly regress `Voltage` on `Amplification`? The two regressions are **not** equivalent, and the second one seems much more suited to your goals. – DeltaIV Oct 29 '17 at 12:31
  • Because it's not possible with lmer. Or did I miss something? The other way around: as far as I understood, I can only predict y with lmer, but I need to predict x, which is not possible(?). – Ben Oct 30 '17 at 18:17
  • I'm not suggesting to regress $y$ on $x$ and then predict $x$ from $y$. I'm suggesting to regress $x$ directly on $y$. I also don't understand why you're so bent on using `lmer` - if it's not the right tool for your research, just drop it and backtrack. In research this happens all the time - don't fall prey to the sunk cost fallacy. I'll try to write an answer later this week. – DeltaIV Oct 31 '17 at 08:46
  • Sorry, I'm not really into these differences. How do I change the regression from y on x to x on y? I'm not bent on lmer (ok, a bit, since I invested so much time to understand and get it :) but my lecturer came up with a simpler idea: https://stackoverflow.com/questions/47021681/why-or-how-to-scale-the-data-in-this-way – Ben Oct 31 '17 at 09:03
  • But maybe lmer is more suitable, because the measurements have different uncertainties and the measurement procedure changed from time to time over the years. But I'm not sure whether I can somehow gain from that? In the end, it's better to have reliable/authentic regressions than ones that match the data perfectly. Is there something that can handle this (lmer or something else)? Or is it always just about getting the fit as close as possible to the data? – Ben Oct 31 '17 at 09:05
  • Ok, I'm now aware of the difference between regressing y on x and x on y. Now the question arises: how do I know which makes more sense? And isn't it also possible to make a regression of x,y on x,y? – Ben Oct 31 '17 at 13:07
  • @Ben I agree 100% with you that "it's better having reliable/authentic regressions than ones who match the data perfectly" (provided we define carefully what it means for a regression to be " reliable/authentic "). A regression which fits the data perfectly is often just overfitting to the noise in the measurement. However, since you explicitly require in your question that you "need a very, very high precision of a fit to gain the most authentic operating voltage for each device", I understood this as saying two things: 1) you have extremely accurate measurement data for each device... – DeltaIV Oct 31 '17 at 15:55
  • ...and thus you can build a **separate regression model** for each device. 2) you need extremely high accuracy in the fit. However, you then say that the " the measurements have different uncertainties and the measurement procedure changed from time to time over the years". Sure, if your measurements are not as accurate as I thought, then of course you cannot get an extremely accurate fit for each device separately, and pooling estimates from different devices (i.e., using `lmer`) may make sense. Frankly I'm getting confused at this point. 1) Do you still need an answer to this question... – DeltaIV Oct 31 '17 at 16:02
  • ...or should I read your newer question? 2) If you're still interested in this question, you may try to edit the question, shortening it and clarifying it. It's now very long, and not very clear (for me: other users may beg to differ). – DeltaIV Oct 31 '17 at 16:03
  • ps just one final comment (there are already too many, we risk being forcibly moved to chat): whether it makes more sense to regress $x$ on $y$ or $y$ on $x$ depends mostly on your research goals (which, if I understood them correctly, strongly suggest regressing $x$ on $y$), and also on the uncertainty with which $x$ and $y$ are measured. For reasons that are too long to explain, we usually regress the variable with more uncertainty on the variable with less uncertainty (ideally, no uncertainty). – DeltaIV Oct 31 '17 at 16:18
  • @DeltaIV Thanks for your detailed answers! Shall I answer here even though there are many comments now? I'll start with this: the background of the measurement procedure is very unclear, since an external lab did it. I only know that the lab changed its procedure (but not in which way). Hence, I have absolutely no knowledge about the uncertainties. Currently I tend to use the lmer model to calculate the operating voltage. But I still need to determine which model is best, since I know that the data range influences the calculated voltage. So how do I know which range to choose? – Ben Oct 31 '17 at 18:22
  • @DeltaIV Can you also please have a look at https://stats.stackexchange.com/questions/310989/how-to-model-repeated-measurements-for-two-different-scenarios Thank you in advance! – Ben Oct 31 '17 at 18:45
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/67978/discussion-between-deltaiv-and-ben). – DeltaIV Oct 31 '17 at 19:39
  • What does 'restricting the data range' mean? I could not find where you described this. Does it mean that you select a smaller range of voltages, or a smaller range of devices? Is the effect that you describe in your question reproducible with your attached data? – Sextus Empiricus Nov 16 '17 at 17:52
  • Yes, restricting means a smaller range of (Amplification, Voltage) pairs in the measurement series of each device. And yes, it is reproducible (the attachment contains the whole code, from the data up to extracting the coefficients to apply an inverse regression). – Ben Nov 17 '17 at 08:40
  • Your problem can be summarized in a single graph. The attached script in the zip file is 209 lines. Is there a smaller example? Or could you make one? – Sextus Empiricus Nov 22 '17 at 18:31
  • Most of it is the calculation for the inverse regression. I guess a better programmer can do this easily but, to be honest, I'm even glad I got it so far. – Ben Nov 22 '17 at 18:58

2 Answers


The functional relationship between Amplification and Voltage is very strong and perfectly monotonically increasing. Check the following grouped scatterplots, made by:

library(ggplot2)

degree <- 3  # polynomial degree used throughout

ggplot(APD, aes(x = Voltage, y = log(log(Amplification)))) +
  geom_point(size = 2, shape = 23) +
  facet_wrap(~ Serial_number) +
  geom_smooth(method = "lm", formula = y ~ poly(x, degree))

[Figure: per-device scatterplots of log(log(Amplification)) vs. Voltage, with individual degree-3 polynomial fits in blue]

The blue lines are the result of fitting an individual polynomial curve of degree 3 to each plot. Overall, they fit quite well. But e.g. at the right tails, some of the curves would extrapolate in the completely wrong direction. This is a general problem when fitting polynomials.

The mixed-effects model that you have specified will provide very similar results to the individual curves above. There, basically, an "average" polynomial is estimated from all serial numbers, and then the per-serial-number deviations from this average are represented by random effects. So let's replace the blue curves by the predictions of your model:

library(lme4)

fit <- lmer(log(log(Amplification)) ~ poly(Voltage, degree) +
              (poly(Voltage, degree) | Serial_number),
            data = APD)
# plot(fit) # producing your residual plot

APD_pred <- transform(APD,
                      Amplification = exp(exp(predict(fit, APD))))

ggplot(APD, aes(x=Voltage, y=log(log(Amplification)))) +
  geom_point(size=2, shape=23) + 
  geom_point(size=1, shape=22, data = APD_pred, col = "red") +
  facet_wrap(~Serial_number) 

[Figure: the same scatterplots with the mixed-model predictions overlaid in red]

The non-convergence warning is worrisome...

The overall impression: the fit is not bad, except at the boundaries, where the curvature is sometimes quite wrong. How does this square with your residual plot, which looks quite catastrophic? Check the scale of the vertical axis: the residuals are actually very close to 0. The "ugliness" thus comes from the fact that the functional relationships are so very strong; the chosen degree 3 polynomial is not flexible enough to capture them perfectly.

Now you basically have two action plans:

  1. Accept the compromise between choosing a simple model and not having a perfect fit.

  2. Try different models, e.g. using just one log instead of two, or fitting a random shift natural cubic spline (see the sketch below).
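For option 2, here is a minimal sketch of a random shift natural cubic spline, assuming the APD data from the question (df = 5 is an arbitrary placeholder):

library(lme4)
library(splines)  # ns() builds a natural cubic spline basis

# Common spline shape for all devices, plus a device-specific vertical
# shift ("random shift") via the random intercept.
fit_ns <- lmer(log(log(Amplification)) ~ ns(Voltage, df = 5) +
                 (1 | Serial_number), data = APD)

Unlike a global polynomial, a natural cubic spline is linear beyond its boundary knots, which tames the extrapolation behavior at the right tail.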

Michael M
  • If you are worried by 'weird' polynomial fits at the tails, maybe restricted cubic splines are the right way to go (or are those the same as the natural cubic splines suggested in the answer)? – IWS Oct 27 '17 at 09:33
  • Thanks a lot for the detailed answer! As the fit of the data is very important for me, I would invest some more research to find a proper model. Can you tell me how to specify a random shift natural cubic spline? I tried a bit with "bs(Voltage, x)" terms, x = {1, 2, 3}, in the random part, but that seems to be something different(?). – Ben Oct 27 '17 at 09:39
  • why not use `mgcv`, i.e., GAMs, instead of polynomials? – DeltaIV Oct 27 '17 at 09:56
  • @DeltaIV What's the advantage? Though I can add some gam plots. – Ben Oct 27 '17 at 09:59
  • @Ben the advantage is that you don't need to specify the functional form, but you learn it from data. I'm in a pinch and I can't write a proper answer this weekend, but basically I see two problems with your model: one is that you have serially correlated errors (and here GAMs can't help, but GAMs plus random effects can: `mgcv` allows for both). The other is that you're "forcing" a polynomial relationship between your variables, which is wrong, otherwise your residuals should show random scatter, over a distance longer than their average autocorrelation (see your first plot). Have a look... – DeltaIV Oct 27 '17 at 11:33
  • ...at [this](https://github.com/gavinsimpson/gams-yorku-canada-150/blob/master/slides.pdf) and note that you don't have to limit yourself to main effects. Even though strictly speaking GAMs should be additive, i.e., no interactions among covariates, in practice low-order interactions are easily included using `te()`. – DeltaIV Oct 27 '17 at 11:39
  • @DeltaIV Thanks for the input! The talk looks nice (well explained and at the right level for application :) ). I will work through it. But may I ask in advance what a gam model for my data could look like? Just so I have something to start from. – Ben Oct 27 '17 at 11:45
  • @DeltaIV Do I change the so-called "wiggliness" with the k argument in "s(Predictor, k = ?)"? And can you please tell me how I can measure/visualize the "wiggliness"? – Ben Oct 27 '17 at 12:37
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/67768/discussion-between-deltaiv-and-ben). – DeltaIV Oct 27 '17 at 12:47
  • I would like to try Michael's second suggestion: can you please tell me how to fit a random shift natural cubic spline (and what it is; is it something within lmer?)? – Ben Nov 02 '17 at 12:18

Short answer: when you reduce the size of the range for a linear/polynomial fit (which is only an approximation of the true relationship), you change the character of the fit from underfitting, to correct fitting, to overfitting.

[Figure: how the fit changes as the range changes, from underfitting to overfitting]
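To make this concrete, here is a self-contained toy simulation (all numbers are assumptions for illustration, not taken from the APD data): a curve with a vertical asymptote at 450 V is fitted with a cubic in log(log(A)), once on the whole voltage range and once only near A = 150, and the two implied voltages at A = 150 differ.

# Toy demonstration: the implied voltage at A = 150 depends on the range
# used for the fit (breakdown at 450 V and the noise level are assumed).
set.seed(1)
V <- seq(250, 448, by = 1)
A <- 1 / (1 - V / 450) * exp(rnorm(length(V), sd = 0.01))
d <- data.frame(V, A, y = log(log(A)))

fit_full   <- lm(y ~ poly(V, 3), data = d)                      # large range
fit_narrow <- lm(y ~ poly(V, 3), data = subset(d, A > 50 & A < 250))

target <- log(log(150))
v_at_150 <- function(fit, interval)
  uniroot(function(v) predict(fit, data.frame(V = v)) - target,
          interval, extendInt = "upX")$root

v_at_150(fit_full, c(250, 450))    # estimate from the large range
v_at_150(fit_narrow, c(441, 450))  # estimate from the restricted range

Both fits look fine on their own range, yet they disagree about V(A = 150); the large-range cubic is dominated by the many points far below the asymptote.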


Long answer:

Why does changing the range have an effect

  • The change of the range affects the prediction at A = 150 because the polynomial model does not accurately describe the data. This leads to large residuals (which are systematic errors, not noise) with unreliable behavior, and the range plays a role in this.

    • Your polynomial fit in the large range is bad. This leads to unreliable behavior.

      Especially since you use a linearized fit $$\log(\log(A)) \sim \mathrm{polynomial}(V)$$ the residuals at low amplification get increased weight.

      For instance, a difference between 1.001 and 1.0001 is tiny on a linear scale but a 33% difference on a log(log()) scale. A difference between 1000 and 1100 is large on a linear scale, but less than 1% on a log(log()) scale (see the quick numeric check after the figure below).

    • Your polynomial fit in the short range is very good.

      You could say that locally the relationship can be well approximated by a linear relationship or low order polynomial.

    • So the shift from the one range (large) to the other (small) basically changes, and improves, your estimate.

  • The image below demonstrates this effect better.

    • Note the difference between the green curve, which fits well, and the red curve, which fits badly. This difference is a bit less clear in your diagrams, which show only one single point and do not explain what is going on (namely, that the red polynomial fit is not a good one).

    • Note that the red curve actually fits reasonably in the lower amplification range; that is because of the effect explained above: the curve is linearized, which increases the weight of the residuals at low values. (Standard Excel has the same issue: it linearizes when performing an exponential curve fit, which does not always provide satisfying results.)

    • I have added an additional curve (the gray broken line), based on a differential equation, anticipating an answer to DeltaIV's question If in this problem I regress $x$ on $y$ instead of $y$ on $x$, do I need to use an error-in-variables model?. The function $A \sim \frac{1}{1-\left( \frac{V}{\theta_0} \right)^{\theta_1}}+\theta_2 +\epsilon$ mentioned in that question does not converge well, and trying to fit it anyway seems to be an ill-posed problem. What I found is that an alternative behaves much better, namely $\partial V = \partial A \left(a (A-1)^b + c (A-1)^d\right)^{-1}$, which fits quite well.

[Figure: example of different fits; the green curve fits well, the red polynomial fits badly, the gray broken line is the differential-equation model]
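A quick numeric check of the log(log()) reweighting claim from the first bullet above:

ll <- function(A) log(log(A))  # the transform used in the model

(ll(1.001) - ll(1.0001)) / abs(ll(1.001))  # about 0.33, i.e. 33%
(ll(1100) - ll(1000)) / ll(1000)           # about 0.007, i.e. less than 1%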

What is the best thing to do in this situation

  • Since you have very little noise and the curve is smooth, you can use your polynomial fit in the short range without much trouble. You could put in place some diagnostic checks to see whether the fits behave well and to improve robustness (e.g. check that there are no outliers that might incidentally mess up your results).
  • The polynomial fit is already such a control to improve robustness. Contrast this with an even simpler and still good model: calculate the voltage at A = 150 just from the line between the two data points around it (sketched after this list). This works very well with no errors, except if occasionally one of those points has an error.
  • If you had more noise, such that fitting only a few points in a small range created a large error, then you might want to use more of the surrounding data. However, this only helps if the model is good. You could say that using the larger data range to predict the value at a single point is effectively extrapolating, which requires you to be much more careful. The fit in the small range can be seen as interpolating.
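A minimal sketch of that two-point baseline, assuming a per-device data frame `dev` with columns Voltage and Amplification (the helper name is a placeholder):

# Linear interpolation between the two measurements that bracket the
# target amplification; approx() selects the bracketing pair itself.
v_at_150_interp <- function(dev, target = 150) {
  approx(x = dev$Amplification, y = dev$Voltage, xout = target)$y
}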

A note on the use of mixed effects.

  • I wonder whether your use of mixed effects is appropriate (not that it makes much difference in your case). Mixed effects means that your model assumes the effect of the different device serial numbers is a random effect that follows a normal distribution. This is not correct when the effect is not a normally distributed random error. For instance, if you have a few devices with errors, the mixed-effects model will predict these errors to be closer to the average than they really are. It is a less effective model for detecting differences between devices, because it maximizes a likelihood that assumes a normal distribution of the effects, so it treats effects near zero as likely and pushes the estimates towards an average.
  • The small R example below demonstrates this shrinkage effect of random effects.

    > # simple example to demonstrate what happens with random effects
    > library(lme4)
    > 
    > # data: three groups of two observations each
    > x <- c(-4, -1, 1, 4, 10, 10.2)
    > t <- as.factor(c(1, 1, 2, 2, 3, 3))
    > 
    > # random intercept model
    > fit <- lmer(x ~ 1 + (1 | t))
    > 
    > # coefficients tend to be closer to the mean coefficient 
    > coef(fit)
    $t
      (Intercept)
    1   -2.280899
    2    2.532367
    3    9.848532
    
    attr(,"class")
    [1] "coef.mer"
    > 
    > # group residuals do not sum up to zero
    > residuals(fit)[c(1,3,5)]+residuals(fit)[c(2,4,6)]
              1           3           5 
    -0.43820248 -0.06473446  0.50293694 
    >     
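    > # for comparison: the raw group means (no shrinkage) lie further
    > # from the overall mean 3.3667 than the coefficients above
    > tapply(x, t, mean)
       1    2    3 
    -2.5  2.5 10.1 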
    
Sextus Empiricus
  • Thanks a lot for the detailed and well explained answer/solution! I'll now try to put it into practice. For my understanding, because I got lost between all the models: when do I have to use a mixed model? I think I got your point about the errors, but I wonder about the following: "They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal study)" ( https://en.wikipedia.org/wiki/Mixed_model ). This description fits the situation perfectly, so why wasn't it as good as expected? And how do I recognize that in the future? – Ben Nov 25 '17 at 15:12