5

Background

Errors-in-variables models are defined as:

regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

Whether data is experimental or computational (e.g. via physics-based simulation), uncertainty in the independent variables is common. This is particularly relevant in materials science where uncertainty exists in the processing, structure, and property measurements. Additionally, this uncertainty is not necessarily constant for all parameters (a temperature sensor might have higher uncertainty at the limits of the temperature range, various aspects of a material's structure may have been determined by separate instruments, property measurements are data-mined from several sources, etc.). See also Uncertainty Quantification. I'm sure this is relevant for other disciplines too (e.g. physics, chemistry, biology, engineering).

Existing Methods

What I'm not looking for

What I am looking for

A method that caters to multidimensional, non-parametric regression with propagated measurement uncertainty in predictors and responses (i.e. uncertainty propagation, not just weighting the points) and preferably software that goes along with it (Mathematica, MATLAB, Python, R, Stan, etc.). Multidimensional refers to the predictors (i.e. responses are scalar).

Ideally, it would be something that is explained and can be performed via:

mdl = some_function(X, Xsd, y, ysd)
ypred, ysd2, cov = mdl(X2, Xsd2)

Inputs

  1. Matrix/Array of predictors (X $X$)
  2. Matrix/Array of predictor uncertainties (Xsd $\sigma_X$)
  3. Vector of response values (y $y$)
  4. Vector of response uncertainties (ysd $\sigma$)
  5. Matrix/Array of new predictors (X2 $X_2$)
  6. (Optional) Matrix/Array of new predictor uncertainties (Xsd2 $\sigma_{X2}$)

Outputs

  1. Vector of new responses (ypred $y_{pred}$)
  2. Vector of new response uncertainties (ysd2 $\sigma_2$)
  3. (Optional) predictor covariance matrix (cov $\Sigma_2$)
Assume uncertainty means standard deviation of a normal distribution. I'm fine with starting out with simpler assumptions before moving onto e.g. other distributions.

Related

Sterling
  • 59
  • 6
  • 1
    Questions that are only about software (e.g. error messages, code or packages, etc.) are generally off topic here. If you have a substantive machine learning or statistical question, please edit to clarify. – gung - Reinstate Monica Mar 29 '21 at 17:34
  • 2
    @gung-ReinstateMonica I've updated the question to focus on the method rather than solely on a software implementation. Thanks for pointing out. – Sterling Mar 29 '21 at 18:20
  • Thanks, this seems sufficiently on-topic to me now (+1). I'm not an expert in EiV models by any stretch, but while there may be canned routines, my understanding is that the most general & preferred method is to use [tag:sem]. – gung - Reinstate Monica Mar 29 '21 at 18:26
  • 1
    Ok, great. Thanks for the patience. Interesting, I hadn't heard of sem before. I'll look into it! – Sterling Mar 29 '21 at 18:32
  • 1
    In addition to Bayesian modeling which would be a very flexible option, the closest to a canned solution is the SIMEX algorithm (https://sci-hub.se/10.1080%2F01621459.1994.10476871) implemented in the simex package in R (https://cran.r-project.org/web/packages/simex/index.html). More broadly, the book you want to start with is: https://drcarroll.wpengine.com/eiv.SecondEdition/. SEM is unlikely to help here, it is useful with multiple uncertain measures of an unknown predictor, e.g. unknown predictor A w/ multiple measures A1, A2, A3, ... unknown predictor B w/ multiple measures B1, B2, B3, ... – Heteroskedastic Jim Mar 29 '21 at 18:53
  • 1
    I disagree w/ @HeteroskedasticJim, [SEM](https://en.wikipedia.org/wiki/Structural_equation_modeling) is very much not limited to estimating latent variables measured by manifest variables, although that is one of the most common applications. Setting that aside, the Wikipedia page notes that, "Usually measurement error models are described using the latent variables approach". – gung - Reinstate Monica Mar 29 '21 at 19:44
  • 1
    @gung I disagree. SEM is mostly based on a series of equations called the LISREL model. There is no machinery in the these models for including known error variance of a predictor when modeling. As someone who studies SEM, I can say with confidence that most SEM researchers would not know how to deal with OP's problem. I can also say: the only approach to dealing with measurement error within SEM software is the repeated indicators latent variable approach. Other methods that may be included in SEM e.g. adjustments for attenuation either predate SEM, or come from outside, e.g. GLLAMM software – Heteroskedastic Jim Mar 29 '21 at 21:01
  • 1
    Just to be specific, I mean no machinery for known varying error variance in a predictor. – Heteroskedastic Jim Mar 29 '21 at 21:21
  • This question is very broad. The answer 'use `rstan`' seems to be what this question asks for. The particular implementation goes along with a large amount of literature explaining all sorts of things. How can the answer be improved (more specific)? – Sextus Empiricus Mar 30 '21 at 06:45
  • Also, the particular application is not so clear. If this is a materials science problem with a costly computation, then sampling the posterior might not be efficient and often a simpler monte-carlo similation is done to estimate/visualize the uncertainty. The [ensemble forecasting](https://en.m.wikipedia.org/wiki/Ensemble_forecasting) used in meteorology is an example. In that case one may have an additional parameters that generates uncertainty which is the grid size of the model/geometry (I imagine a similar problem occurs in materials science). – Sextus Empiricus Mar 30 '21 at 06:59
  • @Sterling what is the difference between X and X2? Is X2 the same as X but with new observations? Why does X2 have no asociated measurement error? Are these well measured values? – Heteroskedastic Jim Mar 30 '21 at 14:19
  • @SextusEmpiricus @Björn's answer was very interesting to get me looking in some direction. It could be improved with a snippet of Stan code that's a (tested) combination of code from [Regression with Measurement Error](https://mc-stan.org/docs/2_22/stan-users-guide/bayesian-measurement-error-model.html) and [GP with a normal outcome](https://mc-stan.org/docs/2_22/stan-users-guide/fit-gp-section.html) in the `Stan` docs (or `brms` code or equivalent that achieves the same). This is as far as I am right now. – Sterling Mar 30 '21 at 21:09
  • @SextusEmpiricus the ensemble forecasting is an interesting read. I've considered using slice samplers before. With a Monte Carlo scheme, the predictor uncertainty ($\sigma_X$) could be propagated to output uncertainty ($\sigma_2$), but the predictions ($y_{pred}$) would be unaffected, correct? I've also considered using neural networks (see a couple added links at end of question). I'd also like to point out that I'm not married to the idea of using `Stan`, but it's certainly interesting. – Sterling Mar 30 '21 at 21:32
  • @HeteroskedasticJim great question. An unstated assumption I realized I was operating on is that I/we are often looking for a surrogate model, in which case $X_2$ may be synthetic. I certainly wouldn't consider it to be a bad thing if there was a $\sigma_{X2}$ (hence adding that as optional). I could imagine using that if someone characterized the predictors ($X_2$ and $\sigma_{X2}$) and wanted to figure out if it was worth doing an expensive test to get the property measurement, or seeing what kind of experimental uncertainty they'd need to achieve to get a tighter bound on $y$. – Sterling Mar 30 '21 at 21:53
  • In https://github.com/facebook/Ax/issues/751, someone [mentioned](https://github.com/facebook/Ax/issues/751#issuecomment-990382931) https://botorch.org/tutorials/risk_averse_bo_with_input_perturbations, which I think is pretty relevant. At some point, I may give this a better look and try to come up with a generic example that matches this problem. For those unfamiliar with [Ax](https://ax.dev/), it is a python-based experimental adaptive design platform (from Facebook) that uses BoTorch as a backend (I am not affiliated with either). I recommend it. – Sterling Dec 09 '21 at 23:14

1 Answers1

7

One of the more interesting choices in R is rstan, where you could code this up yourself in the Stan modeling language (which tends to be amazing in that it can produce inference for models that we used to be unable to do for a long time). However, getting started can be a little challenging and it sounds like you'd like a higher level interface.

That could be the brms package, that uses the R model syntax and in the background generates Stan code. Via that detour (generate the Stan code via brms in R and then use the generated Stan code with pystan - or any of the other Stan-tie-ins in other languages such as MathematicaStan or MatlabStan - you can then also use it in Stan). In brms there's the me() function for predictors with measurement errors and a quite nice range of modeling options for non-linear models (it depends a bit of what exactly you are thinking about whether that's covered) and it also supports Gaussian processes, but I'm not sure from what you described to what extent you can fit it all together the way you want (if not you may be best off looking at the - usually quite well-chosen / efficient - Stan code that brms generates and having to fit it together exactly the way you want manually in Stan).

Björn
  • 21,227
  • 2
  • 26
  • 65
  • 2
    Nice, I think I had come across Stan before but didn't realize it might have a straightforward implementation for this problem. A relevant section from docs seems to be [Measurement Error and Meta-Analysis](https://mc-stan.org/docs/2_26/stan-users-guide/measurement-error-and-meta-analysis.html) which gives an example for [linear regression](https://mc-stan.org/docs/2_26/stan-users-guide/bayesian-measurement-error-model.html). – Sterling Mar 26 '21 at 22:27
  • I agree that it's not obvious how hard it would be to fit everything together in Stan. Something a bit closer than the linear regression tutorial is [Predator-Prey Population Dynamics: the Lotka-Volterra model in Stan](https://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html) from [Stan Case Studies](https://mc-stan.org/users/documentation/case-studies.html) which uses parametric differential equations and incorporates measurement error. – Sterling Mar 26 '21 at 22:36
  • I'd try brms first, if that does not get you there, then going to coding a model up yourself in Stan is a serious option, but especially when things get complicated, it can take some time (e.g. finding the right parameterization can be quite important). – Björn Mar 26 '21 at 22:41
  • I have no practical experience with R (lots with MATLAB/Mathematica, some with Python), so it's been a bit rough. I [posted](https://discourse.mc-stan.org/t/non-parametric-multidimensional-regression-with-measurement-error-in-predictors-and-responses/21528) on the stan discourse page. Giving it a shot though – Sterling Mar 27 '21 at 09:25
  • Unfortunately, still haven't quite gotten this to work.. – Sterling May 12 '21 at 22:21