Consider the basic linear regression model $y = A\theta + \varepsilon$, with measurements $y \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n\times k}$, noise $\varepsilon \in \mathbb{R}^n$, and parameters $\theta \in \mathbb{R}^k$ to be estimated. In my case, $\theta$ are physically meaningful parameters that I would like to estimate for further interpretation.
A situation I encounter frequently is that $A$ and $y$ contain time series from different measurement modalities that are rather repetitive, with rare "interesting" anomalies occurring from time to time. Basic linear regression weights all measurements equally, leading to a model that explains the frequent, standard measurements well but may fail during unusual events.
Consequently, I thought it natural to perform a weighted regression in which measurements are weighted by the inverse of their likelihood of occurrence, so that unusual measurements count more than under basic (unweighted) regression, hopefully yielding a model that is more generally applicable. More precisely, I would like to assign a weight $w_i = f(p(y_i, a_i \mid y, A))$ to each measurement, where $p(y_i, a_i \mid y, A)$ denotes the likelihood of a new measurement taking the values $y_i$ and $a_i$ after having observed all the data in $y$ and $A$, and $f(x)$ is a positive, monotonically decreasing function for $x > 0$, e.g. $f(x) = 1/x$ (a rough sketch of what I mean is included at the end of the post). However, I have been unable to find any sources doing something like this, which made me wonder: is this ...
- a standard thing that people do and I'm simply unable to recognize it or find the right terms to search for it,
- completely unreasonable for some reason, or
- reasonable but nobody has done it (seems unlikely)?
Disclaimer: I'm a mathematician / engineer / computer scientist and mostly self-taught statistician, so please bear with me if this is completely obvious...
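For concreteness, here is a minimal sketch of the kind of scheme I have in mind, using a kernel density estimate as a stand-in for $p(y_i, a_i \mid y, A)$ and $f(x) = 1/x$; the toy data and the choice of KDE are only for illustration, not part of my actual setup:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Toy data: mostly repetitive measurements plus a few "anomalous" rows.
n, k = 500, 2
A = rng.normal(size=(n, k))
A[-10:] += 5.0                      # rare, unusual operating points
theta_true = np.array([1.0, -2.0])
y = A @ theta_true + 0.1 * rng.normal(size=n)

# Estimate the joint density of each row (y_i, a_i) with a KDE,
# as a stand-in for the predictive likelihood p(y_i, a_i | y, A).
Z = np.column_stack([y, A])          # row i is (y_i, a_i)
density = gaussian_kde(Z.T)(Z.T)     # density evaluated at every observed row

# Weight each row by f(p) = 1/p, so rare measurements count more.
w = 1.0 / density

# Weighted least squares: minimize sum_i w_i * (y_i - a_i^T theta)^2,
# solved as an ordinary least-squares problem on sqrt(w)-scaled rows.
w_sqrt = np.sqrt(w)
theta_wls, *_ = np.linalg.lstsq(A * w_sqrt[:, None], y * w_sqrt, rcond=None)

# Ordinary (unweighted) least squares for comparison.
theta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

print("OLS:     ", theta_ols)
print("Weighted:", theta_wls)
```

The square-root-of-weights step is just the usual trick of turning a weighted least-squares problem into an unweighted one; the part I'm asking about is the choice of weights as the inverse of an estimated likelihood of occurrence.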