
Consistency is obviously a natural and important property of estimators, but are there situations where it may be better to use an inconsistent estimator rather than a consistent one?

More specifically, are there examples of an inconsistent estimator which outperforms a reasonable consistent estimator for all finite $n$ (with respect to some suitable loss function)?

MånsT
  • There is an interesting tradeoff in performance between consistency of model selection and parameter consistency in estimation problems using the lasso and its (many!) variants. This is detailed, e.g., in Bühlmann and van de Geer's recent text. – cardinal Jun 25 '12 at 18:46
  • Wouldn't the argument in my, now deleted, answer still hold? Namely: in small samples it is better to have an unbiased estimator with low variance. Or can one show that a consistent estimator always has lower variance than any other unbiased estimator? – Bob Jansen Jun 25 '12 at 18:53
  • Perhaps, @Bootvis! Do you have an example of an inconsistent estimator with low MSE? – MånsT Jun 25 '12 at 18:56
  • @Bootvis: If you happen to look at the extensive comments on an answer to a recent question asking about consistency vs. unbiasedness, you will see that a consistent estimator can have arbitrarily wild behavior of both the variance and bias (even, simultaneously!). That should remove all doubt regarding your comment. – cardinal Jun 25 '12 at 18:58
  • I thought I had it from one of two books, but apparently I was wrong about that too! The example is nowhere to be found. @cardinal: Sounds interesting, will check it out – Bob Jansen Jun 25 '12 at 18:58
  • MånsT: The answer might depend on what is meant by "reasonable". E.g., a multiple of the sample mean like $\left(1+10^6/\sqrt{n}\right)\bar{x}$ is a consistent estimator of the mean but is easily outperformed (under many loss functions) by inconsistent estimators! – whuber Jun 25 '12 at 19:01
  • The accompanying research might not be published just yet, but I remember seeing a talk by Clintin Davis-Stober about how, with sample sizes common in Psychological research (between 10 and 40, usually), randomly determined (and, thus, inconsistent) beta-weights in regression had lower MSE than OLS regression. Naturally, OLS did better when larger samples were used, but it still raised some interesting questions. – Jonathan Thiele Jun 29 '12 at 18:14
  • @JonathanThiele: It is not quite clear what your comment is saying. Clearly, the *in-sample* MSE cannot be lower for randomly drawn $\hat\beta$, essentially by definition of the optimization problem OLS is solving. If that evaluation is *out-of-sample*, there are a whole host of reasons why that might be true in some instances. Depending on what "randomly determined" means, it could be that the fixed zero vector gets lower MSE, too, for example. – cardinal Jun 29 '12 at 18:48
  • @cardinal: I'm pretty sure that it was in-sample MSE, because I do not remember any reference to a training group, jackknifing, or bootstrapping. As for the randomly sampled terms, I don't quite remember the distribution they were sampled from. I will admit that I may have mistaken a random-weights model for some other form of improper regression, but I do remember that the model being used was at least inconsistent and still outperformed OLS in a small sample with large (+/- .3 or more) correlation among the predictors. – Jonathan Thiele Jun 30 '12 at 15:32
  • Dana and Dawes (2004) showed that, in terms of out-of-sample prediction, correlation weights and unit weights outperform OLS regression weights unless sample size and the true population $R^2$ are both large. They did not explore random weights. Dana, J., & Dawes, R. M. (2004). The superiority of simple alternatives to regression for social science predictions. Journal of Educational and Behavioral Statistics, 29(3), 317-331. – Ed Rigdon Apr 22 '20 at 12:52
  • Great question (and at least one great answer)! Here is a closely related one: ["Are inconsistent estimators ever preferable? A twist"](https://stats.stackexchange.com/questions/464484/). – Richard Hardy May 04 '20 at 17:11

4 Answers


This answer describes a realistic problem where a natural consistent estimator is dominated (outperformed for all possible parameter values for all sample sizes) by an inconsistent estimator. It is motivated by the idea that consistency is best suited for quadratic losses, so using a loss departing strongly from that (such as an asymmetric loss) should render consistency almost useless in evaluating the performance of estimators.


Suppose your client wishes to estimate the mean of a variable (assumed to have a symmetric distribution) from an iid sample $(x_1, \ldots, x_n)$, but they are averse to either (a) underestimating it or (b) grossly overestimating it.

To see how this might work out, let us adopt a simple loss function, understanding that in practice the loss might differ from this one quantitatively (but not qualitatively). Choose units of measurement so that $1$ is the largest tolerable overestimate and set the loss of an estimate $t$ when the true mean is $\mu$ to equal $0$ whenever $\mu \le t\le \mu+1$ and equal to $1$ otherwise.

The calculations are particularly simple for a Normal family of distributions with mean $\mu$ and variance $\sigma^2 \gt 0$, for then the sample mean $\bar{x}=\frac{1}{n}\sum_i x_i$ has a Normal$(\mu, \sigma^2/n)$ distribution. The sample mean is a consistent estimator of $\mu$, as is well known (and obvious). Writing $\Phi$ for the standard normal CDF, the expected loss of the sample mean equals $1/2 + \Phi(-\sqrt{n}/\sigma)$: $1/2$ comes from the 50% chance that the sample mean will underestimate the true mean and $\Phi(-\sqrt{n}/\sigma)$ comes from the chance of overestimating the true mean by more than $1$.

[Figure: Losses]

The expected loss of $\bar{x}$ equals the blue area under this standard normal PDF. The red area gives the expected loss of the alternative estimator, below. They differ by replacing the solid blue area between $-\sqrt{n}/(2\sigma)$ and $0$ by the smaller solid red area between $\sqrt{n}/(2\sigma)$ and $\sqrt{n}/\sigma$. That difference grows as $n$ increases.

An alternative estimator given by $\bar{x}+1/2$ has an expected loss of $2\Phi(-\sqrt{n}/(2\sigma))$. The symmetry and unimodality of normal distributions imply its expected loss is always better than that of the sample mean. (This makes the sample mean inadmissible for this loss.) Indeed, the expected loss of the sample mean has a lower limit of $1/2$ whereas that of the alternative converges to $0$ as $n$ grows. However, the alternative clearly is inconsistent: as $n$ grows, it converges in probability to $\mu+1/2 \ne \mu$.

[Figure: Loss functions]

Blue dots show loss for $\bar{x}$ and red dots show loss for $\bar{x}+1/2$ as a function of sample size $n$.
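
To double-check the two closed-form expected losses above, here is a minimal Monte Carlo sketch (the values of $\mu$, $\sigma$, and the sample sizes are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, reps = 0.0, 1.0, 200_000

def loss(t, mu):
    # 0 whenever mu <= t <= mu + 1, and 1 otherwise (the loss defined above)
    return np.where((t >= mu) & (t <= mu + 1.0), 0.0, 1.0)

for n in [1, 4, 16, 64]:
    # draw directly from the sampling distribution of the mean: N(mu, sigma^2/n)
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)
    mc_mean = loss(xbar, mu).mean()        # consistent estimator: the sample mean
    mc_alt = loss(xbar + 0.5, mu).mean()   # inconsistent estimator: mean + 1/2
    th_mean = 0.5 + norm.cdf(-np.sqrt(n) / sigma)     # 1/2 + Phi(-sqrt(n)/sigma)
    th_alt = 2 * norm.cdf(-np.sqrt(n) / (2 * sigma))  # 2 Phi(-sqrt(n)/(2 sigma))
    print(f"n={n:3d}  mean: {mc_mean:.4f} (theory {th_mean:.4f})  "
          f"mean+1/2: {mc_alt:.4f} (theory {th_alt:.4f})")
```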

whuber
  • (+1) Your comment **"consistency is best suited for quadratic losses"** interests me also but it's not blatantly obvious to me (and perhaps others) where that comes from. Clearly convergence in $L_2$ is best suited for quadratic losses and $L_2$ convergence implies convergence in probability but what is the motivation for this quote in the context of almost sure convergence a.k.a. "strong consistency"? – Macro Jun 25 '12 at 20:45
  • @Macro The thinking is somewhat indirect and not intended to be rigorous but I believe it is natural: quadratic loss implies minimizing variance which (via Chebyshev) leads to convergence in probability. Whence, a heuristic for finding a counterexample should focus on losses which are so far from quadratic that such manipulations are unsuccessful. – whuber Jun 25 '12 at 21:48
  • The inconsistent estimator may be better for many values of n but I would think that it would not maintain an advantage over a consistent estimator for sufficiently large n. – Michael R. Chernick Jun 25 '12 at 22:21
  • I don't understand the basis of your comment, @Michael: look at the last graphic. The expected loss for the consistent estimator decreases to $1/2$ while that of the inconsistent estimator decreases (exponentially) to $0$: it is thus *exponentially* better than the consistent one as $n$ grows large. – whuber Jun 25 '12 at 23:47
  • @whuber My comment was general and not specifically applying to your example. The idea is that if the consistent estimate has variance going to 0 the bias from the inconsistent estimator will eventually be larger. – Michael R. Chernick Jun 26 '12 at 01:07
  • @Michael OK, thank you for explaining that. In this context, with a non-quadratic loss, an "advantage" is not expressed in terms of bias. One might criticize this loss function, but I don't want to reject it outright: it models situations where, for instance, the data are measurements of an item manufactured to certain tolerances and it would be disastrous (as in Shuttle o-ring failure or business bankruptcy disastrous) for the true mean to fall outside those tolerances. – whuber Jun 26 '12 at 02:24
  • (+1) Great answer, @whuber! I particularly like that it doesn't feel too pathological - I can think of many situations where this type of loss would be applicable. – MånsT Jun 26 '12 at 07:06
  • Great answer! Here is a closely related question: ["Are inconsistent estimators ever preferable? A twist"](https://stats.stackexchange.com/questions/464484/). It would be very interesting to see your take on it, @whuber. – Richard Hardy May 04 '20 at 17:12

Here is a very real situation where an inconsistent estimator is preferable due to constraints on sampling.

I point to a variation of 'importance sampling' from sampling theory that would most likely constitute an inconsistent but improved estimator of the mean: the correct weighting of the oversampled class is not known (or is itself the subject of investigation), so 'the best available estimate' is used in its place.

For example, take a poor country where a large percentage of the population does not have bank accounts, and assume you were given access to spending data for those with accounts in order to develop figures for the nation as a whole. This would represent the country's actual spending pattern fairly closely, but because of unreported cash income and differing spending among those without bank accounts, it is not expected to be fully 'consistent' with the country's actual total domestic spending.

The large weight of those with bank accounts still makes this estimator superior, albeit distorted, compared with the sampling variance expected under a simple random sampling scheme. Note that no matter how precisely one gathers samples within the 'importance sampling' stratum alone (so that, mathematically, the estimate converges in probability to that class's true value), it remains an inconsistent estimator for the parent population: the limitation on out-of-class sampling means it cannot converge in probability to the parent population's mean.
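
Here is a minimal simulation sketch of this point; the population mixture, the spending levels, and the 80% banked share are all invented for illustration:

```python
import numpy as np

# Invented population: 80% banked; the unbanked spend less on average.
rng = np.random.default_rng(1)
N = 1_000_000
banked = rng.random(N) < 0.8
spending = np.where(banked,
                    rng.normal(100, 30, N),  # banked: higher mean spending
                    rng.normal(60, 20, N))   # unbanked: lower mean spending
true_mean = spending.mean()

# Sampling only within the banked stratum: the estimate becomes ever more
# precise as n grows, but it converges to the banked-stratum mean, not to
# the population mean -- i.e., it is inconsistent for true_mean.
banked_spending = spending[banked]
for n in [100, 10_000, len(banked_spending)]:
    sample = rng.choice(banked_spending, size=n, replace=False)
    print(f"n={n:>7,}: banked-only estimate {sample.mean():6.2f} "
          f"vs population mean {true_mean:6.2f}")
```

The within-stratum precision improves with $n$, but the gap to the population mean never closes.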

AJKOER

I can't comment, so I will add this as an answer. whuber's answer shows that one specific inconsistent estimator can be better than another specific consistent estimator. Since the question was "are there examples of an inconsistent estimator which outperforms a reasonable consistent estimator for all finite $n$", his answer is of course fine.

However, this answer may give readers the impression that one needs to use an inconsistent estimator, and this is clearly not the case here.

For instance, in whuber's setting we can take the estimator to be the upper end of a one-sided confidence interval, which underestimates the true mean only with a chosen (small) probability and thus will be superior to the sample mean itself. This estimator is still consistent, since the upper end of the confidence interval converges to the true $\mu$ as the sample size increases.
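
Under the loss from whuber's answer with known $\sigma$, the expected loss of this upper confidence bound $\bar{x} + z_{1-\alpha}\,\sigma/\sqrt{n}$ has a closed form; here is a short sketch comparing the three estimators ($\alpha = 0.05$ and $\sigma = 1$ are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

sigma, alpha = 1.0, 0.05
z = norm.ppf(1 - alpha)  # z_{1-alpha}, about 1.645

for n in [1, 4, 16, 64, 256]:
    s = np.sqrt(n) / sigma
    mean_loss = 0.5 + norm.cdf(-s)       # sample mean (consistent)
    shift_loss = 2 * norm.cdf(-s / 2)    # mean + 1/2 (inconsistent)
    ucb_loss = alpha + norm.cdf(z - s)   # mean + z*sigma/sqrt(n) (consistent)
    print(f"n={n:4d}  mean: {mean_loss:.4f}  mean+1/2: {shift_loss:.4f}  "
          f"upper bound: {ucb_loss:.4f}")
```

Note that the upper bound's expected loss settles at $\alpha$ rather than tending to $0$, so it does not dominate $\bar{x}+1/2$; the point is only that a consistent estimator can also perform well under this loss for moderate to large $n$.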


> More specifically, are there examples of an inconsistent estimator which outperforms a reasonable consistent estimator for all finite $n$ (with respect to some suitable loss function)?

Yes, there are, and they are probably simpler and more common than you think. Moreover, complex or unusual loss functions are not needed for this; the usual MSE is enough.

The crucial concept here is the bias-variance trade-off. Even in the simple linear model setting, a wrong/misspecified model, which yields biased and inconsistent estimators of the parameters and of the entire regression function, can be better than the correct one if our goal is prediction. And prediction is very relevant in the real world.
The example is simple. You can think of a true model like this:

$y = \beta_1 x_1 + \beta_2 x_2 + \epsilon$

You can estimate several linear regressions; a short one like this:

$y = \theta_1 x_1 + u$

or a longer one that can also represent the empirical counterpart of the true model. Now, the short regression is wrong (it involves inconsistent and biased estimates of the parameters and of the regression function), yet it is not certain that the longer (consistent) one is better for prediction (MSE loss). Note that this story holds precisely in the finite-sample setting, as you requested, not asymptotically.
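
Here is a minimal simulation sketch of this point; all the numbers ($n$, the $\beta$s, the regressor correlation, the noise level) are invented for illustration, chosen so that $\beta_2$ is small, the regressors are highly correlated, and $n$ is small:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta1, beta2, rho, sims = 10, 1.0, 0.2, 0.9, 2_000
cov = np.array([[1.0, rho], [rho, 1.0]])  # correlated regressors

mse_short, mse_long = 0.0, 0.0
for _ in range(sims):
    X = rng.multivariate_normal([0, 0], cov, size=n)       # training sample
    y = beta1 * X[:, 0] + beta2 * X[:, 1] + rng.normal(0, 1, n)
    Xt = rng.multivariate_normal([0, 0], cov, size=1_000)  # test sample
    yt = beta1 * Xt[:, 0] + beta2 * Xt[:, 1] + rng.normal(0, 1, 1_000)
    th, *_ = np.linalg.lstsq(X[:, :1], y, rcond=None)      # short model: x1 only
    bh, *_ = np.linalg.lstsq(X, y, rcond=None)             # true-form model: x1, x2
    mse_short += np.mean((yt - Xt[:, :1] @ th) ** 2) / sims
    mse_long += np.mean((yt - Xt @ bh) ** 2) / sims

print(f"short model MSE: {mse_short:.4f}   true-form model MSE: {mse_long:.4f}")
```

With these settings the short model typically attains the lower out-of-sample MSE despite its inconsistent slope; increasing $n$ or $\beta_2$ reverses the ranking.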

My point is clearly and exhaustively explained in: Shmueli (2010), "To Explain or to Predict?", Statistical Science, 25(3), 289–310.

EDIT. For clarification, I add something that I hope will be useful to readers. As in the article cited, I use the concept of bias in a quite general way. It can be applied against both unbiasedness and consistency; these two things differ, but the story above holds in both cases. From now on I speak of bias, and we can read it against consistency as well (so biased estimators = inconsistent estimators). The concept of bias usually refers to parameters (see Wikipedia: https://en.wikipedia.org/wiki/Consistent_estimator#Bias_versus_consistency and https://en.wikipedia.org/wiki/Bias_of_an_estimator), but it is possible to use it more generally. Suffice it to say that not all estimated statistical models (say $f$) are parametric, but all of them can be biased relative to the true model (say $F$). In this way we perhaps conflate consistency and misspecification problems, but to my knowledge these can be viewed as two faces of the same coin.

Now, the short estimated model (OLS regression) above, $f_{short}$, is biased relative to the true model $F$. Alternatively, we can estimate another regression, say $f_{long}$, where all the correct explanatory variables are included and potentially others are added; then $f_{long}$ is a consistent estimator of $F$. If we estimate $f_{true}$, where all and only the correct explanatory variables are included, we are in the best case, or at least so it seems. This is often the paradigm in econometrics, the field I am most familiar with. However, Shmueli (2010) points out that explanation (causal inference) and prediction are different goals, even though they are often erroneously conflated. In fact, at least when $n$ is finite (as it always is in practice), $f_{short}$ can be better than $f_{true}$ if our goal is prediction. I cannot give you an actual example here; the favourable conditions are listed in the article and also in this related and interesting question (Paradox in model selection (AIC, BIC, to explain or to predict?)); they come from an example like the one above.

Let me note that, until a few years ago, this fact (the bias-variance story) was strongly undervalued in the econometrics literature, but this is not the case in the machine learning literature. For example, LASSO and ridge estimators, absent from many general econometrics textbooks but standard in machine learning ones, make sense primarily because the story above holds. We can also take the parameter perspective. In the example above, $\theta_1$ comes from the short regression and, apart from a few special cases, is biased relative to $\beta_1$. This follows from the omitted-variable-bias story, a classic argument in any econometrics textbook (see the formula below). Now, if we are interested precisely in the $\beta$s, this problem must be resolved, but for prediction goals it need not be. In the latter case, $f_{short}$, and therefore $\theta_1$, can be better than the consistent estimators $f_{true}$ and its parameters.
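
For concreteness, the textbook omitted-variable-bias result for the short regression under the true model above is

$\hat{\theta}_1 \overset{p}{\longrightarrow} \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)},$

so $\hat{\theta}_1$ is inconsistent for $\beta_1$ whenever $\beta_2 \neq 0$ and $x_1$ and $x_2$ are correlated.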

Now we have to face a thorny question. Consistency is an asymptotic property; however, this does not mean that we can speak about consistency only in the theoretical case where $n = \infty$. Consistency, in any form, is useful in practice only because, if $n$ is large, we can say that the property approximately holds. Unfortunately, in most cases we do not have a precise threshold for $n$, though sometimes we have an idea. Frequently, consistency is simply viewed as a weaker condition than unbiasedness, because in many practical cases unbiased estimators are also consistent. In practice we can often speak about consistency but not unbiasedness, because the former can hold while the latter surely does not; in econometrics it is almost always so. However, even in these cases, it is absolutely not true that the bias-variance trade-off, in the sense above, disappears. Ideas like that lead precisely to the dramatic errors that Shmueli (2010) underscores. We have to remember that $n$ can be large enough for some things and not for others, even within the same model; usually we know nothing precise about that.

Last point. The bias-variance story, with respect to the usual MSE loss, can also be read in another direction, one focused entirely on parameter estimation. Any estimator has a mean and a variance. Now, if an estimator is biased but also has lower variance than a competitor that is unbiased and/or consistent, it is not obvious which is better. There is exactly a bias-variance trade-off, as explained in: Murphy (2012), Machine Learning: A Probabilistic Perspective, p. 202.

markowitz
  • You have an interesting point--but its relevance to the concept of "consistent estimator" in the question is doubtful. The issue with the "short" model is one of model mis-specification, not inconsistency. Indeed, *any* property of an estimator that holds in the finite-sample case cannot possibly be relevant to consistency, which is [by definition](https://en.wikipedia.org/wiki/Consistent_estimator) an *asymptotic* property. – whuber Apr 22 '20 at 15:35
  • I see your point. However, the concepts of consistency and misspecification are related. In the example above, $\theta_1$ is both biased and inconsistent compared to $\beta_1$, and the short model is a biased and inconsistent estimated function compared to the true model. That the advantage of the wrong model does not hold asymptotically is something I say above. However, in practice, the so-called asymptotic properties are always used in finite-sample situations, and in most cases it is not clear how large $n$ must be for their correct application. – markowitz Apr 22 '20 at 16:06
  • After all, MånsT himself speaks about any finite $n$, even though the question is about consistency. – markowitz Apr 22 '20 at 16:06
  • You appear to be using terms like "biased" and "inconsistent" in senses that are not implied by the question. It's not our fault that these terms can have many meanings, but it is incumbent upon us to use them in clear and compatible ways, lest readers draw the wrong conclusions from your post. – whuber Apr 22 '20 at 16:16
  • You are right, statistical terminology is not always clear when decontextualized. However, MånsT's question is quite general. I don't know if my reply meets his interest, but my explanation seems clear enough to me. I also indicated the reference in order to rule out any misunderstanding. – markowitz Apr 22 '20 at 16:43
  • The question is about "consistent estimators." I supplied a link to the Wikipedia article on this concept. If you don't find that authoritative, please consult a textbook on estimation such as Lehmann & Casella. – whuber Apr 22 '20 at 17:00
  • Thanks for the suggestion. However, I already know something about the meaning of consistent estimators and about the possibility of speaking of several kinds of consistency. Maybe you are worried that I focus on prediction more than on parameters. If so, I understand your point; however, the two measures are related (see $\theta_1$ above). – markowitz Apr 22 '20 at 17:33
  • I'm sorry--I did not intend to insinuate you did not know this material. I was making the point that the intended meaning of "consistent estimator" in this question is that of the references I gave, and that meaning differs from the one you implicitly adopt in your answer. – whuber Apr 22 '20 at 19:06
  • @markowitz: it would be helpful if you either adopted the terminology used in my question (consistent with that of e.g. Lehmann&Casella) or clarified what you mean when you claim that the estimator in the short model is biased and inconsistent. An actual example where the short model outperforms the long one would also be a great addition to your answer. – MånsT Apr 23 '20 at 07:38
  • For clarification I have added something that, I hope, can be useful to readers. – markowitz Apr 25 '20 at 15:06
  • @MånsT and markowitz, perhaps the notions of consistency found in my post ["T-consistency vs. P-consistency"](https://stats.stackexchange.com/questions/265739/) will be helpful. – Richard Hardy May 04 '20 at 14:56
  • An interesting answer (still thinking it through)! Here is a closely related question: ["Are inconsistent estimators ever preferable? A twist"](https://stats.stackexchange.com/questions/464484/). – Richard Hardy May 04 '20 at 17:14