
I have some data where I want to determine whether the shape of the probability distribution has changed compared to 10 years ago.

One example is that I have for various automobiles multiple measures of price at a given point in time from different used car dealers and online sellers.

make, model, year of manufacture, price

(there are further complications, such as condition, but I've simplified the problem somewhat).

The data contain a number of outliers. Is it possible to apply a robust transformation (to clarify, by robust I mean resistant to outliers) that preserves the original distribution of values but allows probability distributions with different means and SDs to be compared on a similar or identical scale, such as $[0,1]$?

Also, suppose I had a hypothesis that the distribution of prices had a different shape in the past. Is there any sensible way to combine the data from different makes and models so as to get a more accurate estimate of the shape of the distribution function, or is this completely nonsensical?

I can of course individually compare the same make, model, and year of manufacture, but I'm looking at a large number of comparisons.

kjetil b halvorsen
Antonio2100

3 Answers


Of course, it is possible to rescale your multivariate data so as to express it in some standardized basis. It is also possible to do that in a way that preserves the distribution of your data (i.e., the correlations between your original variables). Finally, it is also possible to do this standardization in a way that is not influenced by the outliers. As an added benefit, the standardization will also reveal potential outliers as standing out from the main body of your data.

Denote by $\pmb X$ your $n$ by $p$ data matrix. Let $(\hat{\pmb\mu},\hat{\pmb\varSigma})$ be, respectively, a robust estimate of the location and scatter of $\pmb X$. You can then simply transform your data as

$$\pmb Z=(\pmb X-\pmb 1_n\hat{\pmb\mu}')\hat{\pmb\varSigma}^{-1/2}$$

where $\pmb 1_n$ is an $n$-vector of ones, $\hat{\pmb\mu}'$ is the transpose of $\hat{\pmb\mu}$, and $\pmb D^{-1/2}$ denotes the inverse matrix square root of a matrix $\pmb D$ (here $\pmb D=\hat{\pmb\varSigma}$). This transformation preserves the distribution of your data, standardizes it, and makes the outliers stand out.

To compute $(\hat{\pmb\mu},\hat{\pmb\varSigma})$ there are several algorithms, but they all do essentially the same thing. The simplest and oldest is FastMCD. You will find a good open-source R implementation of it (as well as of some competitors) here. Make sure to read the vignette: it's very well done.
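A hedged sketch of this robust standardization in Python, with synthetic data and scikit-learn's MinCovDet (a FastMCD implementation) standing in for the R package mentioned above:

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
# Synthetic stand-in for the n x p data matrix X, with a few gross outliers
X = rng.multivariate_normal([10, 20], [[4, 3], [3, 9]], size=200)
X[:5] += 40  # contaminate five rows

# Robust location / scatter via FastMCD (scikit-learn's implementation)
mcd = MinCovDet(random_state=0).fit(X)
mu_hat, sigma_hat = mcd.location_, mcd.covariance_

# Inverse matrix square root of sigma_hat via its eigendecomposition
w, V = np.linalg.eigh(sigma_hat)
sigma_inv_sqrt = (V * w ** -0.5) @ V.T

# Robustly standardized data: the bulk is roughly spherical around 0,
# while the contaminated rows stand far out
Z = (X - mu_hat) @ sigma_inv_sqrt
```

Because the location and scatter estimates ignore the contaminated rows, the outliers are not "absorbed" into the standardization the way they would be with the sample mean and covariance.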

You can of course also use $(\hat{\pmb\mu},\hat{\pmb\varSigma})$ to derive a measure of outlyingness in $[0,1]$ associated with each observation (telling you how far that observation departs, in the multivariate space, from the bulk of the data). Such an outlyingness index can be computed as:

$$\text{Out}(\pmb x_i)=\left(1+(\pmb x_i-\hat{\pmb\mu})'\hat{\pmb\varSigma}^{-1}(\pmb x_i-\hat{\pmb\mu})\right)^{-1}$$

where $\pmb x_i$ is the $i$-th observation. You can use this index as a form of center-outward ranking of your data. Of course, this measure will also not be influenced by the outliers, so you can also use $\text{Out}(\pmb x_i)$ to reveal them (the outliers will stand out by having very low values of $\text{Out}(\pmb x_i)$).
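As an illustration (again with synthetic data and scikit-learn's MinCovDet standing in for the R implementation), the index is a simple transform of the robust squared Mahalanobis distances:

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 2]], size=100)
X[0] = [15, -15]  # plant one gross outlier

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)   # (x_i - mu)' Sigma^{-1} (x_i - mu), robust estimates
out = 1.0 / (1.0 + d2)    # Out(x_i), in (0, 1]; outliers get values near 0
```

Sorting by `out` gives the center-outward ranking described above; the planted outlier ends up with by far the smallest value.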

user603
  • Thanks for your comment. Actually I was originally thinking about the univariate case, but I should research and think some more about the multivariate approach. By the way, what is the matrix $D$? I've probably overlooked some key point. Thanks. – Antonio2100 Aug 22 '13 at 10:08
  • Sure. Just know that the 'one variable at a time' approach changes the correlation between the variables. You can use bivariate plots to convince yourself of this. If the correlation structure is meaningful to you (presumably it is, otherwise why collect all these variables?) I wouldn't standardize each variable independently. If $D$ is a given matrix, $D^{-1/2}$ is the inverse square root of $D$ (I've seen people interpret $D^{-1/2}$ as an element-wise operator on the entries of $D$, so I write explicitly what it means, just in case). – user603 Aug 22 '13 at 10:18

The only standardizations that ensure the same limits are those based on the minimum and maximum, such as (value $-$ minimum) / (maximum $-$ minimum), but these would usually be considered far from robust, if robust here means resistant to outliers (which is far from its only sense).

(value $-$ mean) / SD is very commonly used, but again that is not robust in the usual sense. (value $-$ median) / IQR is more robust.
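A minimal sketch of the (value $-$ median) / IQR standardization in Python (`robust_scale` is a hypothetical helper name, not a library function):

```python
import numpy as np

def robust_scale(x):
    """(value - median) / IQR: a robust analogue of the z-score."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)

# Made-up prices for one make/model/year group, with one outlier
prices = np.array([9500, 9900, 10000, 10200, 10400, 55000])
z = robust_scale(prices)
```

Because the median and IQR are barely affected by the extreme value, the bulk of the data lands on a comparable scale across groups while the outlier remains visibly extreme.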

You may be working with some other definition of robust, in which case you would need to spell it out here.

Your example of a hypothesis, "the distribution of prices had a different shape in the past", is rather general.

On a variety of grounds, I suspect that your problem would be simpler if you used a quite different approach and looked at the logarithm of price. That should pull in outliers. That's not completely arbitrary, because it is so common to talk of prices in general rising (sometimes falling) by some percentage, which corresponds to a shift of log price by the corresponding constant. Much here depends on whether you, and especially your likely readership, are comfortable with thinking on a logarithmic scale.
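To illustrate the point about percentages: a uniform percentage rise in prices is exactly a constant additive shift on the log scale (made-up prices):

```python
import numpy as np

prices = np.array([5000.0, 12000.0, 30000.0, 80000.0])
raised = prices * 1.10  # a uniform 10% price rise

# On the log scale, the rise is the same additive shift for every car,
# namely log(1.10), regardless of the price level
shift = np.log(raised) - np.log(prices)
```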

An alternative used by Howard Wainer for car prices is the reciprocal. An accessible summary of his article can be found at http://mahalanobis.twoday.net/stories/908248/

I'd expect that looking at a variety of graphs (histograms or even better quantile plots) would help to see what kinds of changes there have been to the distribution over time.

Nick Cox
  • Thanks for your suggestions. I was thinking about doing something like a Z-score but with the median and Median Absolute Deviation, but I'm not sure. I clarified the question so that by robust I mean insensitive to outliers. I agree the hypothesis is rather vague ! – Antonio2100 Aug 20 '13 at 01:37
  • MAD will probably work in about the same way as IQR. Expect MAD to be about half IQR, but skewness could mess that up. – Nick Cox Aug 20 '13 at 12:46
  • @Antonio2100: if you do univariate $z$ scores, you change the multivariate shape of your data. To preserve the multivariate shape of your data you should do a multivariate standardization (see my answer). – user603 Aug 20 '13 at 15:53
  • On the use of the log transform to 'pull in the outliers'. This doesn't solve the problem. See the discussion [here](http://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va/3530#3530) for more info. – user603 Aug 20 '13 at 15:59
  • @user603 Like you, I probably didn't answer the question in quite the way the OP may have expected. I take the real problem as being to understand the data and I suspect that transforming the data is just as likely to be as useful, and easier for some people to understand, compared with what you propose. Your solution is very interesting but this comment seems a little dogmatic to me. – Nick Cox Aug 20 '13 at 16:07
  • @NickCox: thanks for the comment. I'm not really sure I understand. To standardize a multivariate distribution, one doesn't proceed a column at a time: doing this will change the multivariate shape. If you think my answer doesn't do what the OP wants, leave a comment below my answer: hopefully I can improve it. – user603 Aug 20 '13 at 16:13
  • It all depends what the problem is. OP asked for univariate standardisations to a shared scale. You propose something different. I propose something else, also different. I don't rule out, in advance, that one is a useful solution to understanding the data and the other isn't. – Nick Cox Aug 20 '13 at 16:20
  • @NickCox: where in the OP's question do you see that the OP asks for univariate standardization? – user603 Aug 20 '13 at 16:28
  • That was originally my inference, but with questions at this level I take univariate standardization to be clearly implied. But it's explicitly confirmed by OP's reference in his comment to z-scores and (value - median)/MAD, which was posted above before you posted your answer. – Nick Cox Aug 20 '13 at 16:33
  • @NickCox: I read that, but I still don't see that it involves univariate normalization: after all, to obtain the Z score corresponding to a draw from $\pmb x_i\sim\mathcal{N}_l(\pmb \mu,\pmb \varSigma)$ one does $\pmb \varSigma^{-1/2}(\pmb x_i-\pmb \mu)$. – user603 Aug 20 '13 at 17:32
  • Now I reverse the questions. Who said anything about normalization? You are thinking multivariate; OP was not; you think OP should be. – Nick Cox Aug 20 '13 at 17:49
  • @NickCox: thanks (we will probably have to delete all these comments once we clear this up...). It is possible that I did not understand, but when the OP wrote 'make, model, year of manufacture, price' I immediately thought of an $n$ by 5 matrix of data. – user603 Aug 20 '13 at 18:14
  • The issue, if there is one, is generic. Someone with data poses a question. Sometimes there are different ways of approaching the same question, which someone more statistical regards as deeper, more fruitful or more correct. The multivariate view has to be acquired.... – Nick Cox Aug 20 '13 at 18:17
  • OK: I think I understand now (thanks for having spelled your views out). The problem I have with pointing the OP in the direction of univariate outlier detection is that by doing this, one changes the correlation in the data, which is what one most often wants to study in the first place. The MV approach is a bit more sophisticated, but it is the way to go (if, as I do, one understands the underlying problems in terms of modeling the pattern of correlations describing the bulk of the data). – user603 Aug 20 '13 at 18:21
  • That's for the OP to comment on, but often the focus is shift in distributions over time without tracking individuals. Bear in mind that there's no guarantee that all makes and models are represented in all years, which would constrain a multivariate approach. – Nick Cox Aug 20 '13 at 18:26
  • This was an interesting discussion and I hope it isn't deleted ! Nick is right as I was originally thinking about my data in the univariate sense, however in light of user603's comments, I will think some more about multivariate approaches. – Antonio2100 Aug 21 '13 at 10:22
  • My original intention was that I would standardize the price data for each make / model / yr of manufacture and look at the distribution either individually (qq-plots, kernel density estimation, etc) or possibly by just combining all the standardized observations (confused about whether this is valid). And then do the same for the older data and look for differences. I actually have another similar dataset where it might be easier to look at matched cases. In any event I was thinking that with the rise in online sales there might be a reduction in price variability. Thanks for your comments. – Antonio2100 Aug 21 '13 at 10:30

No, it is not possible to do a robust transformation that preserves the original distribution of values. This is because any robust transformation (e.g., Winsorizing) will necessarily change the distribution (e.g., if the highest and lowest values are pulled in, this changes both the shape of the distribution and the degree of variation).

A simple way of achieving your desired outcome is, within each group whose distribution you wish to compare, to compute z = floor(5 * (x - min) / (max - min + 0.000001)) and then compare the proportion of observations in each of the 5 categories. Of course you could use more or fewer than 5 categories if you wanted.
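A sketch of this recipe in Python (`five_bins` is a hypothetical helper name; the data are made up):

```python
import numpy as np

def five_bins(x, k=5):
    """Map each value to one of k equal-width bins over [min, max],
    following the z = floor(k * (x - min) / (max - min + eps)) recipe.
    The small eps keeps the maximum in bin k-1 rather than bin k."""
    x = np.asarray(x, dtype=float)
    eps = 1e-6
    return np.floor(k * (x - x.min()) / (x.max() - x.min() + eps)).astype(int)

group = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
z = five_bins(group)
props = np.bincount(z, minlength=5) / len(z)  # proportions per category
```

The resulting `props` vectors are on a common $[0,1]$ scale and can be compared across groups or time periods, e.g. with a chi-squared-style comparison of proportions.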

If you were concerned about a small number of very extreme observations you could first Winsorize the data.
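For the Winsorizing step, scipy provides one implementation; a sketch with made-up prices (the 20% upper limit is an arbitrary choice for illustration):

```python
import numpy as np
from scipy.stats.mstats import winsorize

# One extreme price among otherwise ordinary values
prices = np.array([9000.0, 9500.0, 10000.0, 10500.0, 11000.0, 99000.0])

# Clip the top 20% of values down to the largest remaining value;
# the lower tail is left untouched (limits=[0.0, 0.2])
w = winsorize(prices, limits=[0.0, 0.2])
```

The extreme observation is replaced rather than removed, so the sample size is preserved, but (as the answer notes) the shape of the distribution is altered.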

Tim
  • The question is about standardization, not transformation. I'm happy to regard standardizations as linear transformations, but you need to make a case that the reverse is true, namely that transformations such as the Winsorizing you name are standardizations. I'd suggest that's at odds with typical usage in statistics. – Nick Cox Aug 20 '13 at 15:30
  • Note that I do propose a transformation in my answer. That's presented as an alternative, not a standardization. – Nick Cox Aug 20 '13 at 16:09
  • Thanks for pointing out the broken link. The question asks about 'transformation' not standardization. And, standardization of a variable doesn't really fix problems with the shape of the distribution anyway. My proposed method is essentially a robust empirical estimate of the distribution function, so it feels pretty much in line with the typical usage in statistics to me, but there is always a gulf of guessing between what the person asks and what we answer. ;) – Tim Aug 28 '13 at 04:39
  • I think we can agree that the desiderata stated by the OP are contradictory: preserving the original distribution of values and transformation are antagonistic. – Nick Cox Aug 28 '13 at 07:58