
I've got a reasonably large set of data from a survey I got people to complete. After running a factored Shapiro-Wilk test on the data, the results show that 86 of the 90 variable sets are statistically significant at p < 0.05 (and 89 at p < 0.1). The plots of each of these 90 variables show they are left-skewed.
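For concreteness, a sketch of this kind of per-variable check (Python, using simulated stand-in data rather than the actual survey responses) might look like this:

```python
# Minimal sketch of a per-variable Shapiro-Wilk check; the DataFrame below is
# a simulated stand-in for the real survey items, not the actual data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Left-skewed stand-in for rating-scale items (real survey data would replace this)
responses = pd.DataFrame({f"item_{i}": 6 - rng.exponential(1.0, size=200).round()
                          for i in range(1, 6)})

def normality_summary(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Shapiro-Wilk test and sample skew for every column."""
    rows = []
    for col in df.columns:
        x = df[col].dropna()
        w_stat, p_value = stats.shapiro(x)      # Shapiro-Wilk W statistic and p-value
        rows.append({"variable": col, "W": w_stat, "p": p_value,
                     "reject_normality": p_value < alpha,
                     "skew": stats.skew(x)})    # negative skew => left-skewed
    return pd.DataFrame(rows)

summary = normality_summary(responses)
print(summary["reject_normality"].sum(), "of", len(summary), "items rejected at p < 0.05")
```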

When I mentioned this to my supervisor, he told me that I shouldn't normalise the data (transform it to normality), but should instead conduct nonparametric tests (H-test, U-test, etc.). However, the thesis review committee has asked me to include a section on why I did not normalise the data (transform it to normality).
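For reference, the nonparametric route he suggests might look something like this in Python (hypothetical rating data, purely for illustration; `scipy.stats` provides both tests):

```python
# Minimal sketch of the suggested nonparametric tests on hypothetical
# Likert-type ratings for three groups (not the actual survey data).
from scipy import stats

group_a = [4, 5, 5, 3, 4, 5, 2, 4]
group_b = [3, 4, 2, 5, 3, 3, 4, 2]
group_c = [5, 5, 4, 4, 5, 3, 5, 4]

# Mann-Whitney U-test: compare one item between two groups
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Kruskal-Wallis H-test: compare the item across three or more groups
h_stat, h_p = stats.kruskal(group_a, group_b, group_c)

print(f"U = {u_stat:.1f}, p = {u_p:.3f};  H = {h_stat:.2f}, p = {h_p:.3f}")
```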

Could anybody suggest arguments for or against normalising the data (transforming it to normality), as well as any articles to read?

Clauric
  • By "data normalization" do you mean "transforming to normality"? (Note that "normalization" typically means something else; in particular, the `normalization` [tag-wiki](https://stats.stackexchange.com/tags/normalization/info) explicitly says it's NOT about transforming data. While the term is sometimes used that way in some application areas, it's overall less common than the other senses of *normalization*, and there are less ambiguous/overloaded terms to convey that meaning.) – Glen_b Aug 28 '17 at 12:34
  • @Glen_b changed tags to reflect your point – Clauric Aug 28 '17 at 12:36
  • What's a 'factored' Shapiro-Wilk test? – Glen_b Aug 28 '17 at 12:55
  • I wonder whether these methods are appropriate. What kinds of answers are people giving to a survey that *ought* either to have a Normal distribution or be transformable to a Normal distribution? Is it asking them to supply numerical measurements of 90 things? – whuber Aug 28 '17 at 14:23
  • It's a standard convention and rule of thumb to assume normality with respect to rating-scale survey data, but there are many examples of how these scales can diverge from this assumption. For instance, Norman Cliff in his book *Analyzing Multivariate Data* describes the biases that can emerge as a function of the psychometric properties of the scale being used, e.g., 5-point Likert-type, 7-point, 10-point, or scales of up to 100 points. These include skewed and lumpy distributions (esp. for 100-point scales), end effects when the wording of the scale anchors is too vague, the use of neutral points, etc. – Mike Hunter Aug 28 '17 at 14:31
  • In this case, the research was done across a combination of demographic and Likert scales. There were 53 questions in total, but they were measuring 15 major items and 15 control factors. When the hypotheses were created, there turned out to be 90 items being measured, based on combinations of the control factors and major items. – Clauric Aug 28 '17 at 14:38
  • Related: [On the utility of the Shapiro-Wilk test for testing normality of data](https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless), [On using t-tests even when your data aren't normally distributed](https://stats.stackexchange.com/questions/9573/t-test-for-non-normal-when-n50?noredirect=1&lq=1), [On deciding between parametric and non-parametric tests](https://stats.stackexchange.com/questions/121852/how-to-choose-between-t-test-or-non-parametric-test-e-g-wilcoxon-in-small-sampl?noredirect=1&lq=1) – Barker Aug 28 '17 at 16:34
  • I am unable to comprehend the problem unless you specify your objectives. Data analysis and interpretation cannot be invoked the way you visualize here! –  Aug 31 '17 at 12:01

1 Answer


Historically, statistics grew up and developed based on assumptions of Gaussian normality and its ubiquity in the form of the bell-shaped curve, with a rich and wide-ranging set of methodologies unfolding from that assumption. There are many reasons for these developments, which are well articulated in Efron and Hastie's recent book, *Computer Age Statistical Inference*. One consequence of this assumption of ubiquitous normality is that deviations from it -- outliers -- are viewed as a problem to be solved by normalizing, transforming, and/or deleting the extreme values, using techniques such as trimming and winsorizing, or transformations such as the natural log, Lambert's W function, or the inverse hyperbolic sine, in an effort to force the distribution to conform to normality.
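To make that list concrete, here is a minimal sketch (Python, hypothetical data) of the trimming/winsorizing and transformation toolkit just named; it illustrates the techniques, not a recommendation to use them:

```python
# Sketch of the "force toward normality" toolkit on a hypothetical skewed sample.
import numpy as np
from scipy import stats
from scipy.stats import mstats

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], dtype=float)  # one extreme value

log_x = np.log(x)                    # natural log (positive data only)
asinh_x = np.arcsinh(x)              # inverse hyperbolic sine (handles zeros and negatives)
winsorized_x = mstats.winsorize(x, limits=[0.1, 0.1])   # cap the extreme 10% in each tail
trimmed_mean = stats.trim_mean(x, proportiontocut=0.1)  # drop the extreme 10% before averaging
```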

Robust, nonparametric models are another, less widely employed set of methodologies in the statistical toolkit for dealing with nonconforming data. However, these approaches are less well understood by unsophisticated practitioners or, better, by practitioners whose understanding begins and ends with Gaussian assumptions. Inevitably, this includes those on the dissertation committees of many hapless graduate students. One consequence of this lack of understanding, not surprising given the predominance of Gaussian assumptions, is that robust solutions remain significantly less rich and well developed than the historically earlier, more traditional parametric approaches.

Both of these "approaches" suffer, if you will, from assuming that Gaussian normality is the "correct" view of nature and behavior, in spite of its irremediable flaws. These flaws have to do with the just-as-ubiquitous fact that extreme values and/or large deviations from normality are not outliers but empirical realities. Mandelbrot and Taleb, in their paper "Mild vs. Wild Randomness" (published in *The Known, the Unknown, and the Unknowable in Financial Risk Management: Measurement and Theory*, Princeton University Press, 2010), note that it is possible to shift one's viewpoint from assumptions based on smooth, Gaussian bell shapes to the assumption that exceptional extreme values, jumps, and discontinuities conform more closely to reality than normality does, and can be taken as the starting point for theoretical development. Their view inevitably relegates normal, ordinary data -- the mass of information in the pdf -- to a significantly less consequential role.

Their paper is a good introduction to extreme value theory (EVT), one of the least well-known and understood subdisciplines in statistics. Most importantly for the OP, EVT offers a completely different approach to thinking about and dealing with nonnormal data.
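For a concrete flavor of what that different approach looks like in practice, here is a minimal sketch (simulated data only) of one standard EVT move: fitting a generalized extreme value (GEV) distribution to block maxima with `scipy.stats.genextreme`:

```python
# Sketch of the EVT viewpoint: model the extremes directly by fitting a GEV
# distribution to block maxima. Data are simulated purely for illustration.
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(1)
obs = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 50))  # 200 blocks of 50 observations
block_maxima = obs.max(axis=1)

shape, loc, scale = genextreme.fit(block_maxima)              # fit the GEV to the maxima
level_99 = genextreme.ppf(0.99, shape, loc=loc, scale=scale)  # 99th-percentile level of the maxima
print(f"GEV shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}, 99% level={level_99:.1f}")
```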

Mike Hunter