Preserving the order of values and preserving their signs too are both important in almost all transformations. Unless the zero point of measurement is arbitrary, the distinction between positive and negative values is qualitatively as well as quantitatively important and should be kept to ease scientific and substantive interpretation.
There is some hint here of firing a shotgun to see if you manage to hit the target somehow.
The presumption here seems to be that the marginal distribution of the response or outcome variable should be normal, but even for analysis of variance that is unlikely to hold and is not an assumption or even an ideal condition. For analysis of variance the ideal condition is, at most, that the errors perturbing the model structure are normally distributed.
I'd be immensely more concerned with whether transformation makes the data closer to (or further away from) additivity and equal variability, more important ideal conditions. Some disciplines practise paranoia at anything with the slightest whiff of non-normality (and also neglect more important issues).
What I think can be ruled out absolutely are
- Square rooting, for the reason you give: it doesn't keep the sign (and the order is not even maintained). Your statement seems to imply that you rooted the absolute values; an alternative, sign(value) * root(abs(value)), would solve that problem (a sketch follows this list).
- Natural log transform, for the reason you give: you can't do this with negative values.
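A minimal sketch of that sign-preserving square root, assuming NumPy and purely illustrative values:

```python
import numpy as np

def signed_sqrt(x):
    """Sign-preserving square root: root of the magnitude, original sign kept."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.sqrt(np.abs(x))

# Order and sign are preserved, unlike square rooting absolute values alone.
print(signed_sqrt([-9, -1, 0, 1, 9]))   # [-3. -1.  0.  1.  3.]
```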
What I advise against as a matter of judgment and experience are
- Add a constant then log transform: you state that this arbitrarily leaves your lowest value as an outlier, but that is contingent on the unstated constant used (a quick check follows this list). More crucially, I have never seen this used where it seems natural scientifically. (At best, $\log(x + \text{smidgen})$ works acceptably for $x \ge 0$.)
- Outlier removal (2.5 SD): you state that ANOVA on these data yields similar results to an ANOVA on the raw data. Hence the highly arbitrary removal can be avoided. Note that there are many, many threads here on outliers and many ideas on how to deal with them, but also a strong consensus that removing outliers because they are awkward is a very poor approach statistically. See e.g. here.
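To see the point about the constant, a quick check (the values and constants below are purely illustrative, not your data):

```python
import numpy as np

x = np.array([-120.0, -5.0, 0.0, 3.0, 40.0, 600.0])   # illustrative values only

# The constant must exceed -min(x); beyond that its size is arbitrary,
# and it controls how far out the lowest value sits on the log scale.
for c in (121, 150, 500):
    print(c, np.round(np.log(x + c), 2))
```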
What remains plausible so far as I can see:
- 'Neglog' transform (Whittaker et al. 2005): I am not unduly perturbed by a report that it makes your data bimodal.
- Perform the non-parametric equivalent (Friedman test) and compare the results: that is a common check, although in my view overrated because of the very limited quantitative inferences allowed.
- I'd add cube root transformation as respecting sign too. More on that here. (A sketch covering neglog, the Friedman test, and the cube root follows this list.)
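For concreteness, a minimal sketch of the neglog and (sign-preserving) cube root transforms, plus the Friedman test via SciPy. The layout is an assumption: rows are subjects, columns the three modalities, and the numbers are simulated, not yours.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def neglog(x):
    """Neglog transform (Whittaker et al. 2005): sign(x) * log(1 + |x|)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def cube_root(x):
    """Cube root; np.cbrt handles negative values and keeps the sign."""
    return np.cbrt(np.asarray(x, dtype=float))

# Hypothetical layout: rows = subjects, columns = the three modalities.
rng = np.random.default_rng(0)
data = rng.normal(0, 100, size=(20, 3))

# Friedman test as the non-parametric check on the repeated measures.
stat, p = friedmanchisquare(data[:, 0], data[:, 1], data[:, 2])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")
print(neglog(data[:2]))      # transformed values keep their signs
print(cube_root(data[:2]))
```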
EDIT 1: I'd like to have your data as well as your graphs, but the graphs make clear that you are excluding in terms of each modality's mean and SD. That's got to seem arbitrary, when excluding in terms of the overall mean and SD is also possible.
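One quick way to see how much that choice matters is to flag values beyond 2.5 SD both ways and cross-tabulate the flags. A hedged pandas sketch; the long format and the column names `modality` and `intercept` are assumptions, and the numbers are simulated:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one intercept per subject-by-modality cell.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "modality": np.repeat(["A", "B", "C"], 30),
    "intercept": rng.normal(0, 100, size=90),
})

def beyond(x, k=2.5):
    """Flag values more than k standard deviations from the mean."""
    return (x - x.mean()).abs() > k * x.std()

# Flags relative to each modality's own mean and SD ...
per_modality = df.groupby("modality")["intercept"].transform(beyond)
# ... versus flags relative to the overall mean and SD.
overall = beyond(df["intercept"])

print(pd.crosstab(per_modality, overall))
```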
Just to illustrate the real (or apparent) problems, I simulated a skewed distribution which has approximately the same range as your raw intercepts. (The details of the forgery should not be important, but I drew from a beta distribution with shape parameters 1 and 3 and shifted and stretched to get limits of about $-150$ and $620$.)
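The forgery can be reproduced along these lines (the shape parameters and limits are as stated above; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Beta(1, 3) on [0, 1], then stretched and shifted to roughly [-150, 620].
fake = -150 + (620 - (-150)) * rng.beta(1, 3, size=500)

print(fake.min(), fake.max())   # close to -150 and 620, with a long right tail
```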
Quantile plots of the raw "data", neglog transformed and cube root transformed are given here:
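For anyone wanting to reproduce such plots, a minimal matplotlib sketch (plain quantile plots: ordered values against plotting positions; the simulated values mirror the sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
fake = -150 + 770 * rng.beta(1, 3, size=500)      # simulated values, as above

def neglog(x):
    """Neglog transform (Whittaker et al. 2005)."""
    return np.sign(x) * np.log1p(np.abs(x))

def quantile_plot(ax, x, title):
    """Ordered values against cumulative probability (plotting positions)."""
    x = np.sort(np.asarray(x, dtype=float))
    p = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    ax.plot(p, x, marker=".", linestyle="none")
    ax.set_title(title)
    ax.set_xlabel("cumulative probability")

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (z, name) in zip(axes, [(fake, "raw"),
                                (neglog(fake), "neglog"),
                                (np.cbrt(fake), "cube root")]):
    quantile_plot(ax, z, name)
plt.tight_layout()
plt.show()
```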
The plots suggest
1. Outliers (which aren't extraordinary at all in my view) are drawn in helpfully by either transform.
2. The marginal distributions that result are indeed bimodal, because each transformation is steepest at zero, so values near zero get stretched apart relative to others.
3. Cube root is a little gentler than neglog.
I see no reason to suppose that the gain of #1 is outweighed by the shape change of #2. Indeed, large intercepts of either sign seem likely to have large standard errors and each transformation may help in that respect.
The key point is that it doesn't seem out of order to use mean-based summaries such as ANOVA on data like these, with or even without transformation. A generalised linear model with appropriate link is a good way to proceed, although you may have to write some extra code.
The five points labelled on each vertical axis are the maximum, upper quartile, median, lower quartile and minimum in each case. As grid lines are also provided for cumulative probabilities 0(0.25)1 (that is, 0 to 1 in steps of 0.25), it is possible to trace quartile-based boxes such as would appear on a conventional box plot.
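Those five labelled points are just the five-number summary; a minimal NumPy sketch of how they could be computed (the values are illustrative):

```python
import numpy as np

def five_number_summary(x):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    return np.quantile(np.asarray(x, dtype=float), [0, 0.25, 0.5, 0.75, 1])

print(five_number_summary([-150, -30, 0, 45, 200, 620]))
```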
EDIT 2: Thanks for posting a copy of your data.
Some experiments indicate:
- With your data cube root is distinctly gentler than neglog, because your data are more non-normal than my simulated data. It works well to pull in moderate outliers and reduce skewness. The plots put median-and-quartile boxes on top of quantile plots, so-called quantile-box plots.
- Repeated measures ANOVA is fairly robust insofar as P-values are scientifically similar for raw data (with no outlier removal) and cube roots (a sketch of that check follows this list).
- Scatter plots of pairs of modalities seem easier to think about too. The separation into positive and negative clumps is fortuitous (presumably nothing rules out intercept estimates that are very close to zero) but may be instructive.
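A hedged sketch of that robustness check with statsmodels' AnovaRM; the long format and the column names `subject`, `modality` and `intercept` are assumptions, and the data here are simulated rather than yours:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one intercept per subject and modality.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "subject":   np.repeat(np.arange(20), 3),
    "modality":  np.tile(["A", "B", "C"], 20),
    "intercept": rng.normal(0, 100, size=60),
})

# Repeated measures ANOVA on the raw intercepts ...
raw_fit = AnovaRM(df, "intercept", "subject", within=["modality"]).fit()
# ... and on their (sign-preserving) cube roots, for comparison.
df["cbrt"] = np.cbrt(df["intercept"])
cbrt_fit = AnovaRM(df, "cbrt", "subject", within=["modality"]).fit()

print(raw_fit.anova_table)
print(cbrt_fit.anova_table)
```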

General hint on graphing: As is common with logarithmic scales, I'd advise that you label axes with numbers on your original scale, to help make clear where the transformation stretches and squeezes.
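A minimal matplotlib sketch of that labelling trick: put the ticks at the transformed positions of round numbers on the original scale, but label them with the original numbers (the tick values here are just examples):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = -150 + 770 * rng.beta(1, 3, size=200)        # illustrative skewed values

fig, ax = plt.subplots()
ax.plot(np.sort(np.cbrt(x)), marker=".", linestyle="none")

# Ticks sit at cube-root positions but are labelled in original units,
# which makes visible where the transformation stretches and squeezes.
original_ticks = [-150, -50, 0, 50, 200, 600]
ax.set_yticks(np.cbrt(original_ticks))
ax.set_yticklabels([str(t) for t in original_ticks])
ax.set_ylabel("intercept (original units, cube root scale)")
plt.show()
```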
Disclaimer: I have no expertise in your field and cannot advise on scientific interpretation.