For my study design: When to log-transform data vs. when to use a non-parametric approach

Question

Edit Purpose of my study

I have weather stations collecting data inside and outside low-tech greenhouses. Four of the weather stations are inside, and one is outside. They are collecting temperature, humidity, solar radiation, wind speed, etc. I am testing whether the differences between the weather station data inside and outside are statistically significant.

Because I have an unequal number of replicates inside and outside the greenhouses, I calculated, for each variable, the difference between each weather station inside a greenhouse and the one weather station outside. This gives me a sample size of 4 (sometimes 3, because I lost a replicate for part of the study). I was hoping to test whether these differences are significantly different from zero, rather than testing the original weather station data.
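The differencing step described above can be sketched as follows. This is only an illustration: the temperature values are made up (they are not data from the study), and the function name `one_sample_t` is hypothetical. It computes each inside-minus-outside difference and the one-sample t statistic for the null hypothesis that the mean difference is zero.

```python
import math
import statistics

def one_sample_t(diffs, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean(diffs) == mu0."""
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two differences")
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    se = sd / math.sqrt(n)        # standard error of the mean difference
    return (statistics.fmean(diffs) - mu0) / se, n - 1

# Hypothetical mean temperatures (deg C): four inside stations, one outside.
inside = [24.1, 25.3, 23.8, 24.9]
outside = 21.5

# One difference per inside station, each against the single outside station.
diffs = [x - outside for x in inside]
t_stat, df = one_sample_t(diffs)
print(f"differences = {[round(d, 2) for d in diffs]}")
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

With 3 degrees of freedom the two-sided 5% critical value of the t distribution is about 3.18, so the statistic has to be fairly large before the test rejects. In R, the equivalent call would be `t.test(diffs, mu = 0)`.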

The Shapiro-Wilk test in R indicates that none of my data come from a normal distribution. I am debating whether to transform the data and use a t-test or to use a non-parametric test. I am leaning towards the latter.

Question: Should I log-transform the data and run the t-test, or should I find a non-parametric test to use? What are the pros and cons of each? I have done a bit of research, but I'm still unclear on the best approach. Or is there another approach I haven't thought of?
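One point that may bear on the non-parametric option with samples this small: an exact test based on sign flips has very few possible outcomes. The sketch below is not a method from this thread; it is a plain sign-flip (randomization) test written for illustration, with a hypothetical function name and made-up differences. It enumerates every assignment of signs to the absolute differences and counts how often the absolute sum is at least as large as the observed one.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Exact two-sided p-value for H0: differences are symmetric about zero."""
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under H0 each difference is equally likely to be positive or negative,
    # so all 2**n sign assignments are equally likely.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * abs(d) for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
    return extreme / total

# Hypothetical differences, all in the same direction (the most extreme case).
diffs = [2.6, 3.8, 2.3, 3.4]
print(sign_flip_p_value(diffs))
```

Even when all four differences have the same sign, only 2 of the 16 sign assignments are as extreme as the observed one, so the smallest attainable two-sided p-value is 2/16 = 0.125 (and 2/8 = 0.25 with three differences). The exact Wilcoxon signed-rank test has the same floor, which is worth knowing before choosing a significance threshold.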

*The post When (and why) should you take the log of a distribution (of numbers)? talks about transformations, but it doesn't compare and contrast them to not transforming your data and using a different test.

By "locations" do you mean *geographical locations* of the observations or *statistical locations* of the underlying distributions? Specifically, do you need to test a difference in means or would it suffice just to find a difference in "central tendencies," in any way those might be conveniently expressed? — whuber, Jun 20 '17 at 22:21
@whuber geographic locations. Differences in central tendencies might work. Could you elaborate on what that means? — phaser, Jun 20 '17 at 22:31
A t-test compares means. Most non-parametric tests compare medians or assess "stochastic dominance." When you compare the means of (nonlinearly) transformed distributions, that is tantamount to comparing some *other* kind of average of the original distribution. That's why knowing precisely what you're testing may be important. — whuber, Jun 20 '17 at 23:13
@whuber Thanks. That makes sense. In my case, I think means would be the best. See my update for more specifics on my study. — phaser, Jun 20 '17 at 23:36
It does talk about distributions, moreover, a t-test is a (simple) model; see: [How are regression, the t-test, and the ANOVA all versions of the general linear model?](https://stats.stackexchange.com/q/59047/7290) — gung - Reinstate Monica, Jun 20 '17 at 23:42
@gung those resources are useful. See my update for why I think my question is different. — phaser, Jun 21 '17 at 00:23
This sounds like an XY problem: you are asking for help with techniques that might not be appropriate for your situation. Why not just tell us what you are trying to accomplish, what your data are like, and what thoughts and concerns you have about using those data to achieve your aims? — whuber, Jun 21 '17 at 13:57
@whuber I added some details about my study under Edit Purpose of my study. Please let me know if that is enough information. — phaser, Jun 21 '17 at 15:27
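The remark in the comments that comparing means of transformed data amounts to comparing some other kind of average can be made concrete for the log transform. A minimal sketch, assuming strictly positive made-up values: averaging the logs and back-transforming gives the geometric mean, not the arithmetic mean.

```python
import math
import statistics

# Hypothetical, strictly positive values (the log is undefined otherwise).
values = [1.0, 10.0, 100.0]

arithmetic_mean = statistics.fmean(values)

# Mean on the log scale, then back-transformed to the original scale.
mean_of_logs = statistics.fmean([math.log(v) for v in values])
back_transformed = math.exp(mean_of_logs)

# The standard library's geometric mean, for comparison.
geometric = statistics.geometric_mean(values)

print(f"arithmetic mean       = {arithmetic_mean:.4f}")
print(f"exp(mean of logs)     = {back_transformed:.4f}")
print(f"geometric mean        = {geometric:.4f}")
```

So a t-test on log-transformed data is a statement about geometric means (ratios) on the original scale. Note also that differences between stations can be negative, in which case they cannot be log-transformed directly.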

score 0 · Answer 1 · answered Jun 21 '17 at 00:51

Just a couple of thoughts on data transformation. I hope they are helpful, though they come from a finance/economics perspective rather than an environmental one. In finance and economics, daily data are often converted to "logarithmic returns", which in your case (avoiding time-series notation) can be expressed as $$\log(X_{i})-\log(X_{i-1}).$$ For small changes these are not much different from "arithmetic returns"; see this link for an illustration (it has a nice graph if you scroll down a bit; it is not the most credible source, but it serves for illustration). This is often referred to as first differencing (at least in econometrics): the difference of two logs equals the log of their ratio, so you apply the log transformation first and then simply subtract. If first differencing is hard to justify in your field, consider arithmetic returns, which in the same notation are (I hope the expression is correct) $$\left(\dfrac{X_{i}}{X_{i-1}}-1\right)\times 100$$ and can be interpreted as the percentage gain or loss relative to the previous observation. Most of the time this makes the data stationary and closer to normally distributed (though there is ample evidence that financial returns are not normally distributed).
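To illustrate the two formulas above, here is a small sketch with a made-up, strictly positive series; the function names are hypothetical. It computes log returns and arithmetic returns (in percent) for consecutive observations and prints them side by side.

```python
import math

def log_returns(xs):
    """log(X_i) - log(X_{i-1}) for consecutive observations."""
    return [math.log(b) - math.log(a) for a, b in zip(xs, xs[1:])]

def arithmetic_returns_pct(xs):
    """((X_i / X_{i-1}) - 1) * 100 for consecutive observations."""
    return [(b / a - 1.0) * 100.0 for a, b in zip(xs, xs[1:])]

# Hypothetical series: two small changes followed by one large jump.
series = [100.0, 101.0, 99.0, 120.0]

lr = log_returns(series)
ar = arithmetic_returns_pct(series)

for log_r, arith_r in zip(lr, ar):
    # Log returns are scaled by 100 so both columns are in percent.
    print(f"log return = {100 * log_r:7.3f}%   arithmetic return = {arith_r:7.3f}%")
```

The two measures nearly coincide for the small changes and drift apart for the large jump, because a log return equals log(1 + r) where r is the arithmetic return expressed as a fraction.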

As for the $t$-test you want to conduct: results on the transformed (arithmetic returns) data should be valid for supporting or rejecting your hypothesis, if and only if the transformation "improves" your data. In general, you are touching on a very big question that is difficult to summarize in one answer (and I am still a student and learning myself, so I might be wrong). Whether that transformation will be enough for your specific case is hard to tell. One general comment from basic time-series econometrics, where such transformations are common: if you run a regression on time-series returns, heteroscedasticity is likely to be an issue, and it affects the t-statistics that most software packages report automatically. Accounting for heteroscedasticity should give you heteroscedasticity-robust t-statistics. Unfortunately, diagnosing your results is never a trivial process and may require further steps and tests before you can finalize your conclusions. Hope that helps.
