6

I have been having a hard time deciding which statistical test to choose for a dataset. The more I read on the web, the more confused I get, since there are frequently different opinions when it comes to choosing the right test.

To this end, when in doubt, I apply one parametric and one non-parametric test, for example a one-way ANOVA and a Kruskal-Wallis, or a two-sample t-test and a Mann-Whitney, hoping that both tests give me the same outcome (generally $p < 0.05$). If they do, I am done; if not, then I need to work harder.
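
(For concreteness, here is a rough sketch of that "run both" habit in Python with SciPy; the group values are made up just to show the idea, not real data.)

```python
from scipy import stats

# Made-up measurements for two groups, purely for illustration.
group_a = [12.1, 13.4, 11.8, 14.0, 12.9, 13.1]
group_b = [15.2, 14.8, 16.1, 15.5, 14.9, 16.3]

t_res = stats.ttest_ind(group_a, group_b)              # parametric
u_res = stats.mannwhitneyu(group_a, group_b,
                           alternative="two-sided")    # non-parametric

print("two-sample t-test: p =", t_res.pvalue)
print("Mann-Whitney U:    p =", u_res.pvalue)
# If both p-values land on the same side of 0.05 I stop; otherwise I dig deeper.
```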

Is there some well-recognized site out there that provides some kind of decision support tree for choosing statistical tests?

Is there some tool that checks as much as possible the assumptions of a statistical test on a given dataset before applying it? For example, for one-way ANOVA it could check for normality and variance homogeneity automatically!
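
(To illustrate what I mean, here is a hypothetical sketch in Python/SciPy of such automatic pre-checks for a one-way ANOVA; the function name, the example numbers, and the idea of just printing the p-values are my own invention, not an existing tool.)

```python
from scipy import stats

def report_anova_assumptions(*groups):
    """Report (not decide) on normality and homogeneity of variance."""
    for i, g in enumerate(groups, start=1):
        p = stats.shapiro(g).pvalue                  # Shapiro-Wilk per group
        print(f"group {i}: Shapiro-Wilk p = {p:.3f}")
    p_levene = stats.levene(*groups).pvalue          # equal-variance check
    print(f"Levene's test for homogeneity of variance: p = {p_levene:.3f}")

# Made-up data for three treatments, purely for illustration.
report_anova_assumptions([12.1, 13.4, 11.8, 14.0],
                         [15.2, 14.8, 16.1, 15.5],
                         [13.0, 12.5, 14.2, 13.8])
```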

I think such a site or tool would help a lot, but probably I am asking too much ...

Thanks

Macro
mljrg
  • Having `some kind of decision support tree` is always beneficial. But `software designed to automatically check the assumptions`, if it means "automatically choosing the most appropriate test", is a project for castrating a man mentally. Unfortunately, some otherwise respectable packages (e.g. SPSS) have already started deadly efforts in this direction. – ttnphns Jun 30 '12 at 06:20
  • Evolution means moving toward higher-level abstractions that make human lives easier (e.g., computer languages and handheld scientific calculators with lots of high-level functions), so I think those "deadly efforts" are welcome. (For example, being a programmer, I can't imagine going back to using assembly language.) Anyway, I think there should be support both for the statistician who wants to control every analysis parameter and for those who only occasionally need some statistical assessment. – mljrg Jun 30 '12 at 08:47
  • mljrg: as a self-justifying policy and value, "making human lives easier" is a road to death caused by amusement. A statistician must reflect on every decision made (why did I prefer this test and not that one). This reflection should have a taste of bitterness, because we seek justification for a basically free, equivocal choice. "Making life easier" is a trick to escape the bitterness. – ttnphns Jun 30 '12 at 09:17
  • @ttnphns: I agree with "A statistician must reflect on every decision made (why did I prefer this test and not that one)", but better *practical assistance* in doing that would be *extremely helpful*, as the options can be very hard to decide between. (See my comment below to Michael Chernick, where I give a great example of that from software programming.) – mljrg Jun 30 '12 at 09:46
  • I too wholeheartedly disagree with @mljrg. Two things explain why. First, think of all the silly tests of normality that people do to justify the use of the $t$-test, assuming the test has a power of 1.0. Secondly, there will be wide disagreement among statisticians on the exact "rules" for "satisfying assumptions". – Frank Harrell Jun 30 '12 at 17:45
  • I think this thread is related to the topic at hand: http://stats.stackexchange.com/questions/22572/what-do-statisticians-do-that-cant-be-automated/22591#22591 – Macro Jul 01 '12 at 05:03
  • @Macro: that's a good topic. Clearly not everything can be automated (I never meant that when I wrote above "tool that checks **as much as possible** the assumptions of a statistical test"), but once I have decided on a statistical test based on knowledge of the sampled data, some other assumptions could probably be checked automatically by the chosen test function in the package being used. – mljrg Jul 01 '12 at 12:49
  • I have to agree w/ everyone else here. Re: having your software automatically check if your data are normal, you might find this question worth reading: [Normality Testing: 'Essentially Useless'?](http://stats.stackexchange.com/questions/2492/). – gung - Reinstate Monica Jul 01 '12 at 17:42
  • The Sigmaplot software automatically tests the assumptions for some basic statistical tests. See: https://systatsoftware.com/products/sigmaplot/sigmaplot-statistical-analysis/ – user182296 Oct 25 '17 at 18:53
  • I would suggest Statwing software which chooses statistical tests automatically. (https://www.statwing.com/) – Ahmed Negida May 28 '17 at 00:23

3 Answers

10

The information needed to decide whether the assumptions of a statistical test are reasonable is often external to the data itself. This means that an automated program would not have the information needed. For example, it is usually assumed that the data were collected independently (or conditionally independently), but just by looking at the data how can you tell the difference between a simple random sample (usually fine for many statistical tests) and a snowball sample (not good for most quantitative tests)? Since a simple random sample makes every possible sample equally likely, any non-independent sample could also have resulted from a simple random sample. You need to know how the data were collected, not just the data themselves.

Also note that if you do a normality test in order to decide which test to use then you are generally either getting a meaningless answer to a meaningful question (small sample size) or a meaningful answer to a meaningless question (large sample size). I expect that many of the other tests for assumptions (without outside knowledge) will have similar problems.
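
A quick simulation makes the small-sample/large-sample point concrete (a minimal Python/SciPy sketch; the choice of a t-distribution with 10 degrees of freedom as the "mildly non-normal" population is just one convenient illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Population that is mildly non-normal: a t-distribution with 10 degrees of
# freedom. The deviation from normality is real but practically negligible.
for n in (20, 2000):
    rejections = sum(
        stats.shapiro(rng.standard_t(df=10, size=n)).pvalue < 0.05
        for _ in range(200)
    )
    print(f"n = {n}: Shapiro-Wilk rejects normality in {rejections}/200 runs")
```

With the small samples the test almost never rejects (it has little power), while with the large samples it rejects far more often, flagging a deviation that would matter little for a $t$-test or ANOVA.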

If you "test" for every assumption that could affect the results of your test (without outside knowledge suggesting which might be the most meaningful) then you are likely to either always reject at least one assumption (if you don't correct for multiple comparisons) or you will have so little power to detect assumption violations (when you do correct for multiple comparisons) that the results will be little better than generating a p-value from a uniform distribution. Knowledge of the science that lead to the data is needed to assess which assumptions to further investigate (and plots are probably as useful as formal tests).

Also note that the non-parametric tests and the normal-theory tests mentioned above are testing different null hypotheses. If the results don't agree, it could be that both are giving correct (or at least approximately correct) answers to different questions.
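
As a small illustration (a Python/SciPy sketch; the particular distributions are chosen only for convenience), the two samples below come from populations with essentially equal means but very different shapes, so the $t$-test usually does not reject while the Mann-Whitney test, which is sensitive to $P(X > Y) \neq 1/2$, usually does:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200

# Equal means (both about 1), very different shapes.
x = rng.exponential(scale=1.0, size=n)       # skewed
y = rng.normal(loc=1.0, scale=0.1, size=n)   # tight and symmetric

print("Welch t-test:   p =", stats.ttest_ind(x, y, equal_var=False).pvalue)
print("Mann-Whitney U: p =",
      stats.mannwhitneyu(x, y, alternative="two-sided").pvalue)
```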

Greg Snow
  • I completely agree with your points about the need for outside knowledge to decide which test to use, but once that knowledge has been used to pick a test, I think the test could further automate checking some other assumptions. My dataset comprises a dependent variable "elapsed time" measured for three treatments, but since the treatment group sizes are small (<= 21), from what you said I conclude that checking for normality is meaningless, so I should go for a non-parametric test (Kruskal-Wallis). Thanks. – mljrg Jul 01 '12 at 12:41
  • @mljrg Wrong conclusion! Doing a normality test is meaningless, but that does not mean that normality-based tests cannot be used. However, time-to-event variables such as elapsed time are usually quite skewed and heteroskedastic (external information!). It is easy to misinterpret even non-parametric tests under these conditions. Other options, such as a log-rank test, might be a good alternative. But, to reiterate, you need to incorporate your knowledge about the outcome variable. – Aniko Jul 02 '12 at 13:58
6

I think you should look at applied statistics texts. An easy one to read, and one of my favorites, was written by the late Rupert Miller (I took the applied statistics sequence that he taught when I was a graduate student at Stanford). At that time the book was not finished and we worked from notes, but it is a marvel. He was a great teacher and writer. The book is titled Beyond ANOVA: Basics of Applied Statistics (available on Amazon). It was originally published by Wiley but apparently is currently reprinted by Chapman & Hall/CRC. It goes through all the assumptions needed for parametric ANOVA and the methods to check them.

Michael R. Chernick
  • Too expensive, especially for me, who will be using statistics only occasionally, not as daily work ... but indeed a good (and less expensive) applied statistics text would be helpful (something to the point, like a kind of recipe book). But which one? – mljrg Jun 30 '12 at 00:58
  • A very useful reference in software programming is [Design Patterns: Elements of Reusable Object-Oriented Software](http://www.amazon.com/Design-Patterns-Elements-Reusable-Object-Oriented/dp/0201633612), which provides systematic and well-justified solutions for common problems and their variants. Is there something like this for statistics in practice? Is there a good reference that lists, for several "data analysis problems", the corresponding "statistical tests"? This would be *very helpful* for the occasional statistician, like me. – mljrg Jun 30 '12 at 09:43
  • @mljrg, the price differences are mostly due to supply & demand; it costs a lot to get a book out & there just aren't that many sales over which to distribute those costs. As a result, statistics texts are pretty much *always* expensive :-(! OTOH, there is a *much* bigger market for programming books, making them cheaper. – gung - Reinstate Monica Jul 01 '12 at 17:33
  • @gung I have a solution to raise the demand: Do you know a good **practical** statistics book? I do not want a book with lots of formulas on how to compute statistics, etc., because that is what statistical packages are meant for. I am looking for a book that takes a practical, case-based approach to statistics, from which I could learn which statistical test to apply to a common example, why, and how to interpret the results. Is there such a statistics book out there? That would be **extremely helpful**. – mljrg Jul 01 '12 at 23:24
  • mljrg, following on @GregSnow's answer, you need to know about those formulas (etc) in order to know what's going on (or will be if you do different things) so that you can make intelligent choices about which test to use & how to interpret it. Re: the book that will help you best in *your* situation, I would put a lot of stock in what Michael Chernick has offered, b/c he is the master of the topic. Re: the 1 *overall* book for someone to read, you may want to read [this CV question](http://stats.stackexchange.com/questions/27553/) & my answer there. – gung - Reinstate Monica Jul 02 '12 at 01:26
  • @gung I don't agree with your opinion about the need to know the formula details to apply statistics. I expect that knowing the assumptions of a statistical test and how to interpret its results should be more than enough to use that statistical test in practice. Beyond this, you should only need to know how to run the test in a tool. After all, most people are able to drive cars without knowing their internals ... ;-) (Of course, everyday statisticians should know the formulas, I think, for the most intricate "test tuning", but occasional users shouldn't need them.) – mljrg Jul 02 '12 at 15:16
  • @mljrg Rupert Miller's book has formulas, and I think all good books need to have formulas. But Miller's book does not use formulas to excess, and he gives good motivation for the methods he discusses. I think there are many good applied statistics books at a lower level than Miller's (but I think Miller's is the best at explaining the assumptions in the linear model and how to check them). – Michael R. Chernick Jul 02 '12 at 15:51
  • David Moore's The Basic Practice of Statistics is one good one to look at: http://www.amazon.com/The-Basic-Practice-Statistics-Student/dp/1429224266/ref=sr_1_1?s=books&ie=UTF8&qid=1341244057&sr=1-1&keywords=the+basic+practice+of+statistics+5th+edition+by+david+s.+moore I also think the books by DeVore and Peck and the ones by DeVeaux are excellent. – Michael R. Chernick Jul 02 '12 at 15:51
  • @MichaelChernick I do computer science research, and sometimes I need to do user-controlled experiments, but generally I can only get small samples because users are hard to get. Mainly, I take measures of task durations and of how people use the tools (these sometimes implement different treatments), and users frequently answer surveys/questionnaires with Likert scales. So, in light of what I do, which statistics do I need, and which books do you recommend? Are the books you listed above suitable for my work, or are there others more suitable? Thanks – mljrg Jul 02 '12 at 16:31
  • Statistical methodology does not change according to sample size. In small samples it is just harder to make inferences. Since you deal with ordinal or, more generally, categorical data, you want books that include contingency table analysis. I think most of the general introductory books I have mentioned do. But there are also books that specialize in categorical data analysis. See the links on books by Agresti. http://www.amazon.com/Introduction-Categorical-Analysis-Probability-Statistics/dp/0471226181/ref=sr_1_1?s=books&ie=UTF8&qid=1341346633&sr=1-1&keywords=alan+agresti – Michael R. Chernick Jul 03 '12 at 20:19
  • http://www.amazon.com/Categorical-Analysis-Series-Probability-Statistics/dp/0471360937/ref=sr_1_4?s=books&ie=UTF8&qid=1341346684&sr=1-4&keywords=alan+agresti http://www.amazon.com/Analysis-Ordinal-Categorical-Probability-Statistics/dp/0470082895/ref=sr_1_5?s=books&ie=UTF8&qid=1341346768&sr=1-5&keywords=alan+agresti – Michael R. Chernick Jul 03 '12 at 20:19
  • @MichaelChernick Thanks for those references! While searching on the internet I also found this [one](http://www.uahsj.ualberta.ca/files/Issues/3-1/pdf/20.pdf) and [this](http://www.vetepi.uzh.ch/services/training/course_booklet.pdf) very useful. – mljrg Jul 04 '12 at 00:01
1

This is old but your library may have it:

"A guide for selecting statistical techniques for analyzing social science data" 2nd ed 1981; institute for social research,university of michigan Andrews, FM; Klem, L; Davidson, TN; O'Malley, PM; rodgers, WL

Laurence