5

Why would exploratory data analysis be important to undertake before null-hypothesis tests?

gung - Reinstate Monica
pkg77x7
  • Perhaps if there is no precise hypothesis to test, the research should always be considered exploratory. – Flask Jan 10 '14 at 01:54
  • Usually you wouldn't do both on the same data (or at least not on the same subset). If you already have a specific hypothesis about a kind of data you understand (e.g. you've seen other samples of it), you may not need EDA at all. If you have no hypotheses, or have no idea what assumptions might be reasonable, it's sensible to look at the data to generate hypotheses/assumptions ... but then you don't test the hypotheses you generate on the data that generated them (at least not unless you can adjust for that). One solution is sample splitting (EDA on part of the data, testing on another). – Glen_b Jan 10 '14 at 10:36
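
A minimal sketch of the sample splitting mentioned in the comment above, using only the Python standard library. The function name `split_sample`, the 50/50 split, and the placeholder `rows` are assumptions made for illustration; nothing in the thread prescribes a particular split ratio.

```python
import random

def split_sample(rows, explore_fraction=0.5, seed=0):
    """Randomly partition rows into an exploration set and a held-out set.

    Hypothetical helper: the exploration part is for EDA and hypothesis
    generation, the held-out part is reserved for the confirmatory test.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))            # stand-in for 100 observations
explore, confirm = split_sample(rows, explore_fraction=0.5, seed=42)
# Look at `explore` as much as you like; touch `confirm` only once,
# to test the hypothesis that the exploration suggested.
```

The point of the design is that the split is made once, at random, before any looking, so that nothing learned from the exploration half can influence which observations end up in the test half.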

2 Answers

3

It is often necessary to know a little about the system being explored before sensible hypotheses come to mind, and it is very useful to know about the variation and noise in an assay before designing an experiment. Exploratory experiments and analyses are good for that. Don't be too quick to decide that a dataset is definitive.

Be aware, though, that hypotheses suggested by the data in an exploratory analysis have a high chance of giving you a spurious 'significant' result if you test them on the same data. Ideally, the exploratory analyses lead to the design and running of new experiments that specifically test those hypotheses.
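
To make the risk concrete, here is a small simulation sketch (standard-library Python only). Everything in it is an assumption chosen for illustration: the data are pure noise, "exploration" means picking whichever of 20 candidate variables correlates most strongly with the outcome, and significance is judged with an approximate two-sided 5% test based on Fisher's z-transform. With 20 independent candidates each tested at the 5% level, the chance that at least one passes by luck is 1 - 0.95^20 ≈ 0.64, so the same-data rate should land near that figure, while a test on freshly generated data should stay near 0.05.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def is_significant(r, n, z_crit=1.96):
    """Approximate two-sided 5% test of zero correlation (Fisher z)."""
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit

def simulate(n_trials=500, n=50, n_candidates=20, seed=1):
    """Return (false-positive rate on same data, rate on fresh data)."""
    rng = random.Random(seed)
    noise = lambda: [rng.gauss(0, 1) for _ in range(n)]
    same_hits = fresh_hits = 0
    for _ in range(n_trials):
        y = noise()
        candidates = [noise() for _ in range(n_candidates)]
        # "Exploration": keep the candidate most correlated with y.
        best_r = max((pearson_r(x, y) for x in candidates), key=abs)
        # Testing the data-suggested hypothesis on the same data.
        same_hits += int(is_significant(best_r, n))
        # A new experiment: everything is noise, so fresh data for the
        # chosen variable and the outcome are just new independent draws.
        fresh_hits += int(is_significant(pearson_r(noise(), noise()), n))
    return same_hits / n_trials, fresh_hits / n_trials

same_rate, fresh_rate = simulate()
print(f"same data: {same_rate:.2f}   fresh data: {fresh_rate:.2f}")
```

There is no real effect anywhere in this setup, so every "significant" result is a false positive; the gap between the two rates is entirely due to choosing the hypothesis after looking at the data.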

Michael Lew
1

There really aren't rules on which comes first: data-driven (hypothesis-generating) analyses then hypothesis-driven analyses, or hypothesis-driven followed by data-driven.

If you deliberately want to test hypotheses first and then do knowledge discovery, you can answer the questions you already have, and then run data-driven analyses on a part of the data that is novel (never studied before) in order to generate new hypotheses.

Otherwise, if you need to run exploratory analyses first to generate hypotheses and nothing is found (no patterns, no clusters, no correlations, essentially just noise), then you won't have any hypotheses to test, because none will have been generated.