I'm interested in the process of testing or validating a particular implementation of a statistical method, and in what datasets and/or published analyses exist that could be used to do this in practice.

For instance, if I write an algorithm to implement a simple linear regression, I might feed in some numbers and check that the result looks reasonable, or I might feed the same numbers into my code and into some other system and compare the outputs. In some cases, people seem to have already done this and published both the numbers and the results, which could then serve as reference data.
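As a minimal sketch of the first kind of check described above, the Python snippet below fits a simple linear regression to noise-free data generated from a known line, so the expected coefficients are known in advance. The function name `simple_linear_regression` and the test data are assumptions made up for illustration; they do not come from any published reference set.

```python
import math


def simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Hypothetical implementation under test; returns (intercept, slope).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data generated exactly from y = 1 + 2x (no noise),
# so the true intercept and slope are known by construction.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]

intercept, slope = simple_linear_regression(xs, ys)
assert math.isclose(intercept, 1.0, rel_tol=1e-12)
assert math.isclose(slope, 2.0, rel_tol=1e-12)
```

A check like this only shows that the code recovers an easy, exactly known answer; it says little about numerical behaviour on ill-conditioned data, which is where published reference results become useful.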

To start off, the best one I know of is the NIST Statistical Reference Datasets page, which publishes a wide range of datasets and calculations covering areas such as Analysis of Variance, Linear Regression, Markov Chain Monte Carlo simulation and Non-Linear Regression.
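When comparing an implementation against published reference values such as these, one common way to summarise agreement is the log relative error (LRE), which is roughly the number of significant digits that match. The sketch below is an assumption about how such a comparison might be coded; the helper name `log_relative_error` and the numbers in the example are hypothetical placeholders, not values taken from any NIST dataset.

```python
import math


def log_relative_error(computed, reference):
    """Approximate number of correct significant digits in `computed`.

    Hypothetical helper: returns -log10(|computed - reference| / |reference|),
    falling back to the absolute error when the reference value is zero.
    """
    if computed == reference:
        return math.inf  # exact agreement
    if reference == 0:
        return -math.log10(abs(computed))
    return -math.log10(abs(computed - reference) / abs(reference))


# Placeholder numbers for illustration only (not real reference values).
reference_slope = 1.0
computed_slope = 1.0001

lre = log_relative_error(computed_slope, reference_slope)
print(f"about {lre:.1f} matching significant digits")
```

A test suite could then require the LRE of each estimated parameter to exceed some threshold chosen for the problem's difficulty, instead of demanding bit-for-bit equality.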

Are there any other good or notable ones out there?

Edit: I reworded to make clear that I'm not just looking for open datasets, but I'm interested in datasets and solutions to specific statistical problems that could be used to test an implementation of a technique.

Sycorax
PaulHurleyuk
  • possible duplicate of [Locating freely available data samples](http://stats.stackexchange.com/questions/7/locating-freely-available-data-samples) –  Aug 10 '10 at 13:55
  • Maybe going into a more focused and detailed catalogue, such as "Free data set for very high dimensional classification" (http://stats.stackexchange.com/questions/973/free-data-set-for-very-high-dimensional-classification), could be an idea, but otherwise I also feel this is a duplicate. – robin girard Aug 10 '10 at 14:17
  • I voted to close as duplicate. – Shane Aug 10 '10 at 15:36
  • @Shane if I could I would do the same – robin girard Aug 10 '10 at 16:01
  • I already voted to close. Perhaps one of the moderators could close this question. –  Aug 10 '10 at 16:16
  • It's just my two cents, but I think that if we limit stats.se to one datasets question, it becomes harder to share that knowledge. I'm sorry if I didn't word my question/explanation well enough to differentiate this question, but I foresee there being lots of 'what open dataset exists to do x?' questions, where x is what stops them from all being dupes. – PaulHurleyuk Aug 10 '10 at 17:55
  • I think it makes a lot of sense for this question to remain open if you could edit the question to indicate how your request is different from the one that exists and why the answers to the existing question do not satisfy your requirements. See Robin's comment as an example of what I am suggesting. –  Aug 10 '10 at 18:24
  • I like your question (presumably after being edited). It sounds like your question is not especially about datasets. Rather, it's about the resources and infrastructure for comparing reference results to a new implementation. – Jeromy Anglim Aug 11 '10 at 04:53

1 Answer

See the Stack Overflow question on this subject: "Datasets for Running Statistical Analysis on".

I would reiterate my answer, that R contains (in packages) many of the canonical datasets for specific statistical problems.

Shane
  • I have seen that question, and your answer was good (and I do rely on a lot of the datasets that ship with R), but not many of them include defined (or should I say well-characterized) answers to particular statistical questions. (Although I believe that if a dataset has been used in an example for a function in the help file, then the package will include the results of running that example, so you could use that.) – PaulHurleyuk Aug 10 '10 at 17:58