The famous Anscombe data set is often used to illustrate various problems in regression, such as nonlinear relationships, outliers, and influential points. But with only N = 11 points per data set, there is a limited amount one can do.

Also, simple plots of the data make it pretty clear that the various simple linear regressions are problematic.

I thought of generating a larger data set (say, N = 1000) with similar patterns, by jittering each of the variables in the Anscombe data. Has anyone used this sort of approach as a didactic tool to illustrate violations of the OLS assumptions and ways to find them?
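A minimal sketch of the jittering idea, using only the Python standard library. The Anscombe values are the published ones; the helper names `jitter_dataset` and `ols`, the `sd_frac` noise parameter, and the choice to resample points with replacement before adding Gaussian noise are all assumptions made for illustration, not an established recipe.

```python
import random

# Anscombe's quartet (11 points per data set).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def jitter_dataset(x, y, n=1000, sd_frac=0.05, seed=0):
    """Draw n (x, y) pairs with replacement and add Gaussian noise to each.

    sd_frac (hypothetical knob): noise SD as a fraction of each variable's range.
    """
    rng = random.Random(seed)
    sd_x = sd_frac * (max(x) - min(x))
    sd_y = sd_frac * (max(y) - min(y))
    points = []
    for _ in range(n):
        i = rng.randrange(len(x))  # pick one of the original points
        points.append((x[i] + rng.gauss(0, sd_x), y[i] + rng.gauss(0, sd_y)))
    return points


def ols(points):
    """Return (intercept, slope) of the simple least-squares line."""
    n = len(points)
    mx = sum(px for px, _ in points) / n
    my = sum(py for _, py in points) / n
    sxx = sum((px - mx) ** 2 for px, _ in points)
    sxy = sum((px - mx) * (py - my) for px, py in points)
    slope = sxy / sxx
    return my - slope * mx, slope


# Build one enlarged data set per Anscombe panel and report the fitted line.
big = {name: jitter_dataset(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, pts in big.items():
    intercept, slope = ols(pts)
    print(f"{name}: n={len(pts)}, intercept={intercept:.2f}, slope={slope:.2f}")
```

Two caveats with this sketch: noise added to x slightly attenuates the fitted slope, and resampling turns the single high-leverage point in data set IV into a cluster of roughly N/11 points. To keep one influential point, one option is to jitter only the other ten points and append the outlier once.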

Peter Flom
  • At https://stats.stackexchange.com/questions/152028/i-have-a-line-of-best-fit-i-need-data-points-that-will-not-change-my-line-of-be/152034#152034 I posted code to create datasets like Anscombe's quartet--and demonstrated its efficacy by simply and easily producing the quartet itself. – whuber May 28 '19 at 21:26
  • @whuber Very cool. Are there any restrictions on using it? If I use it in a course or something, do you want a citation? – Peter Flom May 29 '19 at 10:51
  • It's yours to keep, Peter :-). – whuber May 29 '19 at 10:57

1 Answer

Have you seen the Datasaurus Dozen (https://www.autodeskresearch.com/publications/samestats)? It's a bigger (and, of course, more extreme) take on the Anscombe idea: data sets with nearly identical summary statistics but very different scatterplots. An alternative might be to dig around for some real data that has the kind of problems you are looking for.

John Davis