8

I want to use the iris dataset provided by scikit-learn for a paper. But I don't know what the standard for referencing datasets is. What citation should I use for this dataset in my paper? Should I reference scikit-learn? Ronald Fisher for having introduced the dataset? Edgar Anderson for having collected the data? All of the above?

Wolfgang
usernumber
  • 4
    If you're going to reference Fisher, you need to spell his name right. My suggestion: if Fisher gave a reference when he first used it, use that reference. If he didn't, reference him (I think most people do that). – Glen_b Oct 12 '14 at 22:15
  • 2
    Unless you are restricted in the number of references, there is no harm in citing both. I find I've never read the Anderson original, but I wouldn't assume he didn't analyse the data unless you have read it too. The Fisher reference is important; my wild guess is that the dataset would have faded into statistical obscurity without Fisher making it prominent. – Nick Cox Oct 13 '14 at 12:50
  • 5
    http://stats.stackexchange.com/questions/74776/what-aspects-of-the-iris-data-set-make-it-so-successful-as-an-example-teaching/74901 doesn't answer your question, but it comments on some common minor errors in working with this dataset. – Nick Cox Oct 13 '14 at 12:51
  • 2
    I would cite both papers (Anderson, 1936; Fisher, 1936), but not `scikit-learn`, as the dataset is simply *bundled* with the library and is *not unique* to it (for example, the same `iris` dataset is bundled with the `R` environment as well). – Aleksandr Blekh Oct 13 '14 at 13:58
  • @aleksandr Blekh - The OP is using a dataset provided by scikit-learn. The page on which the dataset appears mentions "If you use the software, please consider citing scikit-learn". Why would you not cite scikit-learn then? – martino Oct 13 '14 at 14:20
  • 5
    @martino: `scikit-learn` certainly has to be cited, if used. However, the OP's question was about citing the `iris` dataset, which calls for an **independent citation**. This is because the dataset is an *independent entity*, which is included in many software packages and is not unique to `scikit-learn`. (By the way, it wasn't me who downvoted your answer, in case you are curious.) – Aleksandr Blekh Oct 14 '14 at 04:44
  • 1
    @AleksandrBlekh I think your comment is the answer to my question. – usernumber Nov 27 '14 at 00:25
  • All right. Then I will submit my comment as the answer, so that you could upvote and accept it, if you wish. Always glad to help. – Aleksandr Blekh Nov 27 '14 at 04:13

2 Answers

6

I would cite both papers (Anderson, 1936; Fisher, 1936), but not scikit-learn, as the dataset is simply bundled with the library and is not unique to it (for example, the same iris dataset is bundled with the R environment as well). Having said that, scikit-learn certainly has to be cited as well, if used, but not because of the dataset.
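For anyone who keeps references in code, the two papers could be written out as BibTeX entries along the following lines. This is only a sketch: the bibliographic details (titles, journals, volumes, pages) are not given in this thread and are filled in as an assumption, so check them against the originals before use. `REFERENCES` and `to_bibtex` are names made up for illustration.

```python
# Assumption: the bibliographic details below do not come from this thread.
# Verify titles, volumes and page ranges against the original papers.
REFERENCES = {
    "anderson1936": {
        "author": "Anderson, Edgar",
        "title": "The species problem in Iris",
        "journal": "Annals of the Missouri Botanical Garden",
        "year": "1936",
        "volume": "23",
        "number": "3",
        "pages": "457--509",
    },
    "fisher1936": {
        "author": "Fisher, R. A.",
        "title": "The use of multiple measurements in taxonomic problems",
        "journal": "Annals of Eugenics",
        "year": "1936",
        "volume": "7",
        "number": "2",
        "pages": "179--188",
    },
}


def to_bibtex(key, fields):
    """Render one reference as a BibTeX @article entry (hypothetical helper)."""
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


if __name__ == "__main__":
    # Print both entries, separated by a blank line, ready to paste into a .bib file.
    print("\n\n".join(to_bibtex(k, v) for k, v in REFERENCES.items()))
```

A separate entry for the scikit-learn paper would sit alongside these if the library itself is used in the analysis.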

Aleksandr Blekh
-3

I think that citing scikit-learn is sufficient. According to the scikit-learn documentation, you should cite their paper. You can always add a reference to the Iris dataset in scikit-learn by providing a link to the page.

EDIT - I stand corrected. The accepted answer is spot on.

martino
  • 4
    Part of the purpose of a citation is to allow others to find the same data source and a link would serve this purpose very well. But another purpose of citation is to provide academic credit - which in this case belongs less to the developers, but more to the researchers who collated (and perhaps popularised) the data set. It's in this second aspect I find this answer less convincing. – Silverfish Nov 27 '14 at 07:28