I want to use the iris dataset provided by scikit-learn for a paper. But I don't know what the standard for referencing datasets is. What citation should I use for this dataset in my paper? Should I reference scikit-learn? Ronald Fisher for having introduced the dataset? Edgar Anderson for having collected the data? All of the above?
8
-
4 If you're going to reference Fisher, you need to spell his name right. My suggestion: if Fisher gave a reference when he first used it, use that reference. If he didn't, reference him (I think most people do that). – Glen_b Oct 12 '14 at 22:15
-
2 Unless you are restricted in the number of references, there is no harm in citing both. I find I've never read the Anderson original, but I wouldn't assume he didn't analyse the data unless you have read it too. The Fisher reference is important; my wild guess is that the dataset would have faded into statistical obscurity without Fisher making it prominent. – Nick Cox Oct 13 '14 at 12:50
-
5 http://stats.stackexchange.com/questions/74776/what-aspects-of-the-iris-data-set-make-it-so-successful-as-an-example-teaching/74901 doesn't answer your question, but it comments on some common minor errors in working with this dataset. – Nick Cox Oct 13 '14 at 12:51
-
2 I would cite both papers (Anderson, 1936; Fisher, 1936), but not `scikit-learn`, as the dataset is simply *bundled* with the library and is *not unique* to it (for example, the same `iris` dataset is bundled with the `R` environment as well). – Aleksandr Blekh Oct 13 '14 at 13:58
-
@aleksandr Blekh - The OP is using a dataset provided by scikit-learn. The page on which the dataset appears mentions "If you use the software, please consider citing scikit-learn". Why would you not cite scikit-learn then? – martino Oct 13 '14 at 14:20
-
5 @martino: `scikit-learn` certainly has to be cited, if used. However, the OP's question was about citing the `iris` dataset, which calls for an **independent citation**. This is because the dataset is an *independent entity*, which is included in many software packages and is not unique to `scikit-learn`. (By the way, it wasn't me who downvoted your answer, in case you are curious.) – Aleksandr Blekh Oct 14 '14 at 04:44
-
1 @AleksandrBlekh I think your comment is the answer to my question. – usernumber Nov 27 '14 at 00:25
-
All right. Then I will submit my comment as the answer, so that you can upvote and accept it, if you wish. Always glad to help. – Aleksandr Blekh Nov 27 '14 at 04:13
2 Answers
6
I would cite both papers (Anderson, 1936; Fisher, 1936), but not `scikit-learn`, as the dataset is simply bundled with the library and is not unique to it (for example, the same `iris` dataset is bundled with the `R` environment as well). Having said that, `scikit-learn` certainly has to be cited as well, if used, but not because of the dataset.

Aleksandr Blekh
-3
I think that citing scikit-learn is sufficient. According to the scikit-learn documentation, you should cite their paper. You can always add a reference to the iris dataset in scikit-learn by providing a link to the page.
EDIT - I stand corrected. The accepted answer is spot on.

martino
-
4 Part of the purpose of a citation is to allow others to find the same data source, and a link would serve this purpose very well. But another purpose of citation is to provide academic credit, which in this case belongs less to the developers and more to the researchers who collated (and perhaps popularised) the data set. It's in this second aspect that I find this answer less convincing. – Silverfish Nov 27 '14 at 07:28