How do I cite the iris dataset in a paper?

Question

I want to use the iris dataset provided by scikit-learn for a paper. But I don't know what the standard for referencing datasets is. What citation should I use for this dataset in my paper? Should I reference scikit-learn? Ronald Fisher for having introduced the dataset? Edgar Anderson for having collected the data? All of the above?

If you're going to reference Fisher, you need to spell his name right. My suggestion: if Fisher gave a reference when he first used it, use that reference. If he didn't, reference him (I think most people do that). — Glen_b, Oct 12 '14 at 22:15
Unless you are restricted in the number of references, there is no harm in citing both. I find I've never read the Anderson original, but I wouldn't assume he didn't analyse the data unless you have read it too. The Fisher reference is important; my wild guess is that the dataset would have faded into statistical obscurity without Fisher making it prominent. — Nick Cox, Oct 13 '14 at 12:50
http://stats.stackexchange.com/questions/74776/what-aspects-of-the-iris-data-set-make-it-so-successful-as-an-example-teaching/74901 doesn't answer your question, but it comments on some common minor errors in working with this dataset. — Nick Cox, Oct 13 '14 at 12:51
I would cite both papers (Anderson, 1936; Fisher, 1936), but not `scikit-learn`, as the dataset is simply *bundled* with the library and is *not unique* to it (for example, the same `iris` dataset is also bundled with the `R` environment). — Aleksandr Blekh, Oct 13 '14 at 13:58
@aleksandr Blekh - The OP is using a dataset provided by scikit-learn. The page on which the dataset appears mentions "If you use the software, please consider citing scikit-learn". Why would you not cite scikit-learn then? — martino, Oct 13 '14 at 14:20
@martino: `scikit-learn` certainly has to be cited, if used. However, the OP's question was about citing the `iris` dataset, which calls for an **independent citation**. This is because the dataset is an *independent entity*, which is included in many software packages and is not unique to `scikit-learn`. (By the way, it wasn't me who downvoted your answer, in case you are curious.) — Aleksandr Blekh, Oct 14 '14 at 04:44
@AleksandrBlekh I think your comment is the answer to my question. — usernumber, Nov 27 '14 at 00:25
All right. Then I will submit my comment as the answer, so that you could upvote and accept it, if you wish. Always glad to help. — Aleksandr Blekh, Nov 27 '14 at 04:13

score 6 · Accepted Answer · answered Nov 27 '14 at 04:24

I would cite both papers (Anderson, 1936; Fisher, 1936) for the dataset itself, but not scikit-learn, because the dataset is merely bundled with the library and is not unique to it (for example, the same iris dataset also ships with the R environment). That said, if you use scikit-learn as software, it certainly has to be cited as well, just not as the source of the dataset.
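For anyone who keeps references in BibTeX, here is a minimal sketch of how the two dataset citations could be generated. The bibliographic details (titles, journals, volumes, pages) and the helper `to_bibtex` are not from this thread; they are filled in from the commonly cited records for these two papers, so please check them against the originals before using them.

```python
# Assumption: these fields are the commonly cited bibliographic records for the
# two papers; they do not come from the thread and should be verified.
REFERENCES = {
    "anderson1936": {
        "author": "Anderson, Edgar",
        "title": "The species problem in Iris",
        "journal": "Annals of the Missouri Botanical Garden",
        "year": "1936",
        "volume": "23",
        "number": "3",
        "pages": "457--509",
    },
    "fisher1936": {
        "author": "Fisher, R. A.",
        "title": "The use of multiple measurements in taxonomic problems",
        "journal": "Annals of Eugenics",
        "year": "1936",
        "volume": "7",
        "number": "2",
        "pages": "179--188",
    },
}


def to_bibtex(key, fields):
    """Render one reference (hypothetical helper) as a BibTeX @article entry."""
    lines = [f"@article{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields.items()]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Print both entries, separated by a blank line, ready to paste into a .bib file.
    for key, fields in REFERENCES.items():
        print(to_bibtex(key, fields))
        print()
```

In the paper you would then cite the dataset with something like `\cite{anderson1936,fisher1936}`, and cite scikit-learn separately as software, following the citation request in its own documentation.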

martino · Answer 2 · 2020-03-19T12:24:23.407

-3

I think that citing scikit-learn is sufficient. According to the scikit-learn documentation, you should cite their paper. You can always add a reference to the iris dataset in scikit-learn by providing a link to the page.

EDIT - I stand corrected. The accepted answer is spot on.

edited Mar 19 '20 at 12:24

answered Oct 13 '14 at 12:25

Part of the purpose of a citation is to allow others to find the same data source, and a link would serve this purpose very well. But another purpose of citation is to provide academic credit, which in this case belongs less to the developers and more to the researchers who collated (and perhaps popularised) the data set. It's in this second aspect that I find this answer less convincing. – Silverfish Nov 27 '14 at 07:28
