19

I am new to machine learning and am looking for datasets with which I can compare and contrast different machine learning algorithms (decision trees, boosting, SVM, and neural networks).

Where can I find such datasets? What should I look for when considering a dataset?

It would be great if you can point to some good datasets and also tell me what makes them a good dataset?

Bunny Rabbit
  • 6
    I wonder whether this question would be a better fit for http://opendata.stackexchange.com/ ... As for datasets, most textbooks mention such datasets and make them available, and many are already available in statistical software or in libraries for such software. See also https://archive.ics.uci.edu/ml/datasets.html . Of course, another question is what makes some datasets "good" for learning and some "bad"; that is an interesting question. – Tim Aug 31 '15 at 08:06
  • You will find some datasets as packages on CRAN, like: ElemStatLearn and others. – kjetil b halvorsen Aug 31 '15 at 09:18
  • 2
    @Tim Because there is a *pedagogical* aspect to this question (for example, one example of a "good" data set for learning purposes is one that shows where different algorithms give very different results) I think it's better suited to CV than to OpenData. – Silverfish Aug 31 '15 at 14:59
  • 3
    I think questions about data sets from a pedagogical point of view are definitely on-topic here: e.g. [What aspects of the “Iris” data set make it so successful as an example/teaching/test data set](http://stats.stackexchange.com/questions/74776/what-aspects-of-the-iris-data-set-make-it-so-successful-as-an-example-teaching); [Datasets constructed for a purpose similar to that of Anscombe's quartet](http://stats.stackexchange.com/questions/80196/datasets-constructed-for-a-purpose-similar-to-that-of-anscombes-quartet) – Silverfish Aug 31 '15 at 15:01
  • 1
    @Silverfish: This has been discussed on Meta - [“Questions about Datasets”: Possible Exceptions?](http://meta.stats.stackexchange.com/q/2096/17230) - & there seems to have been general agreement with your point of view. But I still think this q. is rather broad - what clearly distinguishes it from [Locating freely available data samples](http://stats.stackexchange.com/q/7/17230)? – Scortchi - Reinstate Monica Sep 02 '15 at 15:29
  • @Scortchi Thanks - I would certainly be happier with a narrower thread - one that focuses on "good" data sets for a particular pedagogical point (eg to highlight the difference between two specific algorithms). But I'm uncertain whether this is too broad to be answerable. There are certainly some "well-known data sets that learners often practise ML on" and in that respect it might differ from your link, but I sympathise with your viewpoint. – Silverfish Sep 02 '15 at 15:40

5 Answers

16

The data sets on the following sites are freely available. They have been used to teach ML algorithms to students because most of them come with descriptions, and many also note which kinds of algorithms are applicable.

  1. UCI Machine Learning Repository
  2. ML Comp
  3. Mammo Image
  4. Mulan
Learner
11

Kaggle has a whole host of datasets you can use to practice with.

(I'm surprised it wasn't mentioned so far!)

It's got two things (among many others) that make it an invaluable resource:

  • Lots of clean datasets. While noise-free datasets aren't really representative of real-world data, they're especially well suited to your purpose: applying ML algorithms.
  • You can also view others' ML models for the same dataset, which can be a fun way to pick up some tricks along the way. Learning from the best practitioners is, as with anything else, very helpful.
nz_21
  • 1
    This really should be the top answer because in addition to an enormous variety of datasets, the forums for each challenge are an invaluable resource for picking up techniques and tricks, along with code examples. – Alex R. Apr 14 '17 at 18:18
2

First, I'd recommend starting with the sample data that is provided with the software. Most software distributions include example data that you can use to get familiar with the algorithm without dealing with data types and wrestling the data into the right format for the algorithm. Even if you are building an algorithm from scratch, you can start with the sample from a similar implementation and compare the performance.

Second, I'd recommend experimenting with synthetic data sets to get a feel for how the algorithm performs when you know both how the data were generated and the signal-to-noise ratio.
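
To illustrate the synthetic-data idea, here is a minimal sketch in Python using only the standard library. The helper names (make_linear_data, fit_line) and the chosen slope, intercept, and noise levels are my own assumptions for illustration, not something from a particular package. Because the generating process is known, you can see how far the fitted parameters drift from the truth as the noise grows.

```python
import random
import statistics


def make_linear_data(n, slope, intercept, noise_sd, seed=0):
    """Generate (xs, ys) from y = slope * x + intercept + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The true parameters are known (slope 2, intercept 1), so each fit can be
# compared against the truth at increasing noise levels.
for noise_sd in (0.0, 1.0, 5.0):
    xs, ys = make_linear_data(n=200, slope=2.0, intercept=1.0, noise_sd=noise_sd)
    slope, intercept = fit_line(xs, ys)
    print(f"noise_sd={noise_sd}: slope={slope:.3f}, intercept={intercept:.3f}")
```

The same pattern carries over to classifiers: generate labels from a rule you control, add noise, and check how each algorithm copes.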

In R, you can list all datasets in the currently installed packages with this command:

data(package = installed.packages()[, 1])

The R package mlbench has real datasets and can generate synthetic datasets that are useful for studying algorithm performance.

Python's scikit-learn also ships sample data and can generate synthetic (toy) datasets.
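
As a rough sketch of the scikit-learn route (assuming scikit-learn is installed), the snippet below loads one bundled dataset and generates one toy dataset. The sample size, noise level, and random seed are arbitrary choices of mine for illustration.

```python
from sklearn.datasets import load_iris, make_moons

# Bundled real data: the Iris measurements as a feature matrix and label vector.
X_iris, y_iris = load_iris(return_X_y=True)
print("iris:", X_iris.shape, "classes:", len(set(y_iris.tolist())))

# Synthetic toy data: two interleaving half-circles with adjustable noise.
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=0)
print("moons:", X_moons.shape, "classes:", len(set(y_moons.tolist())))
```

Raising the noise argument makes the two classes overlap more, which is a simple way to vary the difficulty of the problem.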

SAS has training datasets available for download, and the SPSS sample data is installed with the software at C:\Program Files\IBM\SPSS\Statistics\22\Samples

Lastly, I'd look at data in the wild. I'd compare the performance of different algorithms and tuning parameters on real data sets. This usually requires a lot more work because you will rarely find datasets with data types and structures that you can drop right into your algorithms.

For data in the wild, I'd recommend:

reddit's Dataset Archive

KDnuggets' list

brandco
  • 1
    For those who don't have R, & don't want to download it just to get access to these datasets, the datasets & descriptions are available online [here](https://vincentarelbundock.github.io/Rdatasets/datasets.html). – gung - Reinstate Monica Apr 17 '17 at 20:08
0

In my opinion, you should start with small datasets that do not have too many features.

One example would be the Iris dataset (for classification). It has 3 classes with 50 samples each, for a total of 150 data points. One excellent resource to help you explore this dataset is this video series by Data School.
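
Here is a hedged sketch of how Iris could be used to compare the four algorithm families named in the question, assuming scikit-learn is available. The function name compare_on_iris, the default hyperparameters, the scaling step, and the choice of 5-fold cross-validation are my own assumptions, not a recommended benchmark setup.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_on_iris(cv=5):
    """Return the mean cross-validated accuracy of each model on Iris."""
    X, y = load_iris(return_X_y=True)
    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "boosting": GradientBoostingClassifier(random_state=0),
        # SVMs and neural networks are sensitive to feature scale,
        # so the features are standardized inside each fold.
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "neural network": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)
        ),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in models.items()
    }


for name, accuracy in compare_on_iris().items():
    print(f"{name}: mean CV accuracy = {accuracy:.3f}")
```

On a dataset this small the differences between models can easily be within the fold-to-fold variation, so treat the numbers as a starting point for exploration.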

Another dataset to check out is the Wine Quality data set from the UCI ML repository. It has 4898 data points with 12 attributes.

kjetil b halvorsen
0

The Iris data set, hands down. It's in base R as well.

apples-oranges
  • 2
    Please respond to the substantive part of the question: "... also tell me what makes them a good dataset?" – whuber Apr 05 '17 at 20:42