36

What freely available data sets are there for classification with more than 1000 features (or sample points, if the data consist of curves)?

There is already a community wiki about free data sets: Locating freely available data samples

Here, though, it would be nice to have a more focused list that is more convenient to use. I propose the following rules:

  1. One post per dataset
  2. No links to collections of datasets
  3. Each dataset must come with

    • a name (to indicate what it is about) and a link to the dataset (R datasets can be identified by package name)

    • the number of features (call it p), the size of the dataset (call it n), and the number of labels/classes (call it k)

    • a typical error rate from your own experience (state the algorithm used in two words) or from the literature (in that case, link to the paper)

robin girard

5 Answers

3

Arcene
n=900
p=10000 (3k is artificially added noise)
k=2 (~balanced)
From NIPS2003.

Peter Smit
3

Dexter
n=2600
p=20000 (10k+53 is artificial noise)
k=2 (balanced)
From NIPS2003.

3

Dorothea
n=1950
p=100000 (0.1M, half is artificially added noise)
k=2 (~10x unbalanced)
From NIPS2003.

  • Can you explain how this has 100000 features? I looked at the training data, and each line has maybe 2500 integers. – JeremyKun Jan 19 '16 at 00:03
  • It is a sparse array: an integer N on a line means that attribute N has value 1 for that sample. –  Jan 19 '16 at 14:27
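To make the sparse encoding described in the comment above concrete, here is a minimal Python sketch that expands such lines into dense 0/1 rows. The function name `parse_sparse_binary`, the toy input, and the assumption that indices are 1-based and whitespace-separated are illustrative only; check the dataset's own documentation before relying on them.

```python
def parse_sparse_binary(lines, n_features):
    """Expand lines of active-feature indices into dense 0/1 rows.

    Assumption (not confirmed by the thread): indices are 1-based
    and separated by whitespace.
    """
    rows = []
    for line in lines:
        row = [0] * n_features
        for token in line.split():
            row[int(token) - 1] = 1  # attribute N is stored at position N-1
        rows.append(row)
    return rows


# Toy example with 5 features instead of 100000.
sample = ["1 4 5", "2", "3 5"]
X = parse_sparse_binary(sample, n_features=5)
for row in X:
    print(row)
```

For the real data, a dense matrix with 100000 columns is wasteful, so a sparse matrix type would likely be a better fit; the dense version is used here only to show what the encoding means.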
3

Gisette
n=13500
p=5000 (half is artificially added noise)
k=2 (balanced)
From NIPS2003.

2

Prostate (gene expression array)

  • k=2
  • n=50+52
  • p=6033

Available via (among others) the R package spls; the dataset is named prostate.

Error rate = 3/102 (see here). I also think there are papers that report an error rate of 1/102. I would say this is an easy test case.

robin girard