Free data set for very high dimensional classification

Question

What are the freely available data sets for classification with more than 1000 features (or sample points, if the data consist of curves)?

There is already a community wiki about free data sets: Locating freely available data samples

But here, it would be nice to have a more focused list that can be used more conveniently. I also propose the following rules:

One post per dataset
No links to collections of datasets
Each dataset must be associated with:
- a name (to indicate what it is about) and a link to the dataset (R datasets can be identified by package name)
- the number of features (call it p), the size of the dataset (call it n), and the number of labels/classes (call it k)
- a typical error rate from your experience (state the algorithm used in two words) or from the literature (in the latter case, link the paper)

+1, but the ones from NIPS2003 have train.labels only -- the NIPS2003 paper says clearly "validation and test set labels are withheld". — denis, Oct 24 '11 at 10:00
Thanks. The comment about NIPS is for the answer from @mbq . — robin girard, Oct 24 '11 at 10:02
Anyone here have a high dimensional dataset with more than two class labels? — hlin117, Nov 14 '15 at 02:48

score 3 · Answer 1 · edited Jul 30 '10 at 18:00

Arcene
n=900
p=10000 (3k is artificially added noise)
k=2 (~balanced)
From NIPS2003.

Peter Smit

answered Jul 29 '10 at 22:30

score 3 · Answer 2 · 2010-07-29T22:41:53.950

Dexter
n=2600
p=20000 (10k+53 is artificial noise)
k=2 (balanced)
From NIPS2003.

edited Jul 29 '10 at 22:41

answered Jul 29 '10 at 22:32

I don't understand... one set per person? – Jul 30 '10 at 17:40
@robin & @mbq I would suggest keeping it to one dataset per post. That way people can indicate with votes which of the suggested datasets they also support. – Peter Smit Jul 30 '10 at 17:59
@Peter, OK, I follow your idea, I have changed the question accordingly. – robin girard Jul 31 '10 at 06:14

score 3 · Accepted Answer · 2010-07-29T22:41:33.450

Dorothea
n=1950
p=100000 (0.1M, half is artificially added noise)
k=2 (~10x unbalanced)
From NIPS2003.

edited Jul 29 '10 at 22:41

answered Jul 29 '10 at 22:35

Can you explain how this is 100000 features? When I look at the training data, each line has maybe 2500 integers. – JeremyKun Jan 19 '16 at 00:03
It is a sparse array: an integer N on a line means that attribute N has value 1. – Jan 19 '16 at 14:27
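
To make the sparse format described in the comment above concrete, here is a minimal Python sketch that expands one line of active-attribute indices into a dense 0/1 vector. It assumes the indices are whitespace-separated and 1-based, which should be checked against the dataset documentation; the function name `expand_sparse_line` and the toy input are made up for illustration and are not part of the dataset distribution.

```python
def expand_sparse_line(line, n_features, one_based=True):
    """Expand a line of active-attribute indices into a dense 0/1 list.

    Assumption: indices are whitespace-separated integers, 1-based by default.
    """
    offset = 1 if one_based else 0
    row = [0] * n_features
    for token in line.split():
        idx = int(token) - offset
        if not 0 <= idx < n_features:
            raise ValueError(f"index {token} is outside 1..{n_features}")
        row[idx] = 1  # integer N present means attribute N has value 1
    return row


# Toy example with 10 attributes instead of 100000.
dense = expand_sparse_line("2 5 9", n_features=10)
print(dense)
```

With p=100000 and only a few thousand active attributes per sample, a dense list like this wastes memory, so in practice a sparse matrix type would probably be a better fit; the dense version is only meant to show what the integers encode.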

score 3 · Answer 4 · answered Jul 29 '10 at 22:38

Gisette
n=13500
p=5000 (half is artificially added noise)
k=2 (balanced)
From NIPS2003.

score 2 · Answer 5 · 2010-11-09T08:29:57.363

Prostate (gene expression array)

Available via (among others) the R package spls; the dataset is named prostate.

Error rate = 3/102 (see here). I also think there are papers that report a 1/102 error rate. I would say this is an easy test case.

edited Nov 09 '10 at 08:29

answered Aug 08 '10 at 19:02

robin girard