Questions tagged [dataset]

Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

Datasets are structured data files in any format, collected together with the documentation that explains their production or use.

1779 questions
169
votes
16 answers

Are large data sets inappropriate for hypothesis testing?

In a recent article of Amstat News, the authors (Mark van der Laan and Sherri Rose) stated that "We know that for large enough sample sizes, every study—including ones in which the null hypothesis of no effect is true — will declare a statistically…
103
votes
25 answers

Locating freely available data samples

I've been working on a new method for analyzing and parsing datasets to identify and isolate subgroups of a population without foreknowledge of any subgroup's characteristics. While the method works well enough with artificial data samples (i.e.…
EAMann
94
votes
6 answers

Essential data checking tests

In my job role I often work with other people's datasets: non-experts bring me clinical data, and I help them to summarise it and perform statistical tests. The problem I am having is that the datasets I am brought are almost always riddled with…
Chris Beeley
77
votes
2 answers

How to normalize data between -1 and 1?

I have seen the min-max normalization formula but that normalizes values between 0 and 1. How would I normalize my data between -1 and 1? I have both negative and positive values in my data matrix.
covfefe
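
For the normalization question above, one common approach is to generalize the min-max formula to an arbitrary target range: x' = a + (x - min)(b - a) / (max - min), with a = -1 and b = 1. The following is a minimal sketch, not a reference answer from the thread; the function name `rescale` and its parameters are hypothetical, and it assumes a flat list of numbers rather than a full data matrix (for a matrix, it would typically be applied per column).

```python
def rescale(values, new_min=-1.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max.

    Hypothetical helper for illustration; assumes a non-empty flat sequence of numbers.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant input has zero range, so the linear map is undefined.
        raise ValueError("cannot rescale constant data")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


data = [-2, 0, 2]
print(rescale(data))
```

Note that this maps the observed minimum and maximum to exactly -1 and 1, so a single outlier will compress the rest of the values toward one end of the range.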
68
votes
8 answers

How to simulate data that satisfy specific constraints such as having specific mean and standard deviation?

This question is motivated by my question on meta-analysis. But I imagine that it would also be useful in teaching contexts where you want to create a dataset that exactly mirrors an existing published dataset. I know how to generate random data…
Jeromy Anglim
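
For the simulation question above, one simple way to get a sample whose mean and standard deviation match the targets exactly (not just in expectation) is to draw any random sample, standardize it using its own sample mean and SD, and then shift and scale it. The sketch below assumes normally distributed draws and the n - 1 (sample) standard deviation; the name `simulate_exact` and its parameters are hypothetical, and this is only one of several possible approaches.

```python
import random
import statistics


def simulate_exact(n, target_mean, target_sd, seed=None):
    """Return n values whose sample mean and sample SD equal the targets.

    Hypothetical helper for illustration; uses the n - 1 denominator for the SD.
    """
    if n < 2:
        raise ValueError("need at least two observations to define a sample SD")
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = statistics.fmean(draws)
    s = statistics.stdev(draws)
    # Standardize with the sample's own moments, then shift and scale.
    return [target_mean + target_sd * (x - m) / s for x in draws]


sample = simulate_exact(50, target_mean=10.0, target_sd=2.0, seed=1)
print(statistics.fmean(sample), statistics.stdev(sample))
```

The match holds up to floating-point rounding. The resulting values are no longer independent draws, since they are constrained by the fixed mean and SD.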
53
votes
3 answers

Data APIs/feeds available as packages in R

EDIT: The Web Technologies and Services CRAN task view contains a much more comprehensive list of data sources and APIs available in R. You can submit a pull request on github if you wish to add a package to the task view. I'm making a list of the…
Zach
44
votes
9 answers

Tiny (real) datasets for giving examples in class?

When teaching an introductory level class, the teachers I know tend to invent some numbers and a story in order to exemplify the method they are teaching. What I would prefer is to tell a real story with real numbers. However, these stories need…
Tal Galili
43
votes
8 answers

How do I get people to take better care of data?

My workplace has employees from a very wide range of disciplines, so we generate data in lots of different forms. Consequently, each team has developed its own system for storing data. Some use Access or SQL databases; some teams (to my horror)…
Richie Cotton
40
votes
2 answers

How to draw valid conclusions from "big data"?

"Big data" is everywhere in the media. Everybody says that "big data" is the big thing for 2012, e.g. KDNuggets poll on hot topics for 2012. However, I have deep concerns here. With big data, everybody seems to be happy just to get anything out. But…
Has QUIT--Anony-Mousse
36
votes
5 answers

Free data set for very high dimensional classification

What are the freely available data sets for classification with more than 1000 features (or sample points if it contains curves)? There is already a community wiki about free data sets: Locating freely available data samples. But here, it would be…
robin girard
35
votes
3 answers

Datasets constructed for a purpose similar to that of Anscombe's quartet

I've just come across Anscombe's quartet (four datasets that have almost indistinguishable descriptive statistics but look very different when plotted) and I am curious if there are other more or less well-known datasets that have been created to…
Hibernating
35
votes
5 answers

What if my linear regression data contains several co-mingled linear relationships?

Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear…
SlowMagic
34
votes
2 answers

Performing a statistical test after visualizing data - data dredging?

I'll pose this question by means of an example. Suppose I have a data set, such as the Boston housing price data set, in which I have continuous and categorical variables. Here, we have a "quality" variable, from 1 to 10, and the sale price. I…
32
votes
3 answers

Visualizing the intersections of many sets

Is there a visualization model that is good for showing the intersection overlap of many sets? I am thinking of something like Venn diagrams, but one that might lend itself better to a larger number of sets, such as 10 or more. Wikipedia does show…
Kyle Brandt
28
votes
2 answers

What aspects of the "Iris" data set make it so successful as an example/teaching/test data set

The "Iris" dataset is probably familiar to most people here - it's one of the canonical test data sets and a go-to example dataset for everything from data visualization to machine learning. For example, everyone in this question ended up using it…
Fomite