Questions tagged [large-data]

'Large data' refers to situations where the number of observations (data points) is so large that it necessitates changes in the way the data analyst thinks about or conducts the analysis. (Not to be confused with 'high dimensionality'.)

A sufficiently large number of observations for an analysis may require changes in the way the data analysis proceeds, or in the way it is understood.

Some examples where the process may need to be adapted are:

  • special strategies may be required if there are more data than can fit in a computer's memory (see the sketch after this list),
  • the analyst may need to pay attention to the computational efficiency of different optimization algorithms,
  • consideration needs to be given to how to visualize the data effectively, since standard plots (e.g., a scatterplot) would just display a large black spot due to overlapping points.
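As a minimal sketch of the memory point above (the one promised in the first bullet), assuming pandas and a hypothetical file big_file.csv with a numeric column value, a statistic can be accumulated over chunks rather than loading everything at once:

```python
# Sketch: compute a mean over a file too large to fit in memory by
# streaming it in chunks. File name and column name are hypothetical.
import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean of 'value':", total / count)
```

For the visualization bullet, alpha blending or binned plots (e.g., matplotlib's hexbin) are common ways to avoid the "large black spot".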

A common example of a case where analysts conceptualize an aspect of the process differently concerns statistical significance. With sufficient data, any difference, no matter how practically trivial, will be statistically 'significant'. This fact leads many analysts to view findings of significance differently than when smaller data sets are available; the simulation sketch below illustrates the point.
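A hedged simulation sketch, using NumPy and SciPy: a difference of 0.01 standard deviations is practically negligible, yet with a million observations per group it is declared highly significant.

```python
# Sketch: a trivially small true effect becomes statistically
# 'significant' once the sample size is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000
a = rng.normal(0.00, 1.0, n)
b = rng.normal(0.01, 1.0, n)  # true difference: 0.01 standard deviations

t, p = stats.ttest_ind(a, b)
print(f"t = {t:.2f}, p = {p:.2g}")  # tiny p-value despite a negligible effect
```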

531 questions
218 votes • 13 answers

How should I transform non-negative data including zeros?

If I have highly skewed positive data I often take logs. But what should I do with highly skewed non-negative data that include zeros? I have seen two transformations used: $\log(x+1)$, which has the neat feature that 0 maps to 0, and $\log(x+c)$, where…
Rob Hyndman • 51,928
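A minimal illustration of the two transformations the question mentions, assuming NumPy; the choice of c below is one common ad hoc rule, not a recommendation:

```python
# Sketch: log(x + 1) vs. log(x + c) on skewed non-negative data with zeros.
import numpy as np

x = np.array([0.0, 1.0, 10.0, 1000.0])

y1 = np.log1p(x)        # log(x + 1): maps 0 to 0, accurate for small x
c = x[x > 0].min() / 2  # one common ad hoc choice: half the smallest positive value
y2 = np.log(x + c)      # log(x + c)

print(y1)
print(y2)
```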
169 votes • 16 answers

Are large data sets inappropriate for hypothesis testing?

In a recent article in Amstat News, the authors (Mark van der Laan and Sherri Rose) stated that "We know that for large enough sample sizes, every study—including ones in which the null hypothesis of no effect is true—will declare a statistically…
114 votes • 5 answers

What skills are required to perform large scale statistical analyses?

Many statistical jobs ask for experience with large scale data. What are the sorts of statistical and computational skills that would be needed for working with large data sets? For example, how about building regression models given a data set with…
59 votes • 7 answers

Industry vs Kaggle challenges. Is collecting more observations and having access to more variables more important than fancy modelling?

I'd hope the title is self-explanatory. On Kaggle, most winners use stacking, sometimes with hundreds of base models, to squeeze out a few extra % of MSE or accuracy... In general, in your experience, how important is fancy modelling such as stacking vs…
Tom • 1,204
55 votes • 8 answers

Is sampling relevant in the time of 'big data'?

Or more so, "will it be"? Big Data makes statistics and relevant knowledge all the more important but seems to underplay Sampling Theory. I've seen this hype around 'Big Data' and can't help but wonder why I would want to analyze everything?…
PhD • 13,429
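One small sketch of why sampling can remain relevant: under simple random sampling, a modest sample estimates a population mean with a quantifiable standard error, at a fraction of the cost of touching every record. The data and sizes below are illustrative.

```python
# Sketch: a 10,000-point random sample recovers the mean of a
# 10-million-point 'population' to within its standard error.
import numpy as np

rng = np.random.default_rng(1)
population = rng.lognormal(mean=0.0, sigma=1.0, size=10_000_000)

sample = rng.choice(population, size=10_000, replace=False)
se = sample.std(ddof=1) / np.sqrt(sample.size)

print("population mean:", population.mean())
print("sample estimate:", sample.mean(), "+/-", se)
```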
54 votes • 10 answers

What is a good algorithm for estimating the median of a huge read-once data set?

I'm looking for a good algorithm (meaning minimal computation, minimal storage requirements) to estimate the median of a data set that is too large to store, such that each value can only be read once (unless you explicitly store that value). There…
PeterR • 1,712
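One simple candidate (a stochastic-approximation estimator, not necessarily what the answers propose): keep a single running value and nudge it toward each incoming observation by a fixed step. It uses O(1) memory and a single pass; the step size trades accuracy against adaptivity.

```python
# Sketch: single-pass, O(1)-memory median estimation by nudging a running
# estimate up or down for each observation.
import numpy as np

def streaming_median(stream, step=0.1):
    m = None
    for x in stream:
        if m is None:
            m = x          # initialize with the first observation
        elif x > m:
            m += step      # estimate is too low: nudge up
        elif x < m:
            m -= step      # estimate is too high: nudge down
    return m

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)
print("estimate:", streaming_median(data), "| exact:", np.median(data))
```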
45 votes • 10 answers

What exactly is Big Data?

I have been asked on several occasions: what is Big Data? Both by students and by relatives who are picking up the buzz around statistics and ML. I found this CV post, and I feel that I agree with the only answer there. The…
Gumeo • 3,551
40 votes • 4 answers

Polynomial regression using scikit-learn

I am trying to use scikit-learn for polynomial regression. From what I read, polynomial regression is a special case of linear regression. I was hoping that maybe one of scikit's generalized linear models can be parameterised to fit higher order…
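For reference, scikit-learn does treat polynomial regression as linear regression on expanded features; a minimal sketch with synthetic data:

```python
# Sketch: degree-2 polynomial regression = linear regression on
# polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 - 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.1, 200)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict(np.array([[2.0]])))  # close to 1 - 4 + 2 = -1
```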
40 votes • 6 answers

Effect size as the hypothesis for significance testing

Today, at the Cross Validated Journal Club (why weren't you there?), @mbq asked: Do you think we (modern data scientists) know what significance means? And how it relates to our confidence in our results? @Michelle replied as some (including me)…
Carlos Accioly • 4,715
40 votes • 2 answers

How to draw valid conclusions from "big data"?

"Big data" is everywhere in the media. Everybody says that "big data" is the big thing for 2012, e.g. KDNuggets poll on hot topics for 2012. However, I have deep concerns here. With big data, everybody seems to be happy just to get anything out. But…
Has QUIT--Anony-Mousse • 39,639
36 votes • 5 answers

Free data set for very high dimensional classification

What are the freely available data sets for classification with more than 1000 features (or sample points, if it contains curves)? There is already a community wiki about free data sets: Locating freely available data samples. But here, it would be…
robin girard • 6,335
30 votes • 4 answers

Why should I be Bayesian when my dataset is large?

From "Why should I be Bayesian when my model is wrong?", one of the key benefits of Bayesian inference to be able to inject exogenous domain knowledge into the model, in the form of a prior. This is especially useful when you don't have enough…
kennysong • 931
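A small sketch of the trade-off the question raises: in a conjugate Beta-Bernoulli model, even a strongly informative prior is overwhelmed once n is large, so the benefit of the prior fades. The numbers below are illustrative.

```python
# Sketch: the posterior mean (a + k) / (a + b + n) for a Beta(a, b) prior
# and k successes in n Bernoulli trials; the prior washes out for large n.
import numpy as np

rng = np.random.default_rng(4)
data = rng.binomial(1, 0.7, size=1_000_000)  # true success rate 0.7
k, n = data.sum(), data.size

for a, b in [(1, 1), (100, 100)]:  # flat prior vs. strong prior centered at 0.5
    post_mean = (a + k) / (a + b + n)
    print(f"prior Beta({a},{b}) -> posterior mean {post_mean:.4f}")
```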
27 votes • 9 answers

Statistics and data mining software tools for dealing with large datasets

Currently I have to analyze approximately 20M records and build prediction models. So far I have tried out Statistica, SPSS, RapidMiner and R. Among these, Statistica seems to be the most suitable for data mining, and the RapidMiner user interface is…
niko • 1,261
25 votes • 1 answer

State of the art in streaming learning

I have been working with large data sets lately and found a lot of papers on streaming methods. To name a few: Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1…
RUser4512 • 9,226
24 votes • 1 answer

How to visualize an enormous sparse contingency table?

I have two variables: Drug Name (DN) and corresponding Adverse Events (AE), which stand in a many-to-many relation. There are 33,556 drug names and 9,516 adverse events. The sample size is about 5.8 million observations. I want to study and…
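One hedged sketch for a table this size: store the counts sparsely and plot only a dense corner (e.g., the most frequent drugs and events), since a full 33,556 x 9,516 heatmap is unreadable. The integer-coded data below are simulated stand-ins.

```python
# Sketch: sparse DN-by-AE count matrix, visualized via its densest block.
import matplotlib.pyplot as plt
import numpy as np
from scipy import sparse

rng = np.random.default_rng(5)           # simulated (drug, event) reports
dn = rng.integers(0, 33_556, size=100_000)
ae = rng.integers(0, 9_516, size=100_000)

counts = sparse.coo_matrix(
    (np.ones_like(dn), (dn, ae)), shape=(33_556, 9_516)
).tocsr()                                # duplicate pairs are summed

top_dn = np.argsort(-counts.sum(axis=1).A.ravel())[:50]  # 50 busiest drugs
top_ae = np.argsort(-counts.sum(axis=0).A.ravel())[:50]  # 50 busiest events
block = counts[top_dn][:, top_ae].toarray()

plt.imshow(block, aspect="auto", cmap="viridis")
plt.xlabel("top adverse events")
plt.ylabel("top drug names")
plt.show()
```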