Most Popular

1500 questions
95 votes, 4 answers

Does the variance of a sum equal the sum of the variances?

Is it (always) true that $$\mathrm{Var}\left(\sum\limits_{i=1}^m{X_i}\right) = \sum\limits_{i=1}^m{\mathrm{Var}(X_i)} \>?$$
Abe
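As a point of reference for the question above (a standard identity, not a summary of the answers, which are not shown on this page): for random variables with finite variances,

$$\mathrm{Var}\left(\sum_{i=1}^m X_i\right) = \sum_{i=1}^m \mathrm{Var}(X_i) + 2\sum_{1 \le i < j \le m} \mathrm{Cov}(X_i, X_j),$$

so the two sides in the question agree exactly when the pairwise covariances sum to zero, for example when the $X_i$ are pairwise uncorrelated.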
95 votes, 6 answers

Convergence in probability vs. almost sure convergence

I've never really grokked the difference between these two measures of convergence. (Or, in fact, any of the different types of convergence, but I mention these two in particular because of the Weak and Strong Laws of Large Numbers.) Sure, I can…
raegtin
95 votes, 2 answers

How much do we know about p-hacking "in the wild"?

The phrase p-hacking (also: "data dredging", "snooping" or "fishing") refers to various kinds of statistical malpractice in which results become artificially statistically significant. There are many ways to procure a "more significant" result,…
95 votes, 5 answers

How to calculate Area Under the Curve (AUC), or the c-statistic, by hand

I am interested in calculating area under the curve (AUC), or the c-statistic, by hand for a binary logistic regression model. For example, in the validation dataset, I have the true value for the dependent variable, retention (1 = retained; 0 = not…
Matt Reichenbach
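One common way to get the c-statistic "by hand" for the AUC question above is to count concordant pairs: the AUC equals the share of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as one half. The sketch below assumes that definition; the function name `auc_by_hand` and the six observations are made up for illustration and are not taken from the question.

```python
def auc_by_hand(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0   # positive ranked above negative
            elif p == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(pos) * len(neg))


# Hypothetical validation data: 1 = retained, 0 = not retained.
y = [1, 1, 0, 1, 0, 0]
predicted = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
auc = auc_by_hand(y, predicted)
print(auc)
```

The double loop is quadratic in the sample size, which is fine for a hand check; a rank-based formula would be the usual choice for large datasets.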
95 votes, 5 answers

Loadings vs eigenvectors in PCA: when to use one or another?

In principal component analysis (PCA), we get eigenvectors (unit vectors) and eigenvalues. Now, let us define loadings as $$\text{Loadings} = \text{Eigenvectors} \cdot \sqrt{\text{Eigenvalues}}.$$ I know that eigenvectors are just directions and…
user2696565
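The definition of loadings quoted in the PCA question above can be sketched numerically. The example assumes PCA on the sample covariance matrix and uses simulated data; the variable names and the mixing matrix are illustrative only. It shows that eigenvector columns have unit length while loading columns carry the scale, so that the loadings reproduce the covariance matrix.

```python
import numpy as np

# Simulated data (an assumption for illustration): 200 rows, 3 correlated columns.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing

# PCA via eigendecomposition of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending order

# Sort components from largest to smallest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Loadings = eigenvectors scaled column-wise by sqrt(eigenvalues).
loadings = eigenvectors * np.sqrt(eigenvalues)

print(np.linalg.norm(eigenvectors, axis=0))  # unit length per component
print(np.sum(loadings**2, axis=0))           # equals the eigenvalues
```

With this scaling, `loadings @ loadings.T` recovers the covariance matrix, which is one reason loadings are often preferred when interpreting how much variance each variable shares with a component.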
95 votes, 8 answers

If mean is so sensitive, why use it in the first place?

It is a known fact that the median is resistant to outliers. If that is the case, when and why would we use the mean in the first place? One thing I can think of perhaps is to understand the presence of outliers, i.e., if the median is far from the mean,…
Legend
95 votes, 6 answers

What is the difference between a multiclass and a multilabel problem?

What is the difference between a multiclass problem and a multilabel problem?
Learner
94 votes, 12 answers

Who Are The Bayesians?

As one becomes interested in statistics, the dichotomy "Frequentist" vs. "Bayesian" soon becomes commonplace (and who hasn't read Nate Silver's The Signal and the Noise, anyway?). In talks and introductory courses, the point of view is…
Antoni Parellada
94 votes, 6 answers

Essential data checking tests

In my job role I often work with other people's datasets: non-experts bring me clinical data and I help them to summarise it and perform statistical tests. The problem I am having is that the datasets I am brought are almost always riddled with…
Chris Beeley
93 votes, 2 answers

When to use regularization methods for regression?

In what circumstances should one consider using regularization methods (ridge, lasso or least angle regression) instead of OLS? In case this helps steer the discussion, my main interest is improving predictive accuracy.
NPE
93 votes, 1 answer

What is an ablation study? And is there a systematic way to perform it?

What is an ablation study? And is there a systematic way to perform it? For example, I have $n$ predictors in a linear regression, which I will call my model. How would I perform an ablation study on this? What metrics should I use? A…
cgo
93 votes, 7 answers

The Book of Why by Judea Pearl: Why is he bashing statistics?

I am reading The Book of Why by Judea Pearl, and it is getting under my skin. Specifically, it appears to me that he is unconditionally bashing "classical" statistics by putting up a straw man argument that statistics is never, ever able to…
January
93 votes, 4 answers

How does the correlation coefficient differ from regression slope?

I would have expected the correlation coefficient to be the same as a regression slope (beta); however, having just compared the two, I find they are different. How do they differ, and what different information do they give?
luciano
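For the correlation-versus-slope question above, a small numeric sketch may help. In simple linear regression of y on x, the least-squares slope equals the correlation multiplied by the ratio of standard deviations (slope = r · s_y / s_x), so the slope changes with the units of measurement while the correlation does not. The data and the helper name `slope_and_correlation` are made up for illustration.

```python
from math import sqrt


def slope_and_correlation(x, y):
    """Return (least-squares slope of y on x, Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    r = sxy / sqrt(sxx * syy)
    return slope, r


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.0, 19.0, 33.0, 38.0, 52.0]

slope, r = slope_and_correlation(x, y)

# Re-express y in different units (multiply by 1000): the slope scales, r does not.
y_rescaled = [1000.0 * v for v in y]
slope_rescaled, r_rescaled = slope_and_correlation(x, y_rescaled)

print(slope, r)
print(slope_rescaled, r_rescaled)
```

The two quantities coincide only when x and y have the same standard deviation, for instance after both have been standardized.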
93 votes, 9 answers

Are there any examples where Bayesian credible intervals are obviously inferior to frequentist confidence intervals?

A recent question on the difference between confidence and credible intervals led me to start re-reading Edwin Jaynes' article on that topic: Jaynes, E. T., 1976. "Confidence Intervals vs Bayesian Intervals," in Foundations of Probability Theory,…
Dikran Marsupial
93 votes, 6 answers

Principled way of collapsing categorical variables with many levels?

What techniques are available for collapsing (or pooling) many categories to a few, for the purpose of using them as an input (predictor) in a statistical model? Consider a variable like college student major (discipline chosen by an undergraduate…