Most Popular

1500 questions
107
votes
7 answers

T-test for non normal when N>50?

Long ago I learned that a normal distribution was necessary to use a two-sample t-test. Today a colleague told me that she had learned that for N>50 a normal distribution is not necessary. Is that true? If so, is that because of the central limit theorem?
107
votes
3 answers

Does an unbalanced sample matter when doing logistic regression?

Okay, so I think I have a decent enough sample, taking into account the 20:1 rule of thumb: a fairly large sample (N=374) for a total of 7 candidate predictor variables. My problem is the following: whatever set of predictor variables I use, the…
Michiel
  • 1,173
  • 3
  • 8
  • 5
107
votes
4 answers

What is rank deficiency, and how to deal with it?

Fitting a logistic regression using lme4 ends with Error in mer_finalize(ans) : Downdated X'X is not positive definite. A likely cause of this error is apparently rank deficiency. What is rank deficiency, and how should I address it?
Jack Tanner
  • 4,552
  • 3
  • 27
  • 39
107
votes
2 answers

What is covariance in plain language?

What is covariance in plain language and how is it linked to the terms dependence, correlation and variance-covariance structure with respect to repeated-measures designs?
abc
  • 1,747
  • 3
  • 17
  • 32
107
votes
12 answers

When should linear regression be called "machine learning"?

In a recent colloquium, the speaker's abstract claimed they were using machine learning. During the talk, the only thing related to machine learning was that they perform linear regression on their data. After calculating the best-fit coefficients…
107
votes
15 answers

US Election results 2016: What went wrong with prediction models?

First it was Brexit, now the US election. Many model predictions were off by a wide margin. Are there lessons to be learned here? As late as 4 pm PST yesterday, the betting markets were still favoring Hillary 4 to 1. I take it that the betting…
horaceT
  • 3,162
  • 3
  • 15
  • 19
107
votes
4 answers

How to select kernel for SVM?

When using SVM, we need to select a kernel. I wonder how to select a kernel. Any criteria on kernel selection?
xiaohan2012
  • 6,819
  • 5
  • 18
  • 18
106
votes
17 answers

What is the role of the logarithm in Shannon's entropy?

Shannon's entropy is the negative of the sum of the probabilities of each outcome multiplied by the logarithm of probabilities for each outcome. What purpose does the logarithm serve in this equation? An intuitive or visual answer (as opposed to a…
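The verbal definition in this excerpt can be written out directly. The following is only a minimal sketch of that formula, not taken from the question or its answers: the function name `shannon_entropy` is hypothetical, base-2 logarithms (entropy in bits) are an assumption, and outcomes with zero probability are skipped under the usual convention that they contribute nothing.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits: the negative sum of p * log2(p) over all outcomes.

    Hypothetical helper for illustration. Assumes `probs` is a sequence of
    probabilities that sums to 1. Zero-probability outcomes are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


# A fair coin has two equally likely outcomes.
fair_coin_bits = shannon_entropy([0.5, 0.5])

# A certain outcome carries no uncertainty.
certain_bits = shannon_entropy([1.0])

# Four equally likely outcomes need twice as many bits as two.
four_way_bits = shannon_entropy([0.25, 0.25, 0.25, 0.25])
```

With equally likely outcomes the result grows additively (1 bit for two outcomes, 2 bits for four) while the number of outcomes grows multiplicatively, which is one way to see what the logarithm contributes.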
106
votes
10 answers

Validation Error less than training error?

I found two questions here and here about this issue, but there is no obvious answer or explanation yet. I face the same problem: the validation error is less than the training error in my convolutional neural network. What does that mean?
106
votes
1 answer

Conditional inference trees vs traditional decision trees

Can anyone explain the primary differences between conditional inference trees (ctree from party package in R) compared to the more traditional decision tree algorithms (such as rpart in R)? What makes CI trees different? Strengths and…
B_Miner
  • 7,560
  • 20
  • 81
  • 144
106
votes
11 answers

"Best" series of colors to use for differentiating series in publication-quality plots

Has any study been done on what is the best set of colors to use for showing multiple series on the same plot? I've just been using the defaults in matplotlib, and they look a little childish since they're all bright, primary colors.
Daisy Sophia Hollman
  • 1,203
  • 2
  • 9
  • 7
105
votes
19 answers

How to annoy a statistical referee?

I recently asked a question regarding general principles around reviewing statistics in papers. What I would now like to ask is what particularly irritates you when reviewing a paper, i.e. what's the best way to really annoy a statistical…
csgillespie
  • 11,849
  • 9
  • 56
  • 85
105
votes
7 answers

Is it necessary to scale the target value in addition to scaling features for regression analysis?

I'm building regression models. As a preprocessing step, I scale my feature values to have mean 0 and standard deviation 1. Is it necessary to normalize the target values also?
user2806363
  • 2,313
  • 3
  • 17
  • 27
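The preprocessing step described in this question (scaling each feature to mean 0 and standard deviation 1) might look like the sketch below. It is an illustration only, not the asker's code: the helper name `standardize` is hypothetical, and dividing by the population standard deviation is an assumption, since the excerpt does not say which estimator is used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale one feature column to mean 0 and standard deviation 1.

    Hypothetical helper for illustration. Uses the population standard
    deviation (an assumption) and rejects constant columns, which cannot
    be scaled to unit variance.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Example feature column; the numbers are made up for this sketch.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(feature)
```

The same function could be applied to the target column, which is the choice the question asks about; the sketch takes no position on whether doing so is necessary.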
104
votes
4 answers

What is the difference between zero-inflated and hurdle models?

I wonder if there is a clear-cut difference between the so-called zero-inflated distributions (models) and so-called hurdle-at-zero distributions (models)? The terms occur quite often in the literature and I suspect they are not the same, but would…
skulker
  • 1,268
  • 2
  • 9
  • 6
104
votes
13 answers

Simple algorithm for online outlier detection of a generic time series

I am working with a large number of time series. These time series are basically network measurements arriving every 10 minutes; some of them are periodic (e.g. the bandwidth), while others are not (e.g. the amount of routing traffic). I would…
gianluca
  • 1,921
  • 4
  • 16
  • 9