Most Popular

1500 questions
170 votes, 3 answers

What is the difference between a consistent estimator and an unbiased estimator?

What is the difference between a consistent estimator and an unbiased estimator? The precise technical definitions of these terms are fairly complicated, and it's difficult to get an intuitive feel for what they mean. I can imagine a good estimator,…
MathematicalOrchid
170 votes, 2 answers

A list of cost functions used in neural networks, alongside applications

What are common cost functions used in evaluating the performance of neural networks? Details (feel free to skip the rest of this question, my intent here is simply to provide clarification on notation that answers may use to help them be more…
Phylliida
169 votes, 4 answers

Cohen's kappa in plain English

I am reading a data mining book and it mentioned the Kappa statistic as a means for evaluating the prediction performance of classifiers. However, I just can't understand this. I also checked Wikipedia but it didn't help either:…
Jack Twain
169 votes, 16 answers

Are large data sets inappropriate for hypothesis testing?

In a recent article of Amstat News, the authors (Mark van der Laan and Sherri Rose) stated that "We know that for large enough sample sizes, every study—including ones in which the null hypothesis of no effect is true — will declare a statistically…
167 votes, 21 answers

Does Julia have any hope of sticking in the statistical community?

I recently read a post from R-Bloggers that linked to this blog post from John Myles White about a new language called Julia. Julia takes advantage of a just-in-time compiler that gives it wicked fast run times and puts it on the same order of…
Christopher Aden
166 votes, 5 answers

How to intuitively explain what a kernel is?

Many machine learning classifiers (e.g. support vector machines) allow one to specify a kernel. What would be an intuitive way of explaining what a kernel is? One aspect I have been thinking of is the distinction between linear and non-linear…
hashkey
162 votes, 3 answers

Gradient Boosting Tree vs Random Forest

Gradient tree boosting as proposed by Friedman uses decision trees as base learners. I'm wondering if we should make the base decision tree as complex as possible (fully grown) or simpler? Is there any explanation for the choice? Random Forest is…
FihopZz
162 votes, 5 answers

What's the difference between Normalization and Standardization?

At work we were discussing this, as my boss has never heard of normalization. In Linear Algebra, Normalization seems to refer to dividing a vector by its length. And in statistics, Standardization seems to refer to the subtraction of a mean…
Chris
161 votes, 11 answers

What is the difference between a neural network and a deep neural network, and why do the deep ones work better?

I haven't seen the question stated precisely in these terms, which is why I am asking a new question. What I am interested in is not the definition of a neural network, but the actual difference between it and a deep neural network. For…
Nicolas
160 votes, 9 answers

When is it ok to remove the intercept in a linear regression model?

I am running linear regression models and wondering what the conditions are for removing the intercept term. In comparing results from two different regressions where one has the intercept and the other does not, I notice that the $R^2$ of the…
analyticsPierce
160 votes, 9 answers

Bottom to top explanation of the Mahalanobis distance?

I'm studying pattern recognition and statistics, and in almost every book I open on the subject I bump into the concept of Mahalanobis distance. The books give sort of intuitive explanations, but still not good enough ones for me to actually really…
159 votes, 2 answers

Deriving the conditional distributions of a multivariate normal distribution

We have a multivariate normal vector ${\boldsymbol Y} \sim \mathcal{N}(\boldsymbol\mu, \Sigma)$. Consider partitioning $\boldsymbol\mu$ and ${\boldsymbol Y}$ into $$\boldsymbol\mu = \begin{bmatrix} \boldsymbol\mu_1 \\ …
Flying pig
159 votes, 5 answers

What's the difference between principal component analysis and multidimensional scaling?

How are PCA and classical MDS different? How about MDS versus nonmetric MDS? Is there a time when you would prefer one over the other? How do the interpretations differ?
Stephen Turner
158 votes, 9 answers

What exactly are keys, queries, and values in attention mechanisms?

How should one understand the keys, queries, and values that are often mentioned in attention mechanisms? I've tried searching online, but all the resources I find only speak of them as if the reader already knows what they are. Judging by the paper…
Sean
157 votes, 3 answers

How are the standard errors of coefficients calculated in a regression?

For my own understanding, I am interested in manually replicating the calculation of the standard errors of estimated coefficients as, for example, come with the output of the lm() function in R, but haven't been able to pin it down. What is the…
ako