Most Popular
1500 questions
55 votes
8 answers
Book for reading before Elements of Statistical Learning?
Based on this post, I want to digest Elements of Statistical Learning. Fortunately it is available for free and I started reading it.
I don't have enough knowledge to understand it. Can you recommend a book that is a better introduction to the…

B Seven
55 votes
4 answers
How does LSTM prevent the vanishing gradient problem?
LSTM was invented specifically to avoid the vanishing gradient problem. It is supposed to do that with the Constant Error Carousel (CEC), which on the diagram below (from Greff et al.) corresponds to the loop around the cell.
(source:…

TheWalkingCube
55 votes
2 answers
Intuitive explanations of differences between Gradient Boosting Trees (GBM) & Adaboost
I'm trying to understand the differences between GBM & Adaboost.
These are what I've understood so far:
They are both boosting algorithms, which learn from the previous model's errors and finally make a weighted sum of the models.
GBM and Adaboost…

Hee Kyung Yoon
55 votes
4 answers
Why sigmoid function instead of anything else?
Why is the de facto standard sigmoid function, $\frac{1}{1+e^{-x}}$, so popular in (non-deep) neural networks and logistic regression?
Why don't we use many of the other differentiable functions, with faster computation time or slower decay (so…

Mark Horvath
55 votes
2 answers
Prediction interval for lmer() mixed effects model in R
I want to get a prediction interval around a prediction from an lmer() model. I have found some discussion about this:
http://rstudio-pubs-static.s3.amazonaws.com/24365_2803ab8299934e888a60e7b16113f619.html
http://glmm.wikidot.com/faq
but they seem…

hossibley
55 votes
2 answers
What is maxout in a neural network?
Can anyone explain what maxout units in a neural network do? How do they perform and how do they differ from conventional units?
I tried to read the 2013 "Maxout Networks" paper by Goodfellow et al. (from Professor Yoshua Bengio's group), but I don't…

RockTheStar
54 votes
4 answers
Why do statisticians say a non-significant result means "you can't reject the null" as opposed to accepting the null hypothesis?
Traditional statistical tests, like the two sample t-test, focus on trying to eliminate the hypothesis that there is no difference between a function of two independent samples. Then, we choose a confidence level and say that if the difference of…

ryu576
54 votes
5 answers
What is the difference between NaN and NA?
I would like to know why some languages, like R, have both NA and NaN. What are the differences, or are they the same? Is it really necessary to have NA?

user2479
54 votes
13 answers
Visually interesting statistics concepts that are easy to explain
I noticed on Math Stack Exchange a terrific thread which highlighted a number of visually interesting math concepts. I would be curious to see any graphics/gifs that very clearly illustrate a statistics concept (particularly those…

David Veitch
54 votes
3 answers
Multivariate linear regression vs neural network?
It seems that it is possible to get similar results to a neural network with a multivariate linear regression in some cases, and multivariate linear regression is super fast and easy.
Under what circumstances can neural networks give better results…

Hugh Perkins
54 votes
10 answers
What is a good algorithm for estimating the median of a huge read-once data set?
I'm looking for a good algorithm (meaning minimal computation, minimal storage requirements) to estimate the median of a data set that is too large to store, such that each value can only be read once (unless you explicitly store that value). There…

PeterR
54 votes
3 answers
Is it possible to do time-series clustering based on curve shape?
I have sales data for a series of outlets, and want to categorise them based on the shape of their curves over time. The data looks roughly like this (but obviously isn't random, and has some missing data):
n.quarters <- 100
n.stores <- 20
if…

fmark
54 votes
10 answers
Why is the sum of two random variables a convolution?
For a long time I did not understand why the "sum" of two random variables is their convolution, whereas a mixture density function sum of $f(x)$ and $g(x)$ is $p\,f(x)+(1-p)g(x)$, the arithmetic sum and not their convolution. The exact phrase "the…

Carl
54 votes
8 answers
Modern successor to Exploratory Data Analysis by Tukey?
I've been reading Tukey's book "Exploratory Data Analysis". Written in 1977, the book emphasizes paper/pencil methods. Is there a more 'modern' successor which takes into account that we can now instantaneously plot large data sets?

biofreezer
54 votes
7 answers
Is it a good practice to always scale/normalize data for machine learning?
My understanding is that when features have different value ranges (for example, imagine one feature being the age of a person and another one being their salary in USD), this will negatively affect algorithms because the feature with…

Juan Antonio Gomez Moriano