Most Popular
1500 questions
49 votes · 5 answers
Generic sum of Gamma random variables
I have read that the sum of Gamma random variables with the same scale parameter is another Gamma random variable. I've also seen the paper by Moschopoulos describing a method for the summation of a general set of Gamma random variables. I have…

OSE · 1,057
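As a quick check of the claim in this question, here is a numpy sketch (shape and scale values are made up for illustration): if X ~ Gamma(k1, θ) and Y ~ Gamma(k2, θ) are independent with the *same* scale θ, then X + Y ~ Gamma(k1 + k2, θ), which we can verify through the first two moments. Note this only covers the equal-scale case; Moschopoulos's method is for the harder case of differing scales.

```python
import numpy as np

rng = np.random.default_rng(0)
k1, k2, theta = 2.0, 3.5, 1.5   # shapes and a common scale (illustrative values)
n = 200_000

x = rng.gamma(k1, theta, n)
y = rng.gamma(k2, theta, n)
s = x + y                        # should be distributed Gamma(k1 + k2, theta)

# Moments of Gamma(k, theta): mean = k*theta, variance = k*theta^2
print(s.mean(), (k1 + k2) * theta)        # sample vs theoretical mean
print(s.var(), (k1 + k2) * theta ** 2)    # sample vs theoretical variance
```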
49 votes · 3 answers
What is the difference between posterior and posterior predictive distribution?
I understand what a posterior is, but I'm not sure what the latter means.
How are the two different?
Kevin P Murphy indicated in his textbook, Machine Learning: a Probabilistic Perspective, that it is "an internal belief state". What does that really…

A.D · 2,114
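The distinction can be made concrete with a Beta-Binomial sketch (the prior and data values below are made up): the posterior is a distribution over the *parameter*, while the posterior predictive is a distribution over the *next observation*, obtained by averaging the likelihood over the posterior.

```python
a, b = 2.0, 2.0          # Beta(a, b) prior over the coin bias theta
s, n = 7, 10             # observed 7 heads in 10 flips

# Posterior: a distribution over the parameter theta -> Beta(a+s, b+n-s)
post_a, post_b = a + s, b + (n - s)

# Posterior predictive of the next flip: integrate theta out of the likelihood,
# which for this model collapses to the posterior mean of theta
p_next_heads = post_a / (post_a + post_b)

print(post_a, post_b)      # 9.0, 5.0
print(p_next_heads)        # 9/14 ≈ 0.643
```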
49 votes · 1 answer
Logistic regression: anova chi-square test vs. significance of coefficients (anova() vs summary() in R)
I have a logistic GLM model with 8 variables. I ran a chi-square test in R anova(glm.model,test='Chisq') and 2 of the variables turn out to be predictive when ordered at the top of the test and not so much when ordered at the bottom. The…

StreetHawk · 493
49 votes · 6 answers
What do "endogeneity" and "exogeneity" mean substantively?
I understand that the basic definition of endogeneity is that
$$
X'\epsilon=0
$$
is not satisfied, but what does this mean in a real world sense? I read the Wikipedia article, with the supply and demand example, trying to make sense of it, but it…

user25901 · 491
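The real-world consequence is easy to see in simulation. In this made-up numpy sketch, an unobserved factor u drives both the regressor and the error, so E[X'ε] ≠ 0 and the OLS slope is biased away from the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta = 2.0

u = rng.normal(size=n)            # unobserved factor driving both x and the error
x = rng.normal(size=n) + u        # regressor correlated with the error -> endogenous
eps = u + 0.5 * rng.normal(size=n)
y = beta * x + eps

# OLS slope: Cov(x, y) / Var(x)
b_ols = np.cov(x, y)[0, 1] / np.var(x)
print(b_ols)   # ≈ 2.5 rather than the true 2.0: Cov(x, eps)/Var(x) = 1/2 of bias
```

Here the bias is exactly Cov(x, ε)/Var(x) = 1/2, so no amount of data fixes it.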
49 votes · 5 answers
What is residual standard error?
When running a multiple regression model in R, one of the outputs is a residual standard error of 0.0589 on 95,161 degrees of freedom. I know that the 95,161 degrees of freedom is given by the difference between the number of observations in my…

ustroetz · 741
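The arithmetic behind the number R reports can be sketched in numpy with a made-up toy dataset: the residual standard error is sqrt(RSS / df), where df = n minus the number of estimated coefficients (intercept included).

```python
import numpy as np

# Toy data (made up); fit y = b0 + b1*x by least squares
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.2])

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = len(y), X.shape[1]           # p counts the intercept here
df = n - p                          # degrees of freedom: 6 - 2 = 4
rss = np.sum(resid ** 2)
rse = np.sqrt(rss / df)             # residual standard error
print(df, rse)
```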
49 votes · 2 answers
Are splines overfitting the data?
My problem: I recently met a statistician who informed me that splines are only useful for exploring data and are subject to overfitting, and thus not useful for prediction. He preferred exploring with simple polynomials ... As I’m a big fan of…

Max Gordon · 5,616
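The overfitting concern itself is easy to demonstrate with plain polynomials (a numpy sketch with made-up data; splines would need scipy, so this only illustrates the flexibility trade-off, not splines specifically): a very flexible fit drives training error toward zero while held-out error stays larger.

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples from a smooth function (illustrative setup)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0.02, 0.98, 50)
f = lambda x: np.sin(2 * np.pi * x)
y_train = f(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = f(x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(deg):
    coef = np.polyfit(x_train, y_train, deg)
    tr = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    te = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return tr, te

tr3, te3 = mse(3)     # moderate flexibility
tr9, te9 = mse(9)     # very flexible: near-zero training error
print(tr3, te3)
print(tr9, te9)       # train error shrinks, test error does not follow
```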
49 votes · 4 answers
Does correlation = 0.2 mean that there is an association "in only 1 in 5 people"?
In The Idiot Brain: A Neuroscientist Explains What Your Head is Really Up To, Dean Burnett wrote
The correlation between height and intelligence is usually cited as
being about $0.2$, meaning height and intelligence seem to be associated in only…

Sitak · 593
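One way to see why the "1 in 5 people" reading fails (a numpy sketch, assuming bivariate normality for illustration): r = 0.2 means only r² = 4% of variance is shared, and the association is spread weakly over *everyone*, e.g. the chance of being on the same side of the mean for both traits is only slightly above the 50% of no association.

```python
import numpy as np

rho = 0.2
print(rho ** 2)   # 0.04: share of variance in one trait "explained" by the other

# Bivariate normal: P(same side of the mean on both traits) = 1/2 + arcsin(rho)/pi
p_same_side = 0.5 + np.arcsin(rho) / np.pi
print(p_same_side)   # ≈ 0.564, barely above 0.5

# Check by simulation
rng = np.random.default_rng(3)
cov = [[1, rho], [rho, 1]]
z = rng.multivariate_normal([0, 0], cov, size=200_000)
p_sim = np.mean((z[:, 0] > 0) == (z[:, 1] > 0))
print(p_sim)
```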
49 votes · 6 answers
Is Amazon's "average rating" misleading?
If I understand correctly, book ratings on a 1-5 scale are Likert scores. That is, a 3 for me may not necessarily be a 3 for someone else. It's an ordinal scale IMO. One shouldn't really average ordinal scales but can definitely take the mode,…

PhD · 13,429
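A small sketch of the ordinal-scale point (ratings are made up): for a polarized set of 1–5 star ratings, the mean lands on a value that describes almost nobody, while the mode and median respect the ordinal structure.

```python
from collections import Counter
import statistics

ratings = [5, 5, 4, 1, 5, 3, 5, 1, 5, 2]   # made-up 1-5 star ratings

mean = statistics.mean(ratings)              # treats the scale as interval data
median = statistics.median(ratings)          # valid for ordinal data
mode = Counter(ratings).most_common(1)[0][0] # valid for ordinal data

print(mean, median, mode)   # 3.6, 4.5, 5: the "average rating" 3.6 hides the polarization
```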
49 votes · 8 answers
Danger of setting all initial weights to zero in Backpropagation
Why is it dangerous to initialize weights with zeros? Is there any simple example that demonstrates it?

user8078 · 593
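A minimal numpy sketch of the symmetry problem (the tiny network and data are made up): with all-zero weights, every hidden unit computes the same value, the first-layer gradient is exactly zero, and the second-layer gradient is identical across hidden units, so the units can never differentiate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
x = rng.normal(size=(8, 3))        # batch of inputs
y = rng.normal(size=(8, 1))        # targets

W1 = np.zeros((3, 4))              # all-zero initialization
W2 = np.zeros((4, 1))

# Forward pass
h = sigmoid(x @ W1)                # every hidden unit outputs the same value (0.5)
out = h @ W2

# Backward pass (squared error)
d_out = out - y
gW2 = h.T @ d_out
d_h = (d_out @ W2.T) * h * (1 - h)
gW1 = x.T @ d_h

print(gW1)            # all zeros: W1 receives no learning signal at the first step
print(gW2[:, 0])      # identical entries: the hidden units remain interchangeable
```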
49 votes · 6 answers
Why do we use ReLU in neural networks and how do we use it?
Why do we use rectified linear units (ReLU) with neural networks? How does that improve a neural network?
Why do we say that ReLU is an activation function? Isn't softmax the activation function for neural networks? I am guessing that we use both, ReLU…

user2896492634 · 593
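A sketch of how the two activations typically divide the work (weights and input are made up): ReLU adds nonlinearity between hidden layers, while softmax sits only at the output to turn scores into class probabilities.

```python
import numpy as np

def relu(z):               # hidden-layer activation: max(0, z), elementwise
    return np.maximum(0.0, z)

def softmax(z):            # output-layer activation: scores -> probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 0.5])
W1 = np.array([[0.2, -0.5], [0.1, 0.3], [-0.4, 0.8]])
W2 = np.array([[1.0, -1.0, 0.5], [0.0, 2.0, -0.3]])

h = relu(x @ W1)           # ReLU in the hidden layer
p = softmax(h @ W2)        # softmax only at the output
print(h, p, p.sum())       # p sums to 1, as a probability vector must
```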
49 votes · 1 answer
How does the Adam method of stochastic gradient descent work?
I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam: "Adam: A Method for Stochastic Optimization".
While I've definitely got some insights (at least), the paper seems to be too high…

daniel451 · 2,635
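A minimal numpy sketch of the Adam update rule on a toy objective f(w) = w₁² + w₂² (the objective and step size are chosen for illustration; β₁, β₂, ε are the paper's defaults): keep exponential moving averages of the gradient and its square, correct their initialization bias, and scale each coordinate's step by the square root of its second moment.

```python
import numpy as np

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

w = np.array([3.0, -2.0])
m = np.zeros_like(w)   # first moment: running mean of gradients
v = np.zeros_like(w)   # second moment: running mean of squared gradients

best_loss = float(w @ w)
for t in range(1, 501):
    g = 2 * w                          # gradient of the toy objective
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)       # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    best_loss = min(best_loss, float(w @ w))

print(best_loss, w)   # the iterates reach the neighborhood of the minimum at the origin
```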
49 votes · 5 answers
How to make a time series stationary?
Besides taking differences, what other techniques can make a non-stationary time series stationary?
Ordinarily one refers to a series as "integrated of order $p$" if it can be made stationary through the lag operator $(1-L)^p X_t$.

Shane · 11,961
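To make the $(1-L)$ notation in the excerpt concrete, here is a numpy sketch (the simulated walk is made up for illustration): a random walk is integrated of order 1, and one application of the difference operator recovers a series with stable variance.

```python
import numpy as np

rng = np.random.default_rng(5)

# Random walk: x_t = x_{t-1} + e_t is integrated of order 1 (non-stationary)
e = rng.normal(size=2000)
x = np.cumsum(e)

dx = np.diff(x)               # (1 - L) x_t: first difference recovers e_t

# The walk's variance drifts between windows; the differenced series is stable
print(np.var(x[:1000]), np.var(x[1000:]))
print(np.var(dx[:1000]), np.var(dx[1000:]))   # both ≈ 1, the innovation variance
```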
49 votes · 3 answers
How does saddlepoint approximation work?
How does saddlepoint approximation work? What sort of problem is it good for?
(Feel free to use a particular example or examples by way of illustration)
Are there any drawbacks, difficulties, things to watch out for, or traps for the unwary?

Glen_b · 257,508
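A small worked example of the mechanics (the Gamma(α, 1) target is an illustrative choice, since its cumulant generating function K(s) = −α log(1−s) is available in closed form): solve K′(ŝ) = x for the saddlepoint ŝ, then approximate the density by exp(K(ŝ) − ŝx) / √(2π K″(ŝ)) and compare with the exact density.

```python
from math import gamma, pi, exp, log, sqrt

alpha = 5.0
x = 4.0

# Solve K'(s) = alpha/(1-s) = x for the saddlepoint s_hat
s_hat = 1.0 - alpha / x
K = -alpha * log(1.0 - s_hat)
K2 = alpha / (1.0 - s_hat) ** 2          # K''(s_hat)

f_saddle = exp(K - s_hat * x) / sqrt(2.0 * pi * K2)
f_exact = x ** (alpha - 1) * exp(-x) / gamma(alpha)

print(f_saddle, f_exact)   # the approximation tracks the exact density within ~2% here
```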
49 votes · 1 answer
Difference between GradientDescentOptimizer and AdamOptimizer (TensorFlow)?
I've written a simple MLP in TensorFlow which models an XOR gate.
So for:
input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
it should produce the following:
output_data = [[0.], [1.], [1.], [0.]]
The network has an input layer, a hidden…

daniel451 · 2,635
49 votes · 5 answers
How does rectilinear activation function solve the vanishing gradient problem in neural networks?
I have found the rectified linear unit (ReLU) praised in several places as a solution to the vanishing gradient problem in neural networks. That is, one uses max(0, x) as the activation function. When the activation is positive, it is obvious that this is better…

Hans-Peter Störr · 607
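The core of the vanishing-gradient argument fits in a few lines (the 20-layer depth is an illustrative assumption): backpropagation multiplies one activation derivative per layer, the sigmoid's derivative is at most 0.25, so the product shrinks geometrically with depth, while the ReLU's derivative is exactly 1 wherever the unit is active.

```python
def dsigmoid(a):      # derivative of the sigmoid, expressed via its output a
    return a * (1 - a)

layers = 20

# Best case for sigmoid: derivative 0.25 at every layer (output a = 0.5)
sig_grad = dsigmoid(0.5) ** layers
relu_grad = 1.0 ** layers      # ReLU derivative is 1 on the active side

print(sig_grad)    # ≈ 9e-13: the gradient has effectively vanished
print(relu_grad)   # 1.0: no shrinkage while the units stay active
```

The flip side, not shown here, is that a ReLU unit stuck on the inactive side contributes a derivative of exactly 0 ("dying ReLU").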