Most Popular

1500 questions
49 votes, 5 answers

Generic sum of Gamma random variables

I have read that the sum of Gamma random variables with the same scale parameter is another Gamma random variable. I've also seen the paper by Moschopoulos describing a method for the summation of a general set of Gamma random variables. I have…
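The closure property the question cites is easy to check empirically. A minimal simulation sketch in Python with NumPy (the shape and scale values are arbitrary): the sum of independent Gamma variables sharing a scale should be Gamma with the shape parameters added.

```python
import numpy as np

rng = np.random.default_rng(0)
k1, k2, theta = 2.0, 3.0, 1.5   # two shapes and the shared scale (arbitrary)
n = 200_000

# Sum of Gamma(k1, theta) and Gamma(k2, theta) draws
s = rng.gamma(k1, theta, n) + rng.gamma(k2, theta, n)

# Gamma(k1 + k2, theta) has mean (k1 + k2)*theta = 7.5
# and variance (k1 + k2)*theta**2 = 11.25
print(s.mean(), s.var())
```

Moschopoulos's method is needed precisely when the scales differ, in which case the sum is no longer Gamma.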
49 votes, 3 answers

What is the difference between posterior and posterior predictive distribution?

I understand what a posterior is, but I'm not sure what the latter means. How are the two different? Kevin P. Murphy indicated in his textbook, Machine Learning: A Probabilistic Perspective, that it is "an internal belief state". What does that really…
A.D
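The short answer to the question above: the posterior is a distribution over parameters; the posterior predictive is a distribution over future data, obtained by averaging the likelihood over the posterior. A sketch with the conjugate Beta-Binomial model (all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                      # Beta prior on the success probability p
y, n = 7, 10                         # data: 7 successes in 10 trials
a_post, b_post = a + y, b + n - y    # posterior: Beta(a + y, b + n - y)

m = 10                               # future trials to predict
p_hat = a_post / (a_post + b_post)   # posterior mean of p

# Posterior predictive: average Binomial(m, p) over posterior draws of p
p_draws = rng.beta(a_post, b_post, 100_000)
x_pred = rng.binomial(m, p_draws)

# A plug-in Binomial(m, p_hat) ignores parameter uncertainty,
# so the posterior predictive is strictly more dispersed
plug_in_var = m * p_hat * (1 - p_hat)
print(x_pred.var(), plug_in_var)
```

That extra variance is the practical difference: the posterior predictive propagates uncertainty about the parameter into predictions, rather than conditioning on a single point estimate.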
49 votes, 1 answer

Logistic regression: anova chi-square test vs. significance of coefficients (anova() vs summary() in R)

I have a logistic GLM model with 8 variables. I ran a chi-square test in R, anova(glm.model, test='Chisq'), and two of the variables turn out to be predictive when ordered at the top of the test and not so much when ordered at the bottom. The…
49 votes, 6 answers

What do "endogeneity" and "exogeneity" mean substantively?

I understand that the basic definition of endogeneity is that $X'\epsilon = 0$ is not satisfied, but what does this mean in a real-world sense? I read the Wikipedia article, with the supply and demand example, trying to make sense of it, but it…
user25901
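Substantively, endogeneity means a regressor is correlated with the error term, which makes OLS biased and inconsistent no matter how large the sample. A simulation sketch (the coefficient and correlation values are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

u = rng.normal(size=n)            # structural error term
x = 0.8 * u + rng.normal(size=n)  # x is correlated with u: endogenous regressor
y = 1.0 * x + u                   # the true coefficient on x is 1.0

beta = (x @ y) / (x @ x)          # OLS slope (no intercept)
print(beta)                       # noticeably above 1.0: OLS is biased
```

The probability limit here is 1 + Cov(x, u)/Var(x) = 1 + 0.8/1.64 ≈ 1.49, so the bias does not shrink with more data; that is why instruments or structural assumptions are needed.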
49 votes, 5 answers

What is residual standard error?

When running a multiple regression model in R, one of the outputs is a residual standard error of 0.0589 on 95,161 degrees of freedom. I know that the 95,161 degrees of freedom is given by the difference between the number of observations in my…
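The quantity in the question above is simply the square root of the residual sum of squares divided by the residual degrees of freedom: observations minus estimated coefficients. A minimal sketch with simulated data (the design and noise level are arbitrary; the true error SD is 0.3, which the RSE should recover):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
resid = y - X @ beta

df = n - X.shape[1]                 # degrees of freedom: n minus coefficients
rse = np.sqrt(resid @ resid / df)   # residual standard error
print(rse)                          # close to the true error SD of 0.3
```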
ustroetz
49 votes, 2 answers

Are splines overfitting the data?

My problem: I recently met a statistician who informed me that splines are only useful for exploring data and are subject to overfitting, and thus not useful in prediction. He preferred exploring with simple polynomials ... As I'm a big fan of…
Max Gordon
49 votes, 4 answers

Does correlation = 0.2 mean that there is an association "in only 1 in 5 people"?

In The Idiot Brain: A Neuroscientist Explains What Your Head is Really Up To, Dean Burnett wrote: The correlation between height and intelligence is usually cited as being about $0.2$, meaning height and intelligence seem to be associated in only…
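The quoted reading is a common misinterpretation: $r = 0.2$ is not a proportion of people. One conventional gloss is that $r^2 = 0.04$, so about 4% of the variance in one variable is shared with the other, across everyone. A sketch generating data whose population correlation is 0.2 by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)
# Mix x into y so that Corr(x, y) = 0.2 exactly in the population
y = 0.2 * x + np.sqrt(1 - 0.2**2) * rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]
print(r, r**2)   # r near 0.2; shared variance r**2 near 0.04
```

Every individual in this sample is drawn from the same weakly associated joint distribution; there is no subgroup of "1 in 5" for whom the association holds.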
Sitak
49 votes, 6 answers

Is Amazon's "average rating" misleading?

If I understand correctly, book ratings on a 1-5 scale are Likert scores. That is, a 3 for me may not necessarily be a 3 for someone else. It's an ordinal scale IMO. One shouldn't really average ordinal scales but can definitely take the mode,…
PhD
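The concern in the question above is easy to see with a toy example (ratings invented): on an ordinal scale, the mean can land on a value nobody chose, while the median and mode stay on the scale.

```python
import numpy as np

ratings = np.array([1, 1, 1, 1, 5, 5, 5])    # polarised reviews (invented data)

mean = ratings.mean()                        # ~2.71 -- a score no one actually gave
median = np.median(ratings)                  # 1.0
mode = np.bincount(ratings).argmax()         # 1, the most common rating
print(mean, median, mode)
```

Whether the "average rating" is misleading then depends on whether a 2.71 summary of a bimodal 1-vs-5 split is more or less informative than reporting the full distribution of stars.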
49 votes, 8 answers

Danger of setting all initial weights to zero in Backpropagation

Why is it dangerous to initialize weights with zeros? Is there any simple example that demonstrates it?
user8078
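A simple demonstration of the danger asked about above: in a two-layer tanh network with no biases, all-zero weights make every gradient exactly zero, so gradient descent never moves; more generally, any identical initialisation keeps hidden units identical to each other forever (the symmetry problem). A sketch with a hand-coded forward and backward pass (architecture and data are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(8, 3))                  # batch of 8 inputs, 3 features
y = rng.normal(size=(8, 1))                  # regression targets

W1 = np.zeros((3, 4))                        # all-zero initialisation
W2 = np.zeros((4, 1))

h = np.tanh(x @ W1)                          # hidden layer: tanh(0) = 0 everywhere
err = h @ W2 - y                             # output error

gW2 = h.T @ err                              # zero, because h is all zeros
gW1 = x.T @ ((err @ W2.T) * (1 - h**2))      # zero, because W2 is all zeros

print(np.abs(gW1).max(), np.abs(gW2).max())  # both exactly 0: no learning signal
```

Random initialisation breaks the symmetry, giving each hidden unit a distinct gradient from the first step.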
49 votes, 6 answers

Why do we use ReLU in neural networks and how do we use it?

Why do we use rectified linear units (ReLU) with neural networks? How does that improve a neural network? Why do we say that ReLU is an activation function? Isn't softmax the activation function for neural networks? I am guessing that we use both, ReLU…
user2896492634
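On the ReLU-versus-softmax point in the question above: they are not competing choices. ReLU is applied elementwise in hidden layers; softmax is typically used only at the output layer of a classifier, to turn scores into probabilities. A minimal sketch of both:

```python
import numpy as np

def relu(z):
    # Hidden-layer activation: elementwise max(0, z)
    return np.maximum(0.0, z)

def softmax(z):
    # Output-layer activation: maps scores to probabilities summing to 1
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

h = relu(np.array([-1.0, 0.5, 2.0]))     # negative inputs clipped to 0
p = softmax(np.array([1.0, 2.0, 3.0]))   # a valid probability vector
print(h, p, p.sum())
```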
49 votes, 1 answer

How does the Adam method of stochastic gradient descent work?

I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam: "Adam: A Method for Stochastic Optimization". While I've definitely got some insights (at least), the paper seems to be too high…
daniel451
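The core of Adam fits in a few lines: keep exponential moving averages of the gradient ($m$) and squared gradient ($v$), correct their initialisation bias, and scale each step by $\hat m / (\sqrt{\hat v} + \epsilon)$. A sketch minimising the invented toy objective $f(x) = x^2$, using the paper's default decay rates:

```python
import numpy as np

lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8   # b1, b2, eps are the paper's defaults
x = 3.0                                    # start away from the minimum of f(x) = x^2
m = v = 0.0

for t in range(1, 1001):
    g = 2 * x                              # gradient of f(x) = x^2
    m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2           # second-moment (uncentred variance) estimate
    m_hat = m / (1 - b1**t)                # bias corrections for zero initialisation
    v_hat = v / (1 - b2**t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(x)   # close to 0, the minimiser
```

The division by $\sqrt{\hat v}$ gives each parameter its own effective step size, which is what distinguishes Adam from plain SGD with momentum.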
49 votes, 5 answers

How to make a time series stationary?

Besides taking differences, what are other techniques for making a non-stationary time series stationary? Ordinarily one refers to a series as "integrated of order $p$" if it can be made stationary by applying the difference operator, $(1-L)^p X_t$.
Shane
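Common alternatives to differencing include detrending (regressing out a deterministic trend), log or Box-Cox transforms to stabilise variance, and seasonal differencing. As a baseline, here is the differencing case itself: a first difference turns a linear trend into a constant mean (the series is invented):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(500)
y = 0.5 * t + rng.normal(size=500)   # linear trend: mean grows without bound

d = np.diff(y)                       # first difference, (1 - L) y_t

# The differenced series has constant mean equal to the slope, 0.5
print(d.mean())
```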
49 votes, 3 answers

How does saddlepoint approximation work?

How does saddlepoint approximation work? What sort of problem is it good for? (Feel free to use a particular example or examples by way of illustration) Are there any drawbacks, difficulties, things to watch out for, or traps for the unwary?
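The basic recipe: given the cumulant generating function $K(s)$, solve the saddlepoint equation $K'(\hat s) = x$, then approximate the density by $\hat f(x) = \exp\{K(\hat s) - \hat s x\} / \sqrt{2\pi K''(\hat s)}$. A sketch for a Gamma$(n, 1)$ variable (a sum of $n$ unit exponentials), where the approximation can be checked against the exact density; the values of $n$ and $x$ are arbitrary:

```python
import numpy as np
from math import factorial

n, x = 10, 12.0   # evaluate the Gamma(n, 1) density at x (arbitrary choices)

# CGF of Gamma(n, 1): K(s) = -n*log(1 - s), so K'(s) = n/(1 - s), K''(s) = n/(1 - s)**2
s_hat = 1 - n / x                 # solves the saddlepoint equation K'(s_hat) = x
K = -n * np.log(1 - s_hat)
K2 = n / (1 - s_hat) ** 2

approx = np.exp(K - s_hat * x) / np.sqrt(2 * np.pi * K2)
exact = x ** (n - 1) * np.exp(-x) / factorial(n - 1)
print(approx, exact, approx / exact)   # relative error under 1% here
```

For the Gamma the approximation is exact up to a Stirling-series normalising constant, which is why renormalised saddlepoint densities are often remarkably accurate even far into the tails.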
49 votes, 1 answer

Difference between GradientDescentOptimizer and AdamOptimizer (TensorFlow)?

I've written a simple MLP in TensorFlow which models an XOR gate. So for: input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] it should produce the following: output_data = [[0.], [1.], [1.], [0.]] The network has an input layer, a hidden…
49 votes, 5 answers

How does rectilinear activation function solve the vanishing gradient problem in neural networks?

I found the rectified linear unit (ReLU) praised in several places as a solution to the vanishing gradient problem for neural networks. That is, one uses max(0, x) as the activation function. When the activation is positive, it is obvious that this is better…
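The arithmetic behind the claim: the logistic sigmoid's derivative never exceeds 0.25, and backpropagation multiplies one such factor per layer, so even in the best case the gradient shrinks geometrically with depth. The ReLU derivative is exactly 1 wherever the unit is active, so those factors do not shrink. A sketch (the depth is arbitrary):

```python
import numpy as np

def dsigmoid(z):
    # Derivative of the logistic sigmoid; its maximum is 0.25, at z = 0
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

depth = 10

# Best case for sigmoid: every layer contributes its maximal factor of 0.25
sigmoid_gain = dsigmoid(0.0) ** depth   # 0.25**10, already under 1e-6
relu_gain = 1.0 ** depth                # ReLU contributes 1 for active units

print(sigmoid_gain, relu_gain)
```

The caveat, and the subject of the question's "When the activation is positive", is that inactive ReLU units contribute a factor of 0 instead, which is the dying-ReLU problem rather than a vanishing gradient.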