Most Popular

1500 questions
66
votes
7 answers

Which activation function for output layer?

While the choice of activation functions for the hidden layer is quite clear (mostly sigmoid or tanh), I wonder how to decide on the activation function for the output layer. Common choices are linear functions, sigmoid functions and softmax…
Funkwecker
66
votes
7 answers

How much to pay? A practical problem

This is not a homework question but a real problem faced by our company. Very recently (2 days ago) we ordered 10,000 product labels from a dealer. The dealer is an independent person. He gets the labels manufactured from outside and…
Neeraj
66
votes
5 answers

How to statistically compare two time series?

I have two time series, shown in the plot below. The plot shows the full detail of both time series, but I can easily reduce it to just the coincident observations if needed. My question is: What statistical methods can I use to assess the…
robintw
66
votes
3 answers

Maximum likelihood method vs. least squares method

What is the main difference between maximum likelihood estimation (MLE) and least squares estimation (LSE)? Why can't we use MLE for predicting $y$ values in linear regression and vice versa? Any help on this topic will be greatly appreciated.
evros
66
votes
5 answers

Why do we minimize the negative likelihood if it is equivalent to maximization of the likelihood?

This question has puzzled me for a long time. I understand the use of 'log' in maximizing the likelihood so I am not asking about 'log'. My question is, since maximizing log likelihood is equivalent to minimizing "negative log likelihood" (NLL), why…
Tony
66
votes
12 answers

What does orthogonal mean in the context of statistics?

In other contexts, orthogonal means "at right angles" or "perpendicular". What does orthogonal mean in a statistical context? Thanks for any clarifications.
pmgjones
66
votes
3 answers

Why does ridge estimate become better than OLS by adding a constant to the diagonal?

I understand that the ridge regression estimate is the $\beta$ that minimizes the residual sum of squares plus a penalty on the size of $\beta$: $$\beta_\mathrm{ridge} = (\lambda I_D + X'X)^{-1}X'y = \operatorname{argmin}\big[ \text{RSS} + \lambda…
Heisenberg
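For readers who want to try the closed form quoted in the ridge question above, here is a minimal sketch in NumPy. The function name `ridge_estimate`, the simulated data, and the value $\lambda = 10$ are all assumptions made for illustration; nothing here comes from the question or its answers. Setting `lam` to 0 recovers the OLS estimate when $X'X$ is invertible.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form estimate (lam * I_D + X'X)^{-1} X'y (hypothetical helper)."""
    D = X.shape[1]
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# Simulated data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

beta_ols = ridge_estimate(X, y, 0.0)     # lam = 0 gives the OLS solution
beta_ridge = ridge_estimate(X, y, 10.0)  # lam > 0 shrinks the coefficients

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

The sketch only shows that adding $\lambda$ to the diagonal shrinks the coefficient vector; whether that makes the estimate better in any sense is exactly what the question asks.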
66
votes
4 answers

Random Forest - How to handle overfitting

I have a computer science background but am trying to teach myself data science by solving problems on the internet. I have been working on this problem for the last couple of weeks (approx 900 rows and 10 features). I was initially using logistic…
Abhi
66
votes
4 answers

Intuitive explanation of Fisher Information and Cramer-Rao bound

I am not comfortable with Fisher information, what it measures, or how it is helpful. Also, its relationship with the Cramer-Rao bound is not apparent to me. Can someone please give an intuitive explanation of these concepts?
Infinity
65
votes
2 answers

Why only three partitions? (training, validation, test)

When you are trying to fit models to a large dataset, the common advice is to partition the data into three parts: the training, validation, and test datasets. This is because the models usually have three "levels" of parameters: the first…
charles.y.zheng
65
votes
2 answers

What is the difference between a partial likelihood, profile likelihood and marginal likelihood?

I see these terms being used and I keep getting them mixed up. Is there a simple explanation of the differences between them?
Rob Hyndman
65
votes
6 answers

Real-life examples of moving average processes

Can you give some real-life examples of time series for which a moving average process of order $q$, i.e. $$ y_t = \sum_{i=1}^q \theta_i \varepsilon_{t-i} + \varepsilon_t, \text{ where } \varepsilon_t \sim \mathcal{N}(0, \sigma^2) $$ has some a…
weez13
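The moving average process written out in the question above can be simulated directly from its definition. The following is a rough sketch, assuming NumPy; the helper name `simulate_ma`, the coefficients, and the seed are hypothetical choices for illustration, not anything taken from the thread.

```python
import numpy as np

def simulate_ma(theta, n, sigma=1.0, seed=0):
    """Simulate y_t = sum_{i=1}^q theta_i * eps_{t-i} + eps_t with Gaussian eps."""
    theta = np.asarray(theta, dtype=float)
    q = len(theta)
    rng = np.random.default_rng(seed)
    # Draw q extra shocks so the first observation has a full set of lags.
    eps = rng.normal(0.0, sigma, size=n + q)
    y = np.empty(n)
    for t in range(n):
        # eps[t + q] is the current shock; eps[t:t + q] reversed gives
        # eps_{t-1}, ..., eps_{t-q} in the order matching theta_1, ..., theta_q.
        y[t] = eps[t + q] + theta @ eps[t:t + q][::-1]
    return y

# Example: an MA(2) series with illustrative coefficients.
series = simulate_ma([0.6, -0.3], n=200)
print(series[:5])
```

Plotting such a series next to real data is one informal way to judge whether an MA($q$) model looks plausible for it.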
65
votes
14 answers

What is the most surprising characterization of the Gaussian (normal) distribution?

A standardized Gaussian distribution on $\mathbb{R}$ can be defined by giving explicitly its density: $$ \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$ or its characteristic function. As recalled in this question it is also the only distribution for which the…
65
votes
12 answers

Why do neural networks need so many training examples to perform?

A human child at age 2 needs around 5 instances of a car to be able to identify it with reasonable accuracy regardless of color, make, etc. When my son was 2, he was able to identify trams and trains, even though he had seen just a few. Since he was…
Marcin
65
votes
4 answers

Does it make sense to add a quadratic term but not the linear term to a model?

I have a (mixed) model in which one of my predictors should a priori only be quadratically related to the response (due to the experimental manipulation). Hence, I would like to add only the quadratic term to the model. Two things keep me from…
Henrik