Questions tagged [optimization]

Use this tag for any use of optimization within statistics.

In statistics, optimization is often used to select an estimator of a parameter by maximizing or minimizing some function of the data. One very common example is choosing the estimator that maximizes the joint density (or mass function) of the observed data; this is known as Maximum Likelihood Estimation (MLE).
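A minimal sketch of that MLE idea in Python (the simulated normal sample, starting values, and use of scipy.optimize are illustrative assumptions, not a reference implementation):

```python
# Minimal sketch: maximum likelihood estimation for a normal sample
# by numerically minimizing the negative log-likelihood (illustrative only).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)   # simulated observations

def neg_log_likelihood(params, x):
    mu, log_sigma = params          # log-parameterize sigma to keep it positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma) ** 2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)            # should be close to 5 and 2
```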

Optimization is a large field with many uses beyond estimation. In design of experiments we can choose a design by maximizing a certain determinant (D-optimality). In industrial statistics we can use a fitted response-surface model to optimize the economy or yield of a process. Wikipedia has an overview article: https://en.wikipedia.org/wiki/Mathematical_optimization .
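A toy sketch of the D-optimality idea for a straight-line model, assuming a hypothetical candidate grid and run count and a brute-force search:

```python
# Illustrative sketch of D-optimality: among candidate designs, keep the one
# maximizing det(X'X) for a straight-line model y = b0 + b1*x.
import itertools
import numpy as np

candidates = np.linspace(-1, 1, 11)   # assumed grid of candidate design points
n_runs = 4                            # hypothetical number of experimental runs

best_det, best_design = -np.inf, None
for design in itertools.combinations_with_replacement(candidates, n_runs):
    X = np.column_stack([np.ones(n_runs), np.array(design)])   # model matrix
    d = np.linalg.det(X.T @ X)
    if d > best_det:
        best_det, best_design = d, design

print(best_design)   # for a line on [-1, 1], the runs pile up at the endpoints
```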

2533 questions
219
votes
9 answers

Why is Newton's method not widely used in machine learning?

This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes: After reviewing a set of lectures on convex optimization, Newton's method seems to be far superior to gradient…
Fei Yang
  • 2,181
  • 3
  • 8
  • 4
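A rough sketch of the trade-off behind this question, on an assumed 2-D toy quadratic: a Newton step solves a linear system with the Hessian, which is exact here but scales poorly with the number of parameters.

```python
# One gradient-descent step vs. one Newton step on a toy quadratic.
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # assumed positive-definite Hessian
b = np.array([1.0, -2.0])

def f(x):    return 0.5 * x @ A @ x - b @ x
def grad(x): return A @ x - b
def hess(x): return A                    # constant Hessian for a quadratic

x = np.zeros(2)
x_gd = x - 0.1 * grad(x)                             # cheap step, many needed
x_newton = x - np.linalg.solve(hess(x), grad(x))     # solves a d-by-d system

print(f(x_gd), f(x_newton))   # Newton lands at the exact minimizer of the quadratic
```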
139
votes
5 answers

Batch gradient descent versus stochastic gradient descent

Suppose we have some training set $(x_{(i)}, y_{(i)})$ for $i = 1, \dots, m$. Also suppose we run some type of supervised learning algorithm on the training set. Hypotheses are represented as $h_{\theta}(x_{(i)}) = \theta_0+\theta_{1}x_{(i)1} +…
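An illustrative contrast of the two update rules for plain linear regression (the simulated data, learning rates, and epoch counts are assumptions, not from the question):

```python
# Minimal sketch of batch vs. stochastic gradient descent for linear regression
# with the bias folded into the feature vector; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 3.0]) + 0.1 * rng.normal(size=200)

def batch_gd(X, y, lr=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        theta -= lr * X.T @ (X @ theta - y) / len(y)   # uses every example per step
    return theta

def sgd(X, y, lr=0.01, epochs=10):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):              # one example per update
            theta -= lr * (X[i] @ theta - y[i]) * X[i]
    return theta

print(batch_gd(X, y), sgd(X, y))   # both approach the true coefficients [1, 3]
```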
114
votes
2 answers

tanh activation function vs sigmoid activation function

The tanh activation function is: $$\tanh(x) = 2 \cdot \sigma(2x) - 1$$ where $\sigma(x)$, the sigmoid function, is defined as $$\sigma(x) = \frac{e^x}{1 + e^x}.$$ Questions: Does it really matter between using those…
satya
  • 1,293
  • 2
  • 9
  • 9
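A quick numeric check of the identity quoted in the question (NumPy only; the grid of test points is arbitrary):

```python
# Verify tanh(x) = 2*sigmoid(2x) - 1 numerically.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # same as e^x / (1 + e^x)

x = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True
# tanh outputs are centered in (-1, 1) while sigmoid outputs lie in (0, 1),
# which is the main practical difference usually discussed.
```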
112
votes
6 answers

Is it possible to train a neural network without backpropagation?

Many neural network books and tutorials spend a lot of time on the backpropagation algorithm, which is essentially a tool to compute the gradient. Let's assume we are building a model with ~10K parameters / weights. Is it possible to run the…
Haitao Du
  • 32,885
  • 17
  • 118
  • 213
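One alternative usually contrasted with backpropagation is a finite-difference estimate of the gradient; a minimal sketch with a hypothetical stand-in loss rather than a real network:

```python
# Central finite differences: two extra function evaluations per parameter,
# which is why backpropagation is preferred for ~10K weights.
import numpy as np

def loss(w):                       # stand-in for a network's forward pass + loss
    return np.sum((w - 1.0) ** 2)

def numerical_gradient(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(w.size):        # one loop iteration (2 evaluations) per weight
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.zeros(5)
print(numerical_gradient(loss, w))   # close to the exact gradient 2*(w - 1) = -2
```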
91
votes
7 answers

Why optimize the log probability instead of the probability

In most machine learning tasks where you can formulate some probability $p$ which should be maximised, we would actually optimize the log probability $\log p$ instead of the probability for some parameters $\theta$. E.g. in maximum likelihood…
Albert
  • 1,145
  • 1
  • 9
  • 12
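A small illustration of the numerical part of the usual answer, using hypothetical per-observation likelihoods:

```python
# A product of many small probabilities underflows in double precision,
# while the sum of their logs stays well-behaved (and has the same maximizer).
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=2000)   # hypothetical per-observation likelihoods

print(np.prod(p))          # 0.0 -- underflows
print(np.sum(np.log(p)))   # finite
```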
75
votes
6 answers

What is an intuitive explanation for how PCA turns from a geometric problem (with distances) to a linear algebra problem (with eigenvectors)?

I've read a lot about PCA, including various tutorials and questions (such as this one, this one, this one, and this one). The geometric problem that PCA is trying to optimize is clear to me: PCA tries to find the first principal component by…
stackoverflowuser2010
  • 3,190
  • 5
  • 27
  • 35
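A numeric sketch of the link the question asks about, on assumed 2-D toy data: the brute-force max-variance direction matches the top eigenvector of the covariance matrix.

```python
# The direction maximizing the variance of the projections is an eigenvector
# of the covariance matrix (checked here against a brute-force search).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=2000)
X -= X.mean(axis=0)

C = X.T @ X / len(X)                      # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
w_eig = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

# brute-force search over unit directions for the max-variance projection
angles = np.linspace(0, np.pi, 2000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
w_best = dirs[np.argmax(np.var(X @ dirs.T, axis=0))]

print(np.abs(w_eig @ w_best))             # close to 1: same direction (up to sign)
```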
73
votes
6 answers

Optimization when Cost Function Slow to Evaluate

Gradient descent and many other methods are useful for finding local minima in cost functions. They can be efficient when the cost function can be evaluated quickly at each point, whether numerically or analytically. I have what appears to me to…
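One family of answers to this situation is Bayesian optimization: fit a cheap surrogate to the few expensive evaluations and choose the next point by expected improvement. A minimal sketch with a Gaussian-process surrogate; the stand-in cost function, kernel, grid, and budget are assumptions, and scikit-learn/SciPy are used for convenience.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_cost(x):                   # stand-in for the slow cost function
    return np.sin(3 * x) + 0.5 * x**2

grid = np.linspace(-2, 2, 400).reshape(-1, 1)
X_obs = np.array([[-1.5], [0.0], [1.5]])   # a few initial evaluations
y_obs = expensive_cost(X_obs).ravel()

for _ in range(10):                      # each loop costs one expensive evaluation
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sd = gp.predict(grid, return_std=True)
    imp = y_obs.min() - mu               # improvement over the best value (minimizing)
    z = imp / np.maximum(sd, 1e-12)
    ei = imp * norm.cdf(z) + sd * norm.pdf(z)   # expected improvement
    x_next = grid[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next.reshape(1, 1)])
    y_obs = np.append(y_obs, expensive_cost(x_next))

print(X_obs[np.argmin(y_obs)], y_obs.min())
```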
71
votes
4 answers

What's the difference between momentum based gradient descent and Nesterov's accelerated gradient descent?

So momentum based gradient descent works as follows: $v=\beta m-\eta g$ where $m$ is the previous weight update, and $g$ is the current gradient with respect to the parameters $p$, $\eta$ is the learning rate, and $\beta$ is a constant. $p_{new} = p…
applecider
  • 1,175
  • 2
  • 11
  • 13
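A side-by-side sketch of the two update rules on a 1-D toy quadratic, following the question's notation; the function, step size, and momentum constant are arbitrary choices.

```python
# Classical momentum vs. Nesterov: the only difference is where the gradient
# is evaluated (current point vs. the "look-ahead" point).
import numpy as np

def grad(p):                 # gradient of f(p) = p^2
    return 2.0 * p

eta, beta = 0.1, 0.9

p, v = 5.0, 0.0              # classical (heavy-ball) momentum
for _ in range(200):
    v = beta * v - eta * grad(p)
    p = p + v

q, v = 5.0, 0.0              # Nesterov: gradient at the look-ahead point
for _ in range(200):
    v = beta * v - eta * grad(q + beta * v)
    q = q + v

print(p, q)   # both head to 0; the look-ahead typically damps the oscillation
```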
61
votes
4 answers

Comparing SVM and logistic regression

Can someone please give me some intuition as to when to choose either SVM or LR? I want to understand the difference between the optimization criteria for learning the hyperplane in the two methods, where the respective aims are…
user41799
  • 661
  • 1
  • 6
  • 5
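In the linear case the contrast usually comes down to the loss applied to the margin $y\,w\cdot x$: hinge loss for the SVM, log loss for logistic regression, each plus an L2 penalty. An illustrative sketch of the two objectives on simulated data (the data and weight vector are assumptions):

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.maximum(0.0, 1.0 - margins))   # hinge loss

def logreg_objective(w, X, y, C=1.0):
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.log1p(np.exp(-margins)))       # log loss

rng = np.random.default_rng(0)
y = np.where(rng.random(100) < 0.5, 1.0, -1.0)
X = rng.normal(size=(100, 2)) + 1.5 * y[:, None]   # two separable-ish clusters
w = np.array([0.5, 0.5])
print(svm_objective(w, X, y), logreg_objective(w, X, y))
```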
56
votes
6 answers

Practical hyperparameter optimization: Random vs. grid search

I'm currently going through Bergstra and Bengio's Random Search for Hyper-Parameter Optimization [1] where the authors claim random search is more efficient than grid search in achieving approximately equal performance. My question is: Do people…
Bar
  • 2,492
  • 3
  • 19
  • 31
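A toy version of the paper's core argument, assuming only one of two hyperparameters actually matters and an equal budget of 16 evaluations: a 4x4 grid tries only 4 distinct values of the important one, while random search tries 16.

```python
import numpy as np

def validation_score(lr, reg):        # hypothetical objective; only lr matters
    return -(np.log10(lr) + 2.0) ** 2

rng = np.random.default_rng(0)

grid_lr = np.logspace(-4, 0, 4)
grid_reg = np.logspace(-4, 0, 4)
grid_best = max(validation_score(lr, reg) for lr in grid_lr for reg in grid_reg)

rand_best = max(
    validation_score(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-4, 0))
    for _ in range(16)
)
print(grid_best, rand_best)   # random search usually lands closer to the optimum at lr = 1e-2
```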
52
votes
1 answer

Do we have to tune the number of trees in a random forest?

Software implementations of random forest classifiers have a number of parameters to allow users to fine-tune the algorithm's behavior, including the number of trees $T$ in the forest. Is this a parameter that needs to be tuned, in the same way as…
Sycorax
  • 76,417
  • 20
  • 189
  • 313
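An empirical sketch of one common observation, using scikit-learn's out-of-bag score on simulated data (dataset and tree counts are arbitrary): OOB error tends to level off as trees are added rather than turning back up, so $T$ mostly needs to be "large enough".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for n_trees in [25, 100, 300, 500]:
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(n_trees, round(1 - rf.oob_score_, 3))   # OOB error should plateau
```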
52
votes
1 answer

Understanding "almost all local minimum have very similar function value to the global optimum"

In a recent blog post by Rong Ge, it was said that: It is believed that for many problems including learning deep nets, almost all local minima have very similar function values to the global optimum, and hence finding a local minimum is good…
49
votes
1 answer

How does the Adam method of stochastic gradient descent work?

I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam: Adam: A Method for Stochastic Optimization. While I've definitely got some insights (at least), the paper seems to be too high…
daniel451
  • 2,635
  • 6
  • 22
  • 26
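A bare-bones sketch of the Adam update rule from Kingma & Ba, applied to a toy quadratic (step size, constants, and iteration count are arbitrary): exponential moving averages of the gradient and squared gradient, bias-corrected, then a per-coordinate scaled step.

```python
import numpy as np

def grad(theta):                     # gradient of f(theta) = sum(theta^2)
    return 2.0 * theta

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)             # first-moment estimate
v = np.zeros_like(theta)             # second-moment estimate

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)       # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # heads toward [0, 0]
```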
47
votes
1 answer

Neural Networks: weight change momentum and weight decay

Momentum $\alpha$ is used to diminish the fluctuations in weight changes over consecutive iterations: $$\Delta\omega_i(t+1) = - \eta\frac{\partial E}{\partial w_i} + \alpha \Delta \omega_i(t),$$ where $E({\bf w})$ is the error function, ${\bf w}$ -…
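A sketch combining the quoted momentum update with an L2 weight-decay term $\lambda$; the stand-in error function and constants are assumptions, with $\eta$ the learning rate and $\alpha$ the momentum as in the question.

```python
import numpy as np

def grad_E(w):                        # gradient of a stand-in error E(w) = |w - 1|^2
    return 2.0 * (w - 1.0)

eta, alpha, lam = 0.1, 0.9, 0.01
w = np.zeros(3)
delta_w = np.zeros(3)

for _ in range(200):
    # weight decay adds lam*w to the gradient (equivalently shrinks w each step)
    delta_w = -eta * (grad_E(w) + lam * w) + alpha * delta_w
    w = w + delta_w

print(w)   # settles slightly below 1 because of the decay term
```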
40
votes
1 answer

PCA objective function: what is the connection between maximizing variance and minimizing error?

The PCA algorithm can be formulated in terms of the correlation matrix (assume the data $X$ has already been normalized and we are only considering projection onto the first PC). The objective function can be written as: $$ \max_w (Xw)^T(Xw)\; \:…
Cam.Davidson.Pilon
  • 11,476
  • 5
  • 47
  • 75
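A numeric illustration of the equivalence behind the question, on simulated centered 2-D data: the unit vector maximizing the projection variance also minimizes the squared reconstruction error.

```python
# For centered X and unit w: ||X - X w w^T||^2 = ||X||^2 - ||Xw||^2,
# so maximizing variance and minimizing reconstruction error pick the same w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4, 1.5], [1.5, 1]], size=3000)
X -= X.mean(axis=0)

angles = np.linspace(0, np.pi, 2000)
W = np.column_stack([np.cos(angles), np.sin(angles)])   # candidate unit vectors

variances = np.var(X @ W.T, axis=0)
recon_err = np.array([np.sum((X - np.outer(X @ w, w)) ** 2) for w in W])

print(angles[np.argmax(variances)], angles[np.argmin(recon_err)])   # same angle
```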