Questions tagged [optimization]

Use this tag for any use of optimization within statistics.

In statistics, optimization is often used to select an estimator of a parameter by maximizing or minimizing some function of the data. One very common example is choosing the estimator that maximizes the joint density (or mass function) of the observed data; this is known as Maximum Likelihood Estimation (MLE).
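A minimal sketch of that MLE idea in Python (the simulated normal sample, starting values, and use of scipy.optimize are illustrative assumptions, not a reference implementation):

```python
# Minimal sketch: maximum likelihood estimation for a normal sample
# by numerically minimizing the negative log-likelihood (illustrative only).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)   # simulated observations

def neg_log_likelihood(params, x):
    mu, log_sigma = params          # log-parameterize sigma to keep it positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma) ** 2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)            # should be close to 5 and 2
```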

Optimization is a large field with many uses beyond estimation. In design of experiments we can choose a design by maximizing a certain determinant (D-optimality). In industrial statistics we can use a fitted response-surface model to optimize the economy or yield of a process. Wikipedia has an overview article: https://en.wikipedia.org/wiki/Mathematical_optimization .
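A toy sketch of the D-optimality idea for a straight-line model, assuming a hypothetical candidate grid and run count and a brute-force search:

```python
# Illustrative sketch of D-optimality: among candidate designs, keep the one
# maximizing det(X'X) for a straight-line model y = b0 + b1*x.
import itertools
import numpy as np

candidates = np.linspace(-1, 1, 11)   # assumed grid of candidate design points
n_runs = 4                            # hypothetical number of experimental runs

best_det, best_design = -np.inf, None
for design in itertools.combinations_with_replacement(candidates, n_runs):
    X = np.column_stack([np.ones(n_runs), np.array(design)])   # model matrix
    d = np.linalg.det(X.T @ X)
    if d > best_det:
        best_det, best_design = d, design

print(best_design)   # for a line on [-1, 1], the runs pile up at the endpoints
```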

2533 questions
219
votes
9 answers

Why is Newton's method not widely used in machine learning?

This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes: After reviewing a set of lectures on convex optimization, Newton's method seems to be far superior to gradient…
Fei Yang
  • 2,181
  • 3
  • 8
  • 4
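A rough sketch of the trade-off behind this question, on an assumed 2-D toy quadratic: a Newton step solves a linear system with the Hessian, which is exact here but scales poorly with the number of parameters.

```python
# One gradient-descent step vs. one Newton step on a toy quadratic.
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # assumed positive-definite Hessian
b = np.array([1.0, -2.0])

def f(x):    return 0.5 * x @ A @ x - b @ x
def grad(x): return A @ x - b
def hess(x): return A                    # constant Hessian for a quadratic

x = np.zeros(2)
x_gd = x - 0.1 * grad(x)                             # cheap step, many needed
x_newton = x - np.linalg.solve(hess(x), grad(x))     # solves a d-by-d system

print(f(x_gd), f(x_newton))   # Newton lands at the exact minimizer of the quadratic
```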
139
votes
5 answers

Batch gradient descent versus stochastic gradient descent

Suppose we have some training set $(x_{(i)}, y_{(i)})$ for $i = 1, \dots, m$. Also suppose we run some type of supervised learning algorithm on the training set. Hypotheses are represented as $h_{\theta}(x_{(i)}) = \theta_0+\theta_{1}x_{(i)1} +…
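An illustrative contrast of the two update rules for plain linear regression (the simulated data, learning rates, and epoch counts are assumptions, not from the question):

```python
# Minimal sketch of batch vs. stochastic gradient descent for linear regression
# with the bias folded into the feature vector; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 3.0]) + 0.1 * rng.normal(size=200)

def batch_gd(X, y, lr=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        theta -= lr * X.T @ (X @ theta - y) / len(y)   # uses every example per step
    return theta

def sgd(X, y, lr=0.01, epochs=10):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):              # one example per update
            theta -= lr * (X[i] @ theta - y[i]) * X[i]
    return theta

print(batch_gd(X, y), sgd(X, y))   # both approach the true coefficients [1, 3]
```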
114
votes
2 answers

tanh activation function vs sigmoid activation function

The tanh activation function is: $$\tanh(x) = 2 \cdot \sigma(2x) - 1$$ where $\sigma(x)$, the sigmoid function, is defined as $$\sigma(x) = \frac{e^x}{1 + e^x}.$$ Questions: Does it really matter between using those…
satya
  • 1,293
  • 2
  • 9
  • 9
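A quick numeric check of the identity quoted in the question (NumPy only; the grid of test points is arbitrary):

```python
# Verify tanh(x) = 2*sigmoid(2x) - 1 numerically.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # same as e^x / (1 + e^x)

x = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True
# tanh outputs are centered in (-1, 1) while sigmoid outputs lie in (0, 1),
# which is the main practical difference usually discussed.
```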
112
votes
6 answers

Is it possible to train a neural network without backpropagation?

Many neural network books and tutorials spend a lot of time on the backpropagation algorithm, which is essentially a tool to compute the gradient. Let's assume we are building a model with ~10K parameters / weights. Is it possible to run the…
Haitao Du
  • 32,885
  • 17
  • 118
  • 213
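One alternative usually contrasted with backpropagation is a finite-difference estimate of the gradient; a minimal sketch with a hypothetical stand-in loss rather than a real network:

```python
# Central finite differences: two extra function evaluations per parameter,
# which is why backpropagation is preferred for ~10K weights.
import numpy as np

def loss(w):                       # stand-in for a network's forward pass + loss
    return np.sum((w - 1.0) ** 2)

def numerical_gradient(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(w.size):        # one loop iteration (2 evaluations) per weight
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.zeros(5)
print(numerical_gradient(loss, w))   # close to the exact gradient 2*(w - 1) = -2
```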
91
votes
7 answers

Why optimize the log probability instead of the probability

In most machine learning tasks where you can formulate some probability $p$ which should be maximised, we would actually optimize the log probability $\log p$ instead of the probability for some parameters $\theta$. E.g. in maximum likelihood…
Albert
  • 1,145
  • 1
  • 9
  • 12
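A small illustration of the numerical part of the usual answer, using hypothetical per-observation likelihoods:

```python
# A product of many small probabilities underflows in double precision,
# while the sum of their logs stays well-behaved (and has the same maximizer).
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=2000)   # hypothetical per-observation likelihoods

print(np.prod(p))          # 0.0 -- underflows
print(np.sum(np.log(p)))   # finite
```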
75
votes
6 answers

What is an intuitive explanation for how PCA turns from a geometric problem (with distances) to a linear algebra problem (with eigenvectors)?

I've read a lot about PCA, including various tutorials and questions (such as this one, this one, this one, and this one). The geometric problem that PCA is trying to optimize is clear to me: PCA tries to find the first principal component by…
stackoverflowuser2010
  • 3,190
  • 5
  • 27
  • 35
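A numeric sketch of the link the question asks about, on assumed 2-D toy data: the brute-force max-variance direction matches the top eigenvector of the covariance matrix.

```python
# The direction maximizing the variance of the projections is an eigenvector
# of the covariance matrix (checked here against a brute-force search).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=2000)
X -= X.mean(axis=0)

C = X.T @ X / len(X)                      # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
w_eig = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

# brute-force search over unit directions for the max-variance projection
angles = np.linspace(0, np.pi, 2000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
w_best = dirs[np.argmax(np.var(X @ dirs.T, axis=0))]

print(np.abs(w_eig @ w_best))             # close to 1: same direction (up to sign)
```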
73
votes
6 answers

Optimization when Cost Function Slow to Evaluate

Gradient descent and many other methods are useful for finding local minima in cost functions. They can be efficient when the cost function can be evaluated quickly at each point, whether numerically or analytically. I have what appears to me to…
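One family of answers to this situation is Bayesian optimization: fit a cheap surrogate to the few expensive evaluations and choose the next point by expected improvement. A minimal sketch with a Gaussian-process surrogate; the stand-in cost function, kernel, grid, and budget are assumptions, and scikit-learn/SciPy are used for convenience.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_cost(x):                   # stand-in for the slow cost function
    return np.sin(3 * x) + 0.5 * x**2

grid = np.linspace(-2, 2, 400).reshape(-1, 1)
X_obs = np.array([[-1.5], [0.0], [1.5]])   # a few initial evaluations
y_obs = expensive_cost(X_obs).ravel()

for _ in range(10):                      # each loop costs one expensive evaluation
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sd = gp.predict(grid, return_std=True)
    imp = y_obs.min() - mu               # improvement over the best value (minimizing)
    z = imp / np.maximum(sd, 1e-12)
    ei = imp * norm.cdf(z) + sd * norm.pdf(z)   # expected improvement
    x_next = grid[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next.reshape(1, 1)])
    y_obs = np.append(y_obs, expensive_cost(x_next))

print(X_obs[np.argmin(y_obs)], y_obs.min())
```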
71
votes
4 answers

What's the difference between momentum based gradient descent and Nesterov's accelerated gradient descent?

So momentum based gradient descent works as follows: $v=\beta m-\eta g$ where $m$ is the previous weight update, and $g$ is the current gradient with respect to the parameters $p$, $\eta$ is the learning rate, and $\beta$ is a constant. $p_{new} = p…
applecider
  • 1,175
  • 2
  • 11
  • 13
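A side-by-side sketch of the two update rules on a 1-D toy quadratic, following the question's notation; the function, step size, and momentum constant are arbitrary choices.

```python
# Classical momentum vs. Nesterov: the only difference is where the gradient
# is evaluated (current point vs. the "look-ahead" point).
import numpy as np

def grad(p):                 # gradient of f(p) = p^2
    return 2.0 * p

eta, beta = 0.1, 0.9

p, v = 5.0, 0.0              # classical (heavy-ball) momentum
for _ in range(200):
    v = beta * v - eta * grad(p)
    p = p + v

q, v = 5.0, 0.0              # Nesterov: gradient at the look-ahead point
for _ in range(200):
    v = beta * v - eta * grad(q + beta * v)
    q = q + v

print(p, q)   # both head to 0; the look-ahead typically damps the oscillation
```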
61
votes
4 answers

Comparing SVM and logistic regression

Can someone please give me some intuition as to when to choose either SVM or LR? I want to understand the difference between the optimization criteria for learning the hyperplane in the two methods, where the respective aims are…
user41799
  • 661
  • 1
  • 6
  • 5
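In the linear case the contrast usually comes down to the loss applied to the margin $y\,w\cdot x$: hinge loss for the SVM, log loss for logistic regression, each plus an L2 penalty. An illustrative sketch of the two objectives on simulated data (the data and weight vector are assumptions):

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.maximum(0.0, 1.0 - margins))   # hinge loss

def logreg_objective(w, X, y, C=1.0):
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.log1p(np.exp(-margins)))       # log loss

rng = np.random.default_rng(0)
y = np.where(rng.random(100) < 0.5, 1.0, -1.0)
X = rng.normal(size=(100, 2)) + 1.5 * y[:, None]   # two separable-ish clusters
w = np.array([0.5, 0.5])
print(svm_objective(w, X, y), logreg_objective(w, X, y))
```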
56
votes
6 answers

Practical hyperparameter optimization: Random vs. grid search

I'm currently going through Bergstra and Bengio's Random Search for Hyper-Parameter Optimization [1] where the authors claim random search is more efficient than grid search in achieving approximately equal performance. My question is: Do people…
Bar
  • 2,492
  • 3
  • 19
  • 31
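A toy version of the paper's core argument, assuming only one of two hyperparameters actually matters and an equal budget of 16 evaluations: a 4x4 grid tries only 4 distinct values of the important one, while random search tries 16.

```python
import numpy as np

def validation_score(lr, reg):        # hypothetical objective; only lr matters
    return -(np.log10(lr) + 2.0) ** 2

rng = np.random.default_rng(0)

grid_lr = np.logspace(-4, 0, 4)
grid_reg = np.logspace(-4, 0, 4)
grid_best = max(validation_score(lr, reg) for lr in grid_lr for reg in grid_reg)

rand_best = max(
    validation_score(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-4, 0))
    for _ in range(16)
)
print(grid_best, rand_best)   # random search usually lands closer to the optimum at lr = 1e-2
```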
52
votes
1 answer

Do we have to tune the number of trees in a random forest?

Software implementations of random forest classifiers have a number of parameters to allow users to fine-tune the algorithm's behavior, including the number of trees $T$ in the forest. Is this a parameter that needs to be tuned, in the same way as…
Sycorax
  • 76,417
  • 20
  • 189
  • 313
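An empirical sketch of one common observation, using scikit-learn's out-of-bag score on simulated data (dataset and tree counts are arbitrary): OOB error tends to level off as trees are added rather than turning back up, so $T$ mostly needs to be "large enough".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for n_trees in [25, 100, 300, 500]:
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(n_trees, round(1 - rf.oob_score_, 3))   # OOB error should plateau
```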
52
votes
1 answer

Understanding "almost all local minimum have very similar function value to the global optimum"

In a recent blog post by Rong Ge, it was said that: It is believed that for many problems including learning deep nets, almost all local minima have very similar function values to the global optimum, and hence finding a local minimum is good…
49
votes
1 answer

How does the Adam method of stochastic gradient descent work?

I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam: Adam: A Method for Stochastic Optimization. While I've definitely got some insights (at least), the paper seems to be too high…
daniel451
  • 2,635
  • 6
  • 22
  • 26
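A bare-bones sketch of the Adam update rule from Kingma & Ba, applied to a toy quadratic (step size, constants, and iteration count are arbitrary): exponential moving averages of the gradient and squared gradient, bias-corrected, then a per-coordinate scaled step.

```python
import numpy as np

def grad(theta):                     # gradient of f(theta) = sum(theta^2)
    return 2.0 * theta

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)             # first-moment estimate
v = np.zeros_like(theta)             # second-moment estimate

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)       # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # heads toward [0, 0]
```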
47
votes
1 answer

Neural Networks: weight change momentum and weight decay

Momentum $\alpha$ is used to diminish the fluctuations in weight changes over consecutive iterations: $$\Delta\omega_i(t+1) = - \eta\frac{\partial E}{\partial w_i} + \alpha \Delta \omega_i(t),$$ where $E({\bf w})$ is the error function, ${\bf w}$ -…
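A sketch combining the quoted momentum update with an L2 weight-decay term $\lambda$; the stand-in error function and constants are assumptions, with $\eta$ the learning rate and $\alpha$ the momentum as in the question.

```python
import numpy as np

def grad_E(w):                        # gradient of a stand-in error E(w) = |w - 1|^2
    return 2.0 * (w - 1.0)

eta, alpha, lam = 0.1, 0.9, 0.01
w = np.zeros(3)
delta_w = np.zeros(3)

for _ in range(200):
    # weight decay adds lam*w to the gradient (equivalently shrinks w each step)
    delta_w = -eta * (grad_E(w) + lam * w) + alpha * delta_w
    w = w + delta_w

print(w)   # settles slightly below 1 because of the decay term
```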
40
votes
1 answer

PCA objective function: what is the connection between maximizing variance and minimizing error?

The PCA algorithm can be formulated in terms of the correlation matrix (assume the data $X$ has already been normalized and we are only considering projection onto the first PC). The objective function can be written as: $$ \max_w (Xw)^T(Xw)\; \:…
Cam.Davidson.Pilon
  • 11,476
  • 5
  • 47
  • 75
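A numeric illustration of the equivalence behind the question, on simulated centered 2-D data: the unit vector maximizing the projection variance also minimizes the squared reconstruction error.

```python
# For centered X and unit w: ||X - X w w^T||^2 = ||X||^2 - ||Xw||^2,
# so maximizing variance and minimizing reconstruction error pick the same w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4, 1.5], [1.5, 1]], size=3000)
X -= X.mean(axis=0)

angles = np.linspace(0, np.pi, 2000)
W = np.column_stack([np.cos(angles), np.sin(angles)])   # candidate unit vectors

variances = np.var(X @ W.T, axis=0)
recon_err = np.array([np.sum((X - np.outer(X @ w, w)) ** 2) for w in W])

print(angles[np.argmax(variances)], angles[np.argmin(recon_err)])   # same angle
```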