

I'm currently trying to maximize my output $y$, based on my inputs $X$.

Say there are inputs $X = \{x_1, x_2, x_3, \cdots, x_j\}$. Each example has all the features in $X$ and a $y$ value, i.e. one example is $(X_i, y_i)$.

What I want to do is find the values for $X = \{x_j\}$ that will produce the maximum $y$.

One (shitty) idea is to create a neural network with $j$ input nodes in the first layer and a single output node. Training the NN would let me predict the output $y$ from $X$, which by itself isn't helpful here. I would then generate a bunch of randomized values for $X$ and keep whichever one the network predicts to give the largest $y$, which is obviously super inefficient.
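Roughly what I have in mind, as a minimal sketch (assuming scikit-learn; the arrays below are just placeholders for my real data):

```
# Sketch of the "train a NN, then randomly search the inputs" idea.
# Assumes scikit-learn; X_data and y_data are placeholders for my real dataset.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_data = rng.uniform(0, 1, size=(200, 6))   # placeholder features
y_data = rng.uniform(0, 1, size=200)        # placeholder outputs

nn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000).fit(X_data, y_data)

# Brute force: predict on lots of random candidate inputs and keep the best one.
candidates = rng.uniform(0, 1, size=(100_000, 6))
best_X = candidates[np.argmax(nn.predict(candidates))]
```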

Another (shitty) idea is to train the NN as above, but then use some sort of reinforcement learning to optimize the inputs. I don't know much about RL, but it seems unnecessary and unhelpful here, because the training examples are varied, which means the $y$ value wouldn't be optimized? (I don't know much about this specifically.)

Is there a specific model or algorithm that would let me find the maximum $y$ based on my $X$ values?

This question is pretty similar to this one, but there they don't actually give a specific model to follow.

NEW EDIT: As requested by @whuber, a clearer explanation of the context. I am conducting an experiment that has many parameters I am able to vary. I can then build a large(ish) dataset in which each of the parameters/features somehow affects my final output $y$. The inputs are my features/parameters $X = \{x_j\}$, and each example $(X_i, y_i)$ contains a variation on the features called $X_i$, where $i$ indexes the training examples and $j$ indexes the features in $X$. I have no idea what the relationship between $X$ and $y$ is, so I don't know what $f$ maps $X \mapsto y$.

I want to know what changes I can make to my parameters ($X = \{x_j\}$) in my experiment in order to maximize my output ($y$).

Andre Fu
  • So you start with a dataset $\{(X_i, y_i)\}$, with $y_i \approx f(X_i)$ for some unknown function $f$, and you're trying to find the maximizer of $f$? (In particular, you can't get any more evaluations of $f$ as you go?) – Danica Jun 11 '18 at 17:16
  • This is just a specific re-statement of the general problem of black-box, non-convex optimization. All of the ideas in this thread are applicable. https://stats.stackexchange.com/questions/193306/optimization-when-cost-function-slow-to-evaluate/193310#193310 – Sycorax Jun 11 '18 at 18:01
  • @Sycorax Not sure that's quite true, if you have a fixed dataset and can't get new evaluations. I think the idea of maximizing a surrogate function as suggested in OP's link makes sense, and is of course related.... – Danica Jun 11 '18 at 18:53
  • It's not clear to me whether the evaluations are fixed or not. – Sycorax Jun 11 '18 at 19:12
  • @Dougal Essentially what your first comment was suggesting. I'm not familiar; what do you mean by "I can't get any more evaluations of $f$ as I go"? My dataset is fixed because it is relatively costly to produce more $(X_i, y_i)$ examples. – Andre Fu Jun 11 '18 at 20:07
  • If you tell us nothing about $f$, then all you *really* have is a bunch of inputs, each of which has a $y$ value, and you seek an input with a maximal $y$ value. It isn't possible to be sure of the solution until you have inspected all the inputs: in other words, exhaustive enumeration is the only possible solution. Thus, as a general proposition, this question could be considered either too broad or too uninteresting to be answerable. Could you possibly focus it on some particular question of interest? – whuber Jun 11 '18 at 20:13
  • Re the edit: you likely know, or suspect, a great deal about how $y$ varies with the inputs. *That's crucial information.* Even supposing the variation is continuous is better than knowing nothing at all. Better than that are assumptions about differentiability or monotonicity or convexity or rates of change or types of interactions, etc. But if you truly "have no idea" then you are in no better situation than before. – whuber Jun 11 '18 at 20:30
  • @whuber you are correct - I can suspect the variation is continuous and thus must be some sort of diff'ble $f$ But that doesn't help me know which algorithm to use to find a maximum $y$ based on my $X$ – Andre Fu Jun 11 '18 at 20:39
  • Actually, it's tremendously helpful, because it means you can use algorithms that assume differentiability--and those tend to be more efficient than general-purpose algorithms. The better your characterization of the possible solutions $f$ becomes, the more you can narrow the choice of solution methods and the more likely it becomes you will find an effective and efficient one. Another issue, which we haven't even discussed, is whether repeated experimentation with the same inputs always produces the same output. Usually it does not, and assumptions about the variation are extremely useful. – whuber Jun 11 '18 at 20:42
  • @whuber Ideally, yes, the same $X$ values should produce the same $y$, but due to human error during data collection there will probably be cases where the same values in two $X$s produce different $y$s. – Andre Fu Jun 11 '18 at 22:12

2 Answers


OP has answered this question by stating

I think I have found what I was looking for.

I can approximate my $ y = f(x_1, x_2, x_3, x_4, x_5, x_6) = g(x_1, x_2, x_3, x_4, x_5, x_6) + \varepsilon $ using an Artificial Neural Net, then from there using a genetic algorithm to optimize the output.

but I don't think that this is a good solution.

Using a neural network inherently introduces lots of hard problems: selecting how many neurons in what configuration, an initialization method, an activation function, an optimizer, a learning rate, regularization strategies (L1, L2, dropout, mixup, max norm...) -- to succeed, all of these things must be chosen to be "just right" for whatever problem you're trying to solve. Then the genetic algorithm step adds more complexity, and more tuning, on top of that.

If you generalize from using a neural network to any function approximation method, you are employing a surrogate surface, which you denote as $g$, and performing the optimization over that surface. There's no particular need to use a neural network.

One desirable attribute is that your surrogate surface interpolates well between your data points, changing values neither too quickly nor too slowly. A well-studied category of surrogate surfaces for this type of problem is the Gaussian process; these methods are usually characterized in terms of a length-scale parameter which exactly controls the behavior between data points. In this sense, there is essentially only one "knob," the length-scale, that you have to fiddle with.
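For concreteness, here is a minimal sketch of fitting such a surrogate with scikit-learn's GaussianProcessRegressor. The data arrays are placeholders standing in for your experimental examples, and the white-noise term is there because you mentioned that repeated runs of the same settings may not give identical $y$:

```
# Minimal sketch of fitting a Gaussian-process surrogate g (placeholder data).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_data = rng.uniform(0, 1, size=(50, 6))   # parameter settings you tried
y_data = rng.uniform(0, 1, size=50)        # measured outputs

# The RBF length_scale is the one "knob": it controls how quickly g is allowed
# to change between data points. WhiteKernel absorbs measurement noise.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_data, y_data)

g_mean, g_std = gp.predict(X_data[:5], return_std=True)   # surrogate value and uncertainty
```

If the fitted length-scale ends up implausibly small or large, that is usually a sign the surrogate is over- or under-smoothing your data.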

Once you're satisfied that you have a "good" surrogate surface, you can optimize it directly. The values of $g$ and its gradients are easy to compute and cheap to evaluate, so you can quite readily use multi-start optimization, or really any global optimization method, to find optima on the surface.
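Continuing the sketch above (so `gp` is the fitted surrogate), a multi-start optimization with scipy could look something like this; the bounds are placeholders for whatever ranges your parameters can actually take:

```
# Multi-start local optimization over the fitted surrogate from the sketch above.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def neg_g(x):
    # scipy minimizes, so return -g(x) to maximize the surrogate.
    return -gp.predict(x.reshape(1, -1))[0]

bounds = [(0.0, 1.0)] * 6                      # replace with the real parameter ranges
starts = rng.uniform(0.0, 1.0, size=(20, 6))   # 20 random starting points

results = [minimize(neg_g, x0, method="L-BFGS-B", bounds=bounds) for x0 in starts]
best = min(results, key=lambda r: r.fun)
print("candidate maximizer:", best.x, "predicted y:", -best.fun)
```

Each evaluation here is just a prediction from the surrogate, so running many restarts is cheap; the candidate maximizer is, of course, only as trustworthy as the surrogate itself.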

Sycorax
  • This is really beneficial! I'm confused about the *surrogate surface*, $g$. How would I go about finding a good surrogate surface? What does this have to do with the Gaussian process? – Andre Fu Jun 19 '18 at 02:47
  • A Gaussian process is my suggestion of how to estimate a surrogate surface; by contrast, your answer suggests using a neural network instead. If you've found my answer helpful, please consider up-voting and/or accepting my answer. – Sycorax Jun 19 '18 at 14:41

I think I have found what I was looking for.

I can approximate my $ y = f(x_1, x_2, x_3, x_4, x_5, x_6) = g(x_1, x_2, x_3, x_4, x_5, x_6) + \varepsilon $ using an Artificial Neural Net, then from there using a genetic algorithm to optimize the output.
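A minimal sketch of what I mean, assuming scikit-learn's MLPRegressor as the ANN and a tiny hand-rolled genetic algorithm over its predictions (the data and parameter ranges below are placeholders):

```
# Sketch: fit a neural-net surrogate, then maximize it with a small genetic algorithm.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_data = rng.uniform(0, 1, size=(200, 6))        # placeholder (X_i, y_i) examples
y_data = rng.uniform(0, 1, size=200)

nn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000).fit(X_data, y_data)

low, high = np.zeros(6), np.ones(6)              # placeholder parameter ranges
pop = rng.uniform(low, high, size=(100, 6))      # initial population

for generation in range(50):
    fitness = nn.predict(pop)                    # surrogate prediction = fitness
    parents = pop[np.argsort(fitness)[-50:]]     # keep the best half
    # Crossover: average random pairs of parents.
    pairs = rng.integers(0, len(parents), size=(100, 2))
    children = (parents[pairs[:, 0]] + parents[pairs[:, 1]]) / 2
    # Mutation: small Gaussian noise, clipped to the allowed ranges.
    children += rng.normal(scale=0.05, size=children.shape)
    pop = np.clip(children, low, high)

best = pop[np.argmax(nn.predict(pop))]
print("candidate maximizer:", best, "predicted y:", nn.predict(best.reshape(1, -1))[0])
```

(scipy's `differential_evolution` would be a more polished off-the-shelf alternative to this hand-rolled loop.)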

Andre Fu