5

Suppose I have a lot of training data, x and y, where y is the position of the maximum value of x. Can I train a regression model (e.g., logistic or linear) to approximate the POSITION of the maximum value of x?

For example, if x = [1 3 4 10 2 -1 -2], then y will be 4 (because the 4th position has the value 10, which is the maximum value).

If we do max(x), we get 10. If we do maxpos(x), we get 4. <-- I want to use a regression model to approximate the function maxpos.
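(To make maxpos concrete: in R, for example, this target function is just `which.max`.)

x <- c(1, 3, 4, 10, 2, -1, -2)
which.max(x)   # 4, the position of the maximum value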

Thanks!

RockTheStar
  • Any assumptions about the form of the curve? Theoretically, you could fit the derivative of the curve and find where it equals zero (assuming a unimodal curve) –  Mar 29 '16 at 21:50
  • No. This is not how regression works. In regression, every value of X has an associated value of Y. Height and weight; IQ and age; whatever. So, please post some sample data and tell us what you want to find out... OK, now I see one data point. But I am still not at all clear what you want. – Peter Flom Mar 29 '16 at 21:52
  • I describe a profile likelihood method to obtain a confidence interval for such an estimate in an answer at http://stats.stackexchange.com/a/40609/919 . – whuber Mar 29 '16 at 21:59
  • @bey Hmm... I never thought about derivatives, but can a regression model approximate the derivative of a function? – RockTheStar Mar 29 '16 at 22:27
  • @PeterFlom what I want is the POSITION of the maximum value of an array, x. In my example, I have an array x with 7 points. max(x) will give me 10, but the position of 10 is 4. I want to make a regression model to get me the 4, i.e. maxpos(x)=4. – RockTheStar Mar 29 '16 at 22:29
  • @whuber Thanks. Can you elaborate on it? – RockTheStar Mar 29 '16 at 22:30
  • @RockTheStar depends on the family of models you are considering (hence my first comment). If it's reasonably smooth, you could use something like LOWESS to get derivatives from your (x,y) data, then take the theoretical derivative of your family of models, and then find the closest fit from that family of derivatives to the derivative of the LOWESS curve. However, I'd take a look at whuber's response... it appears more statistically motivated. –  Mar 29 '16 at 23:02
  • I don't think that makes any sense. It's not regression. – Peter Flom Mar 29 '16 at 23:51
  • Your title asks for the maximum of a curve but your example judges it by the largest observation. Unless you have essentially no noise (in which case, you're probably in the wrong place here), the two are [*not* the same thing](http://i.imgur.com/NvJ0ags.png). – Glen_b Mar 30 '16 at 04:25
  • As others have pointed out, you are not really describing a regression problem... not even sure if it's a statistical problem. However, a valid statistical problem would be to view your data as a time series and then train a logistic regression (for example) to tell you the probability that the next point will be the maximum point. That is a clearly defined problem that could be tackled. –  Mar 30 '16 at 12:44
  • OK, then do you know how to do logistic regression? If not, there are a ton of resources out there. Also, you should ask a new question with a more focused title and problem statement. –  Mar 30 '16 at 18:33

2 Answers

3

Mode estimation is a big fat unknown in the statistical world. It is an unsupervised process, so regression would have to be framed carefully. Obviously, with discrete $X$ you can take your $Y$ to be counts and fit a loglinear model to those data. But your example, where $Y$ is magically taken to be $4$ which is the mode of the $X$ sample... that makes no sense.
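For instance, one toy reading of that loglinear suggestion in R (the data and the quadratic trend are made up purely for illustration):

x_sample <- c(1, 3, 4, 10, 2, -1, -2, 3, 4, 4)   # hypothetical discrete sample
tab <- as.data.frame(table(x_sample))             # Y = count of each distinct value
tab$x <- as.numeric(as.character(tab$x_sample))
fit <- glm(Freq ~ poly(x, 2), family = poisson, data = tab)   # loglinear model for the counts
tab$x[which.max(fitted(fit))]                     # value with the largest fitted count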

The best nonparametric way I'm aware of to estimate the mode from a sample is to use a density smoother with a boxcar, Gaussian, or Epanechnikov kernel. I think it's $n^{1/5}$ efficient, which is to say incredibly slow.
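A minimal sketch of that kernel-density route in R (reusing the toy x_sample above, or any numeric sample):

d <- density(x_sample, kernel = "gaussian")   # "rectangular" (boxcar) and "epanechnikov" also work
d$x[which.max(d$y)]                            # location of the density peak = mode estimate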

Otherwise, maximum likelihood is a fully parametric approach. If you assume a certain density, like normal, gamma, double exponential, or even Cauchy, you need only maximize the likelihood for that parametric density and calculate the mode from your distributional assumptions.
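And a minimal parametric sketch, assuming (purely for illustration) a gamma model and strictly positive data; `pos_sample` is a stand-in name for such a sample:

library(MASS)                                   # for fitdistr()
fit <- fitdistr(pos_sample, "gamma")            # maximum likelihood fit of the gamma density
shape <- fit$estimate["shape"]; rate <- fit$estimate["rate"]
if (shape > 1) (shape - 1) / rate else 0        # mode of a gamma(shape, rate) distribution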

EDIT

One can use nonparametric methods with splines to fit a general non-linear trend, find the maximum of the fitted mean, and read off the corresponding $X$ value. For a small dataset this doesn't make sense, but with hundreds of continuously valued observations it does.

Example:

set.seed(1)
x <- seq(-3, 3, by=0.1)
f <- function(x) .3*x^3 - 4*log(abs(x))    # true curve (blows up near x = 0)
curve(f, from=-3, to=3)
y <- rnorm(length(x), f(x), 2)             # noisy observations around f(x)
points(x, y)
keep <- is.finite(y)                       # guard against non-finite values near the singularity
l <- loess(y[keep] ~ x[keep])
abline(v = x[keep][which.max(predict(l))], col='red')   # x at the largest fitted value
legend('bottomright', lty=1, col='red', 'Predicted max', bty='n', cex=0.5)

[Plot: simulated points around f(x) with the loess fit's predicted maximum marked by a vertical red line]

AdamO
  • I suspect your interpretation of the question might not be the one intended. *Position* 4 was selected because the *value* there, equal to 10, is the largest of the group, *not* because the value 4 appeared twice. In other words, given a set of $(y,x)$ data (I adopt the reversed coordinates named in the question), the OP appears to wish to fit a model $E(x)=f(y)$ and identify the $y^{*}$ for which $\hat f(y^{*})$ is greatest. It is also important to obtain a standard error or confidence interval for $y^{*}$. I don't think distribution fitting or finding modes are involved here. – whuber Mar 29 '16 at 22:04
  • @whuber you're right. – AdamO Mar 29 '16 at 22:06
1

Disclaimer: I had never thought about this problem before. I just did now, and after ~30 minutes of thinking my brain managed to spit out an answer. Please scrutinize it.

TL;DR

It depends on the process that generates those numbers. For some processes it could be that, say, whenever the 2nd component has a value between 1 and 2, the position of the largest value is always the 7th. In those cases it would be possible to find regression models that are useful to a degree that depends on the process.
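To make that first case concrete, here is a toy simulation in R (the rule, the sample size, and the use of multinomial logistic regression are all my own illustrative choices):

library(nnet)                                   # multinom(): multinomial logistic regression
set.seed(42)
n <- 500; d <- 7
X <- matrix(runif(n * d, -1, 1), n, d)
X[, 2] <- runif(n, 0, 3)
pos <- ifelse(X[, 2] >= 1 & X[, 2] <= 2, 7, 1)  # the rule the process follows
X[cbind(1:n, pos)] <- 10                        # plant the maximum at that position
dat <- data.frame(y = factor(apply(X, 1, which.max)), X)
fit <- multinom(y ~ ., data = dat, trace = FALSE)
mean(predict(fit, newdata = dat) == dat$y)      # close to 1: this structure is learnable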

But on the other hand, if the process is totally random, then in my view a regression can never do anything better than this trivial function:

if the input vector appears identically in the learning set:
    return the position stored for that vector
else:
    return a random position from {1, 2, ..., d}
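A runnable R sketch of that lookup (X_train, y_train, and the function name are my own):

trivial_maxpos <- function(x_new, X_train, y_train) {
  hit <- which(apply(X_train, 1, function(r) all(r == x_new)))   # rows identical to x_new
  if (length(hit) > 0) y_train[hit[1]]     # seen before: return its stored position
  else sample(ncol(X_train), 1)            # never seen: guess a position from 1..d
}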

There are many problems one can pose that share this unfriendly property. For example, can a regression function figure out $a+b$ when given many values of $a$ and $b$ and their sums? Your question is essentially similar: can we use regression to learn an operator?

I cannot think of a way of solving those problems by regression without cheating (e.g. by adding fancy new components to the samples).


Let's say these are your training samples: \begin{equation} \mathbf{X} = \begin{pmatrix} 1& 3 &4 &10& 2 &-1 &-2\\ 2& 66 &3 &3& 3 &3 &3\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 5& 4 &0 &0& 9 &7 &2\\ \end{pmatrix} \end{equation} with the following target variables, one per training sample, each giving the position of the component/dimension with the highest value: \begin{equation} \mathbf{Y} = \begin{pmatrix} 4\\ 2\\ \vdots\\ 6\\ \end{pmatrix} \end{equation}

So $x_i = (x_{i,1},x_{i,2}, \ldots, x_{i,d})$ is one of your training samples (a $d$-dimensional vector), and since you have $n$ samples, stacking those vectors atop each other gives the $n \times d$ matrix above.

An optimistic estimator of the position:

Let's say that, by analysing your learning samples and their target variables (of which there are only $n$), you managed to obtain the following estimated PDFs:

  • $\hat f_{X}$: the PDF of the random variable $X$; $X$ takes values in the set of vectors $\{(\mathbf{x}_{i,1},\mathbf{x}_{i,2},\ldots,\mathbf{x}_{i,d}):1 \le i \le n\}$, where $\mathbf{x}_{i,j}$ is an entry of the matrix $\mathbf{X}$.
  • $\hat f_{X,Y}$: the joint PDF of the random variables $X$ and $Y$; $Y$ takes values in the set of target values $\{1,2,\ldots,d\}$.

Then the expected value of the position of the largest component is: \begin{equation} \sum_{y=1}^d y\,\frac{\hat f_{X,Y}(\mathbf{x},y)}{\hat f_{X}(\mathbf{x})} \end{equation}
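With discrete data and exact matching, the empirical version of this estimator is just the average target value among the training vectors identical to $\mathbf{x}$; a sketch with my own variable names:

cond_expected_pos <- function(x_new, X_train, y_train) {
  match <- apply(X_train, 1, function(r) all(r == x_new))   # rows equal to x_new
  if (!any(match)) return(NA)     # here f_hat_X(x) = 0: no estimate available
  mean(y_train[match])            # = sum_y y * f_hat(x, y) / f_hat(x)
}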

The accuracy depends on how lucky we are in finding good estimations of the PDFs. The question is: how lucky can we be when we only have $n$ samples?

In my view, the answer is: it also depends on the process that generates those vectors. If we are lucky and the process is one that leaves a lot of correlation between the vector component values and the target variable, then this naive estimator could achieve 100% accuracy.

Another estimator of the position under more pessimistic assumptions

Let's go almost extreme:

  • The process that generates those vectors is random.
  • But component values can only be in $\{-10,-9,\ldots,-1,0,1,\ldots,9,10\}$. I.e., we are dealing with PMFs instead of PDFs (because it's all discrete now).

I can't think of a proof now (it's sleep time here; or maybe I am wrong), but my strong feeling at the moment is that perfectly accurate estimates of the PMFs of such a random process can only exist when the learning samples (i.e. those $n$ vectors) cover all possible sample vector values.

The smallest possible value for $n$ can be achieved only if the learning samples are chosen with perfect randomness (a uniform distribution) from the population. When this perfect random sampling happens, the smallest value for $n$ is $21^d$.

So if the learning samples are perfectly uniform, then $n = 21^d$ must hold in order to allow for finding perfect PMFs that can perfectly identify the position of the largest component.
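For instance, with $d = 7$ as in the question's example, that is already $21^7 = 1{,}801{,}088{,}541$ distinct vectors to cover.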

In other words, while this is computationally feasible, it is far more computationally efficient to just keep a linked list of the unique vectors instead! Which defeats the point of using statistical methods to predict.

Yet another estimator under even more pessimistic assumptions

Let's go more extreme:

  • The process that generates those vectors is random.
  • Component values can take any value in $\mathbb{R}$!

Then if sampling is perfectly random, $n$ must be uncountably infinite in order to perfectly predict the function.

This is even worse than the previous case: it's computationally infeasible.

caveman
  • You might want to get some background in this question by reading about [response surface methodology](https://en.wikipedia.org/wiki/Response_surface_methodology). – whuber Mar 30 '16 at 01:48
  • @caveman Thanks for the detailed explanation. I am a bit confused about the "An optimistic estimator of the position" part; can you elaborate on that? – RockTheStar Mar 30 '16 at 18:20
  • @RockTheStar Sure, which part is unclear? (@whuber thank you dude, I am reading about that) – caveman Mar 30 '16 at 18:29
  • @caveman the "An optimistic estimator of the position" – RockTheStar Apr 01 '16 at 20:43
  • @RockTheStar it's optimistic because it assumes that the process that generates the component values of the $d$-dimensional vector is not random. For example, every time the $i^{th}$ component has the highest value, the $j^{th}$ component equals $c_i$, where $c_i$ is some constant. In other words, we are assuming that the process that generates those vectors and their labels also ensures that when component $i$ is the largest component, there is some other component $j$ that takes some specific value. – caveman Apr 01 '16 at 21:37