
I implemented a Kernel Density Estimator, and I have a multivariate dataset that I use with it. I would like to find the point of highest likelihood. One way I thought of is to sample n points using the KDE as my PDF and then run a gradient descent technique from each point, hoping to find the maximum.
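For concreteness, a minimal sketch of this sampling-plus-local-optimization idea (SciPy's `gaussian_kde` and `scipy.optimize.minimize` are just illustrative choices here, and the data are toy data):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(size=(2, 1000))       # d x n array of observations (toy data)
kde = gaussian_kde(data)

def neg_density(x):
    return -kde(x)[0]                   # minimize the negative density = maximize the density

starts = kde.resample(20)               # d x 20 starting points drawn from the KDE itself
candidates = [minimize(neg_density, s).x for s in starts.T]
mode_estimate = max(candidates, key=lambda c: kde(c)[0])
```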

Is there a better / common way to do this?

Chris
  • A one-sentence summary of my answer is: That's the mode and there are workable ways of finding it directly without the arbitrariness of choosing a kernel type or width (or a grid on which to compute density). But independently of that, I envisage a column of values and a corresponding column of density estimates; why isn't this a question of finding the row with the maximum density? There are usually built-in functions for that. (Note that if your problem is bivariate or multivariate density estimation, do please make that explicit.) – Nick Cox Jan 07 '16 at 10:58
  • My dataset is multidimensional, but I can still order the points by Euclidean distance or some other metric and apply the half-sample mode, no? – Chris Jan 11 '16 at 12:24
  • I don't know implementations of half-sample mode except for univariate densities. I can't see that the idea generalises beyond one dimension. – Nick Cox Jan 11 '16 at 13:04
  • Then I'm not sure if your answer solves my problem, I'm sorry. I didn't mention it's multidimensional as I expected there to be a general solution, independent of the number of dimensions. But anyway, I think your answer is a valuable contribution. – Chris Jan 11 '16 at 14:59
  • Without further ado: There is a recent publication called [Finding the Mode of a Kernel Density Estimate](https://arxiv.org/abs/1912.07673). (Disclaimer: I haven't read it yet, but I will.) – Make42 Feb 20 '20 at 17:47
  • @Make42 That looks like a great find. If, after reading it, you could post a short summary, that would make a fine answer to this question. – whuber Feb 21 '20 at 17:52
  • @whuber: Thank you. I will, once I read the paper. – Make42 Feb 21 '20 at 17:56
  • @whuber: I went through the paper - quite frankly, I don't get it - at least not the details. I am not sure if it is me or the paper. I could reiterate the basic ideas though - would that be helpful and worthy of an answer? I could also reiterate the abstract but I don't see the point in that. If you were to read the paper by any chance, I would be interested in a more approachable explanation. I got lost regarding the polynomial system in section 2. – Make42 Feb 22 '20 at 19:54
  • @Make42 The idea is as follows: first there are two regimes, low vs. high dimensions. In the low-dimensional regime, they use the property that the mode must be within some distance r of one of the input points. Then they do an exhaustive search around each of the input points up to this distance r on a fine mesh. For this search, they expand each kernel as a Taylor polynomial up to some degree that gives enough precision. – Sandeep Silwal Mar 01 '21 at 20:22
  • Then they can reformulate the mode-finding problem as solving a system of polynomial inequalities for each point in the fine mesh (the system has only 2 inequalities: 1) the value at x is larger than some b, and 2) x is near the mesh point; then binary search on b). Altogether, given a fine enough mesh and a large enough degree, you can prove that the best point you find will be close to the true mode. – Sandeep Silwal Mar 01 '21 at 20:22
  • Now in the high-dimensional regime, they simply apply a random projection (see https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma) to a smaller dimension and repeat the algorithm from above. There is a slight caveat about recovering a solution in the *original* space after finding it in the projected space, but it turns out a simple trick takes care of it. – Sandeep Silwal Mar 01 '21 at 20:23
  • While it does give some provable guarantees, I am fairly sure that this algorithm is mostly of theoretical interest, since the polynomial system solving cannot be easily implemented. The runtimes are also exponential, so performing many steps of the heuristic mean shift algorithm could be much faster (a sketch of the mean-shift update follows this comment thread). – Sandeep Silwal Mar 01 '21 at 20:24
  • This doesn't rule out other approaches that get some good theoretical guarantees but are also practical. I can answer further questions about this paper if necessary. – Sandeep Silwal Mar 01 '21 at 20:24
  • @SandeepSilwal: Let's see if I got this right: Step 1: They choose probing points by spanning a mesh over the search space, but only using those mesh nodes as probing points that are close to an original data point. Step 2: Instead of calculating the real density, they approximate the "sum of Gaussians" as a "sum of polynomials" (which is itself a polynomial) using Taylor expansions. Step 3: For each probing point they calculate the approximated density and then compare those with each other. Did I get this right? – Make42 Mar 06 '21 at 15:39
  • Question 1: Could they use different probing points, or is the mesh character important for the following steps? Q2: The reason why they are "fast" is that they use the Taylor expansion for calculating the "sum of Gaussians" at the probing points, instead of the "sum of Gaussians" directly. Is that right? Q3: Around which points are the Taylor expansions happening? The input points? – Make42 Mar 06 '21 at 15:39
  • @Make42 Yes your steps are correct. Q1: They are using a fine mesh around the given input points because they have a structural theorem that says roughly 'the mode must be within some distance r (which can be upper bounded) of one of the input points.' Then the mesh is chosen fine enough so that the function can only vary very little between mesh points. Q2: I'm not sure 'fast' is the right word here. Rather, you can prove things about polynomials (they use a previous result about an algo that solves polynomial inequalities provably. It's likely that no such algo exists for general functions). – Sandeep Silwal Mar 07 '21 at 22:24
  • Q3: Basically, every function evaluation at each of the mesh points uses the polynomial approximation, which you take to a sufficiently high degree to guarantee that the value of the polynomial differs from the real function only by a very small amount (smaller than epsilon*1/n <= epsilon*max value). – Sandeep Silwal Mar 07 '21 at 22:25
  • To be more precise, what's happening is that you are asking the question "does there exist a point x near a fixed mesh point p that has value larger than c?" Then you iterate over all p and c. For a fixed p and c, this question can be translated into a system of two polynomial inequalities, which reduces to a previous algorithm. Overall I would say this is more of a stepping stone, and hopefully it will be possible to get other results that have provable guarantees but are also more practical. – Sandeep Silwal Mar 07 '21 at 22:28
  • @SandeepSilwal: By Q3 I meant this: a series expansion, like the Taylor expansion, needs to be "expanded around a point $a$", as they say: https://mathworld.wolfram.com/TaylorSeries.html - I meant to ask around which point(s) they expand when evaluating the polynomial(s). – Make42 Mar 08 '21 at 12:37
  • If you are evaluating $f(x) = \exp(-||x-p||^2)$ where $p$ is one of your input points then you expand it at $p$. – Sandeep Silwal Mar 09 '21 at 01:37
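
A minimal illustration of the Gaussian-kernel mean-shift update mentioned in the comments above (my own sketch, not code from the paper; the function name and defaults are arbitrary):

```python
import numpy as np

def mean_shift_mode(data, bandwidth, start, n_iter=200, tol=1e-8):
    """Gaussian-kernel mean-shift iteration towards a local mode of a KDE.

    data: (n, d) array of observations; start: (d,) initial point.
    """
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        diff = data - x                                     # (n, d) displacements
        w = np.exp(-0.5 * np.sum(diff**2, axis=1) / bandwidth**2)
        x_new = w @ data / w.sum()                          # kernel-weighted mean of the sample
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

Running it from many starting points (for example, from each data point) and keeping the result with the highest estimated density gives a practical, though heuristic, approximation to the global mode.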

2 Answers


EDIT: This addresses the original question and is retained because it is relevant to that and also to the title of the question, which refers to the peak density. It isn't pertinent to the revised question text, which refers explicitly to the multivariate case.

As the peak of the (estimated) density corresponds to the (main) mode, an alternative is to estimate that mode directly.

A "half-sample mode" can be estimated using recursive selection of the half-sample with the shortest length. Although the idea has longer roots, note particularly the detailed and thorough discussion of Bickel and Frühwirth (2006).

First some more general comments: The mode is often disparaged or neglected by comparison with its siblings the mean and median, but it can be of distinct interest or even use, especially whenever distributions are unimodal but asymmetric. (Modes also have a long history, as readers of Thucydides will recall.)

If a variable is categorical or counted, the mode can usually be read off a frequency table, subject to the occurrence of ties. The same approach can be applied to any variable, subject to the resolution of measurement.

However, the main question is how to get at an estimate of the mode when a variable is measured with such fine resolution that counting is not a reliable method, especially if all or almost all measurements are distinct. Many people will have been brought up to look at a histogram and read off an approximate value, and may have the impression that not much more can or should be done. Looking at a graph is always a good idea to put any estimate of the mode in context. A more modern way is to compute a kernel estimate of the density, and modes have indeed been estimated that way. Either approach suffers from some arbitrariness, for example over bin origin and width or kernel type and width. This shouldn't usually matter, but sometimes a direct method would be useful.

Less obvious than looking for a peak in density, but still worth a try, is to look for a shoulder on a quantile plot.

Kernel estimation is an excellent method, especially when bimodality or multimodality is a possibility. The suggestion, however, is that the half-sample mode may be regarded as an independent method of assessing modality.

An idea of estimating the mode as the midpoint of the shortest interval that contains a fixed number of observations goes back at least to Dalenius (1965). See also Robertson and Cryer (1974), Bickel (2002) and Bickel and Frühwirth (2006) on other estimators of the mode.

The order statistics of a sample of $n$ values of $x$ are defined by

$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n-1)} \le x_{(n)}$.

The half-sample mode can be defined using two rules.

Rule 1 If $n = 1$, the half-sample mode is $x_{(1)}$. If $n = 2$, the half-sample mode is $(x_{(1)} + x_{(2)}) / 2$. If $n = 3$, the half-sample mode is $(x_{(1)} + x_{(2)}) / 2$ if $x_{(1)}$ and $x_{(2)}$ are closer than $x_{(2)}$ and $x_{(3)}$, $(x_{(2)} + x_{(3)}) / 2$ if the opposite is true, and $x_{(2)}$ otherwise.

Rule 2 If $n \ge 4$, we apply recursive selection until left with $3$ or fewer values. First let $h_1 = \lfloor n / 2 \rfloor$. The shortest half of the data, from rank $k$ to rank $k + h_1$, is identified by choosing $k$ to minimise

$x_{(k + h_1)} - x_{(k)}$

over $k = 1, \cdots, n - h_1$. Then the shortest half of those $h_1 + 1$ values is identified using $h_2 = \lfloor h_1 / 2 \rfloor$, and so on. To finish, use Rule 1.
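
For reference, a minimal Python sketch of this recursion (the function below is my own illustration, not the hsmode code):

```python
import numpy as np

def half_sample_mode(x):
    """Half-sample mode of a univariate sample, following Rules 1 and 2 above."""
    x = np.sort(np.asarray(x, dtype=float))
    h = len(x) // 2                          # h_1 = floor(n / 2)
    # Rule 2: recursively keep the shortest half until 3 or fewer values remain.
    while len(x) >= 4:
        widths = x[h:] - x[:len(x) - h]      # x_(k + h) - x_(k) over all admissible k
        k = int(np.argmin(widths))           # ties resolved by taking the first minimum
        x = x[k:k + h + 1]                   # the shortest half: h + 1 consecutive values
        h //= 2                              # h_(j+1) = floor(h_j / 2)
    # Rule 1
    if len(x) == 1:
        return x[0]
    if len(x) == 2:
        return 0.5 * (x[0] + x[1])
    if x[1] - x[0] < x[2] - x[1]:
        return 0.5 * (x[0] + x[1])
    if x[1] - x[0] > x[2] - x[1]:
        return 0.5 * (x[1] + x[2])
    return x[1]
```

On a unimodal but skewed sample this homes in on the peak without any choice of kernel type, kernel width, or bin width.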

Bickel, D.R. 2002. Robust estimators of the mode and skewness of continuous data. Computational Statistics & Data Analysis 39: 153-163.

Bickel, D.R. and R. Frühwirth. 2006. On a fast, robust estimator of the mode: comparisons to other estimators with applications. Computational Statistics & Data Analysis 50: 3500-3530.

Dalenius, T. 1965. The mode - A neglected statistical parameter. Journal, Royal Statistical Society A 128: 110-117.

Robertson, T. and J.D. Cryer. 1974. An iterative procedure for estimating the mode. Journal, American Statistical Association 69: 1012-1016.

Note: a Stata implementation is named `hsmode` (SSC), for which see here. The notes here are based on the help for that program, which includes further discussion and references. Stata users can install it using `ssc install hsmode`. For other implementations see this webpage.

Nick Cox
  • I'd forgotten when I wrote this that I had already written this answer to a different question: http://stats.stackexchange.com/questions/176112/how-to-find-the-mode-of-a-probability-density-function/176144#176144 – Nick Cox Jan 07 '16 at 11:32
  • I'll preliminarily accept your answer until I've tested it, but it seems to be solid. A question I have anyway: this technique (half-sample mode) kind of makes my KDE obsolete, as it works only on the dataset itself, no? Unless I add another gradient descent on top to get a more precise estimate. – Chris Jan 11 '16 at 12:23
  • Nothing here makes kernel density estimation obsolete, or even redundant. You always need a check on what the distribution looks like. – Nick Cox Jan 11 '16 at 13:00

A kernel density estimator is defined as $$f(x;h) = \frac{1}{nh}\sum_{i=1}^{n}{K\!\left(\frac{x-x_i}{h}\right)}$$ where the kernel $K$ satisfies $$\forall x: K(x) \ge 0, \qquad K(-x) = K(x),$$ $$\int_{-\infty}^{\infty}K(x)\,dx = 1, \qquad \int_{-\infty}^{\infty}x\,K(x)\,dx = 0.$$

The problem you posed is to find the maximum of the function $f$. The generality of $K$ makes the maximum hard to find analytically, so, as you mentioned, gradient descent (on $-f$) is most likely the way to go. Luckily the expression for the gradient is simple: $$\frac{\partial f(x;h)}{\partial x} = \frac{1}{nh^2}\sum_{i=1}^{n}{K'\!\left(\frac{x-x_i}{h}\right)}.$$

However, the optimization problem is not convex, so gradient descent will not converge to the same point from different starting points. You have a couple of options here. One is what you stated: sample a set of points, run the optimization from each until convergence, and then select the maximum. But this seems computationally expensive, especially in a multidimensional case.

Another approach, which is less computationally expensive and is an approximation to the algorithm you stated above, is to select a single starting point that you believe is close to the maximum and run gradient descent from there.

From the definition of $K$ above, the maximum of $K$ (for a symmetric, unimodal kernel) is at $0$, so each term of $f$ is maximized at $x = x_i$. So intuitively a "good enough" starting point should be a point that minimizes the distances from all the data-set points, and the point minimizing the sum of squared distances is simply the mean, or expected value, of your samples.

So, to reiterate my approach for a computationally inexpensive approximation to the maximum: first compute the mean of your dataset, set that as the starting point of your gradient descent algorithm, and run the optimization until it converges.
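
A minimal sketch of this recipe (again, SciPy's `gaussian_kde` and `minimize` are illustrative choices rather than part of the answer):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.normal(size=(2, 500))        # d x n sample (toy data)
kde = gaussian_kde(data)

start = data.mean(axis=1)               # expected value of the sample as the single start
result = minimize(lambda x: -kde(x)[0], start)   # descend on the negative density
approx_mode = result.x
```

As the comment below points out, the sample mean is only a safe starting point when the estimated density has a single mode.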

Armen Aghajanyan
  • "So intuitively a "good enough" starting point should be a point that minimizes the distances from all the data-set points." That is not the case, except for simple distributions which have only one mode or too large bandwidth parameter. – Make42 Feb 20 '20 at 17:45