
Consider the situation in which we are estimating a $d$-dimensional density (under suitable regularity conditions) using kernel density estimation. Two procedures come to mind:

[Method 1, conditional density estimation] We can perform $d$ one-dimensional density estimations sequentially, i.e. estimate $p(X_1), p(X_2\mid X_1), \ldots, p(X_d\mid X_{d-1},\ldots,X_1)$. At each step we can choose a one-dimensional optimal bandwidth.

[Method 2, multivariate density estimation] Alternatively, we can estimate $p(X_1,\ldots,X_d)$ directly as a $d$-dimensional density estimation problem, choosing bandwidths for all $d$ coordinates simultaneously (not necessarily the same bandwidth in each coordinate).

I have been told that, in kernel density estimation, the optimal bandwidth at each step of Method 1 will generally differ from the optimal bandwidth in the corresponding coordinate when all bandwidths are chosen simultaneously in the $d$-dimensional problem of Method 2 (say via likelihood cross-validation).
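To make the comparison concrete, here is a minimal sketch of the two procedures on simulated data, assuming statsmodels' `KDEMultivariate` and `KDEMultivariateConditional` with likelihood cross-validation (`bw='cv_ml'`); the two-dimensional data-generating process is made up purely for illustration, and the point is only to inspect whether the cross-validated bandwidths from the two procedures coincide.

```python
import numpy as np
from statsmodels.nonparametric.kernel_density import (
    KDEMultivariate, KDEMultivariateConditional)

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)  # correlated second coordinate

# Method 2: joint 2-d estimate, both bandwidths chosen simultaneously
# by likelihood cross-validation ('cv_ml').
joint = KDEMultivariate(data=np.column_stack([x1, x2]),
                        var_type='cc', bw='cv_ml')
print("joint bandwidths (x1, x2):", joint.bw)

# Method 1: sequential estimates p(x1), then p(x2 | x1), each with its
# own cross-validated bandwidth(s).
marg = KDEMultivariate(data=x1, var_type='c', bw='cv_ml')
cond = KDEMultivariateConditional(endog=x2, exog=x1,
                                  dep_type='c', indep_type='c', bw='cv_ml')
print("marginal bandwidth for x1:", marg.bw)
print("conditional bandwidths (x2 given x1):", cond.bw)
```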

(1) Can anyone point me to literature addressing this problem? I found [Silverman] not very useful for this question. My primary guess is that it has something to do with robustness, but I could be wrong.

(2) What is the major obstacle to generalizing univariate density estimation methods to higher dimensions, apart from the sparsity of the data?

(3) What is the widely adopted/popular method of optimal bandwidth selection in a multivariate density estimation problem?

The second part of this post concerns the comparison between parametric methods (kernel density estimation) and nonparametric methods of density estimation (for example, the Bayesian histogram).

It is generally difficult to estimate the correct dimension of the density [1], but if we already know the dimension $d$ of the data, then we can use established methods to carry out the density estimation (a Dirichlet or Gaussian process, for instance). If our only concern is the precision (say MISE) of the estimate of the density, then:

(4) If we turn to Bayesian nonparametric methods of density estimation instead of kernel density estimation (say, the Bayesian histogram), which one will perform better, in settings with or without sparsity? Is there existing literature that addresses this problem with simulation studies?
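To illustrate the kind of simulation study I have in mind for (4), here is a minimal sketch, assuming scipy's `gaussian_kde` and scikit-learn's `BayesianGaussianMixture` with a Dirichlet-process weight prior as a stand-in Bayesian nonparametric estimator (not the Bayesian histogram itself); the true density and all settings are made up for illustration, and averaging the integrated squared error over many replications would approximate MISE.

```python
import numpy as np
from scipy import stats
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
n = 400

# Hypothetical true density: a two-component Gaussian mixture in 1-d.
def true_pdf(x):
    return 0.6 * stats.norm.pdf(x, -1.5, 0.5) + 0.4 * stats.norm.pdf(x, 1.0, 1.0)

comp = rng.random(n) < 0.6
x = np.where(comp, rng.normal(-1.5, 0.5, n), rng.normal(1.0, 1.0, n))

grid = np.linspace(-5.0, 5.0, 1001)

# Kernel density estimate (Scott's rule bandwidth by default).
kde_pdf = stats.gaussian_kde(x)(grid)

# Truncated Dirichlet-process Gaussian mixture as a stand-in
# Bayesian nonparametric density estimator.
dpgm = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type='dirichlet_process',
    max_iter=500, random_state=0).fit(x.reshape(-1, 1))
dp_pdf = np.exp(dpgm.score_samples(grid.reshape(-1, 1)))

# Approximate integrated squared error on the grid (a crude single-replication
# stand-in for MISE).
dx = grid[1] - grid[0]
ise_kde = np.sum((kde_pdf - true_pdf(grid)) ** 2) * dx
ise_dp = np.sum((dp_pdf - true_pdf(grid)) ** 2) * dx
print(f"ISE  KDE: {ise_kde:.5f}   DP mixture: {ise_dp:.5f}")
```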

References

[Silverman] Silverman, Bernard W. Density Estimation for Statistics and Data Analysis. Vol. 26. CRC Press, 1986.

[1] Is there an accepted method to determine an approximate dimension for manifold learning?

I am not an expert in density estimation, so any references and clarifications would be of great help and much appreciated. This is mainly a reference request as well as a request for an overview.

Henry.L
  • (1) I'm not sure I understand the question. You are computing "optimal" bandwidths w.r.t. different criteria. If you're using different methods, it's only logical to assume they will produce different results. – Adrien Mar 15 '17 at 10:14
  • (2) I would say the fact that true optimal kernels and bandwidths are not products of 1D kernels and 1D bandwidths, but full d-dimensional kernels and bandwidths. You can of course restrict yourself to product kernels, but it comes at a price. – Adrien Mar 15 '17 at 10:17
  • (4) Kernel density estimation is totally nonparametric; why are you saying it's a parametric method? – Adrien Mar 15 '17 at 10:20
  • @Adrien (1, 2) I am asking about the same kernel with different procedures (the optimal bandwidth selection varies across dimensions, so if we choose the bandwidths stepwise, the resulting bandwidths may or may not be the same as the optimal bandwidths we choose simultaneously for every dimension). It is not related to whether or not the kernel is a product kernel; – Henry.L Mar 15 '17 at 12:10
  • @Adrien (4) Personally, I treat every procedure with an explicit basis as parametric (you can also call it nonparametric, but what I have in mind for nonparametric is something like Bayesian nonparametrics). Here the basis consists of kernels, and the dimension of the space may or may not be infinite. – Henry.L Mar 15 '17 at 12:12
  • (4) I don't know what you call a basis in kernel density estimation, but in any case that is not the definition of nonparametric. If you don't believe me, check the first sentence of the Wikipedia article: https://en.wikipedia.org/wiki/Kernel_density_estimation – Adrien Mar 15 '17 at 13:01
  • @Adrien I did not mean that KDE is NOT nonparametric; I just want to distinguish KDE from Dirichlet/Gaussian process methods. It is well accepted that kernel smoothing is more like polynomial smoothing than the more recent *Bayesian* nonparametric methods such as Dirichlet/Gaussian processes. I really do not see what your point is here... – Henry.L Mar 15 '17 at 13:47
  • There is really NO definition of nonparametric from my perspective; in the end you must reduce to parametric cases in order to make any meaningful inference. The only difference is that nonparametric methods do NOT impose restrictions on the number of parameters and are thus more flexible. Considering the "classical nonparametric tests" (say, Mann-Whitney) may make it easier for you to see my point here. But thanks for the clarification. – Henry.L Mar 15 '17 at 13:52
  • Can't agree with that; the sample mean is a nonparametric estimator of the population mean, and there's only one thing being estimated. "Nonparametric", writing loosely, implies you aren't assuming that you know the true functional form, in this case the true distribution, so you are fitting a more general function whose parameterization allows you to "get close to" a broader region of function space than you could if you just assumed, e.g., that the data followed a MV Gaussian. It does not imply "no parameters" or "infinite parameters". – jbowman Mar 15 '17 at 14:02
  • @jbowman You are just using a basis consisting of kernels with different parameters to approximate a functional form. I think all inference is done on the space of measures on the sample space. By assuming an exact functional form you just assume a better loss function which does not penalize a specific parametric family... If you really want, you can also call regression models semi-parametric to impress your friends. I do not think this thread is relevant to the OP; we can start another post if you want to discuss it, thanks. – Henry.L Mar 15 '17 at 14:08

0 Answers