
I have a vector of 10,000 observations and I need to estimate the pdf at each point of the vector. The code I have is the following (MATLAB):

n = 1000; %Sample size
Y = normrnd(0,1,n,1); %Data
h = 1.06*std(Y)*n^(-1/5); %Bandwidth
pair = (Y - Y')/h; %Argument of Kernel
pair_kernel = normpdf(pair); %calculate kernel on all the matrix
f_hat = mean(pair_kernel,2)/h; %Kernel estimate of pdf

The running time increases considerably when I go from n = 1000 to n = 10000. This is an issue because I have to repeat the estimation many times in a Monte Carlo study. Is there a more efficient way to deal with the large matrices?

kjetil b halvorsen
    Is there a reason why just estimating the pdf on the same set of equally spaced points won't give you a good approximation of the pdf? – JimB Jun 08 '21 at 06:25

1 Answer

  • If you want to use the kernel density estimator, then yes: the main bottleneck is evaluating the kernel over all pairs of points, and there is not much you can do about that directly. In practice, people speed it up by not computing the estimator exactly, but approximating it instead.

    R's density function bins the data onto an equally spaced grid and computes the estimate with the FFT, so the cost depends on the grid size rather than on all pairs of points. Scikit-learn takes a different approach: it stores the training data in a tree structure (ball trees and KD-trees), so that instead of looking at all the data, you evaluate only the neighborhood of the query point $x$.

    This trades some quality of the estimate for computation time; a MATLAB sketch of the grid-evaluation idea is given after this list.

  • I'm not sure what you are trying to do here, but if you want to sample from the kernel density, there are algorithms for that which do not require evaluating the probability density function at all (see the second sketch after this list).

  • As noted in the comment by JimB, if the only thing you want is to estimate the distribution, that is exactly what the kernel density estimator does; you do not need to sample anything to achieve that.
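
As an illustration of the grid idea, here is a minimal MATLAB sketch (my own, not R's or scikit-learn's actual implementation, and without the FFT step): evaluate the estimator on an equally spaced grid of m points with m much smaller than n, then interpolate back to the observations. The grid size m = 512 and the use of interp1 are arbitrary choices for illustration.

n = 10000;                             %Sample size
Y = normrnd(0,1,n,1);                  %Data
h = 1.06*std(Y)*n^(-1/5);              %Bandwidth

m = 512;                               %Number of grid points (my choice)
xg = linspace(min(Y)-3*h, max(Y)+3*h, m)'; %Equally spaced evaluation grid

%m-by-n kernel matrix instead of n-by-n
f_grid = mean(normpdf((xg - Y')/h),2)/h;

%Interpolate the grid estimate back to the original observations
f_hat = interp1(xg, f_grid, Y);

This replaces the n-by-n kernel matrix with an m-by-n one, and since the Gaussian-kernel estimate is smooth, a few hundred grid points are usually plenty. MATLAB's ksdensity (in the same toolbox as normrnd and normpdf) also evaluates on a grid of points by default, so it may already be fast enough for your case.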
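
And for the sampling point, a minimal sketch assuming the Gaussian kernel from the question: the kernel density estimate is an equal-weight mixture of normals centered at the observations, so you can draw from it by resampling the data with replacement and adding kernel-scaled noise, with no pdf evaluation at all. The number of draws is arbitrary.

n_new = 5000;                        %Number of draws (my choice)
idx = randi(n, n_new, 1);            %Pick observations uniformly at random
Y_new = Y(idx) + h*randn(n_new, 1);  %Add N(0, h^2) kernel noise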

Tim