
I have an experiment that shoots particles at a wall; some parts of the wall are hit with higher probability than others. I can record where the particles land. I need to know the underlying probability density function (PDF); more precisely, I need to know the peaks of the PDF. The typical sort of data I am dealing with is given at the bottom of this post.

I have read that kernel density estimation (KDE) is the best way to do this. However, I am struggling to find a Python package that meets all my requirements. You can see from the attached graphs that the data can be highly skewed, i.e. the x-scale can be very different from the y-scale. Therefore, I thought it was important to use a class D kernel (one with a different bandwidth in each direction). statsmodels' KDEMultivariate is the only public Python routine I am aware of that uses a class D kernel. However, it doesn't provide an option to weight the data, and I would like to weight the data by the energy of the particles.
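For reference, here is a minimal sketch of how KDEMultivariate takes a separate bandwidth per dimension (the data and bandwidth values below are made up); it also shows the missing piece, since neither fitting nor evaluation takes weights:

```python
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate

# Hypothetical impact positions with very different scales in x and y.
rng = np.random.default_rng(0)
xy = np.column_stack([rng.normal(12.0, 3.0, 1000),
                      rng.normal(0.0, 0.2, 1000)])

kde = KDEMultivariate(data=xy,
                      var_type='cc',   # both coordinates are continuous
                      bw=[0.5, 0.05])  # one bandwidth per dimension (class D)
density = kde.pdf(xy)                  # density evaluated at the sample points
```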

Do you know of any good routines for my problem? Do you have any advice to help me estimate the probability density function of the data below as accurately as possible? I am most familiar with Python.

[Figures: density plots of particle impact positions for Experiment 1 and Experiment 2]

Edit - Here are my responses to some of the comments:

  • I need to weight by energy to calculate the energy flux density. Here we are looking at approximately $10^5$ particles. I will then use this to infer the flux density when $10^{20}$ particles are fired, which will let me work out whether the walls can handle the load. Hence, I only need to know the peak of the probability density function. (A weighted-KDE sketch follows this list.)

  • The class D kernel is explained in [this Wikipedia article](https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation). It basically means the bandwidth in the x-direction is different from the bandwidth in the y-direction.

  • I need the PDF to be as accurate as possible, especially around the peak. If you think a simple Gaussian kernel will be fine then I guess I will do that.
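As a concrete starting point, here is a minimal sketch of an energy-weighted, FFT-based KDE using [KDEpy's FFTKDE](https://kdepy.readthedocs.io/en/latest/API.html#KDEpy.FFTKDE.FFTKDE). The positions, energies, and bandwidth are placeholders; FFTKDE only takes a scalar bandwidth in 2D, so each axis is rescaled to unit standard deviation first, which amounts to a diagonal (class D) bandwidth matrix:

```python
import numpy as np
from KDEpy import FFTKDE

# Placeholder impact positions and particle energies.
rng = np.random.default_rng(0)
pos = rng.normal([12.0, 0.0], [3.0, 0.2], size=(100_000, 2))
energy = rng.exponential(1.0, size=100_000)

# Rescale each axis to unit standard deviation so the single scalar
# bandwidth acts like a diagonal (class D) bandwidth matrix.
scale = pos.std(axis=0)

# Energy-weighted KDE evaluated on an equidistant grid (bw=0.1 is a placeholder).
kde = FFTKDE(kernel='gaussian', bw=0.1).fit(pos / scale, weights=energy)
grid, density = kde.evaluate(256)   # 256 grid points per dimension

# The grid point with the largest density estimates the peak of the
# energy flux density; map it back to the original units.
peak = grid[np.argmax(density)] * scale
print("estimated peak:", peak)
```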

  • Why do you want to weight by energy? It seems the sample size is huge ... I don't know what a *D-class kernel* is, maybe explain? – kjetil b halvorsen Nov 04 '21 at 11:39
  • These data are not "highly skewed" in any important sense. Any decent KDE function (that is, one based on the FFT) will handle these data just fine. Your reference to "D-class kernels" refers to *multivariate* data, but aren't you just interested in the univariate position along the wall? Or are you viewing this as a bivariate problem? Regardless, a simple Gaussian kernel will do just fine. – whuber Nov 04 '21 at 11:58
  • Thanks for the responses. I need to weight by energy to calculate the energy flux density. The class D kernel is explained in [this Wikipedia article](https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation). It basically means the bandwidth in the x-direction is different from the bandwidth in the y-direction. This is a bivariate problem. You can see at about x=12 in Experiment 1 that the density peaks at about y=0 and then sharply goes to zero. I need the PDF to be as accurate as possible, especially around the peak. But if you think a simple Gaussian kernel will be fine then I guess I will do that. – Peanutlex Nov 04 '21 at 12:29
  • Please do not give new information only in comments; edit your question to add the new information. We want posts to be self-contained, comments can be deleted, and in any case information in comments is not well organized. Also, many people do not read comments. – kjetil b halvorsen Nov 04 '21 at 14:16
  • (1) You get different bandwidths in the two dimensions simply by rescaling one of the dimensions relative to the other. (2) Having done that, bin the data into a square grid, weighting by energy. Use the FFT to convolve this with any kernel shape you like: the result is your density estimate (on a grid). (3) Identify the points where the maximum *as well as any values close to the maximum* are achieved. (The latter is to make sure a little noise doesn't totally ruin the answer.) (4) If necessary, repeat in high-resolution windows around those near-maxima. – whuber Nov 04 '21 at 17:27 *(a sketch of this recipe follows the comments)*
  • Thank you very much. Do you mind explaining what you mean by "use the FFT to convolve this"? Do you mean to use an FFTKDE such as [this one](https://kdepy.readthedocs.io/en/latest/API.html#KDEpy.FFTKDE.FFTKDE)? – Peanutlex Nov 04 '21 at 17:37
  • My plan is to use Scott's rule to determine the bandwidth. Do you think that will work okay with this distribution? What would you recommend I use? – Peanutlex Nov 05 '21 at 11:30
  • There's not enough information here to advise you on bandwidth selection, but I would suppose you would benefit from using knowledge about the experiment rather than accepting some general default from a statistical package. At https://stats.stackexchange.com/a/428083/919 I describe a disciplined approach for using density estimates to find peaks in univariate data--it can be applied in higher dimensions, too. – whuber Nov 05 '21 at 14:11
  • Thank you very much. It took me a while to understand the other post (especially since I don't know R), but it looks like the perfect solution to my problem! – Peanutlex Nov 05 '21 at 16:46
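For completeness, here is a minimal numpy/scipy sketch of the rescale/bin/convolve recipe from whuber's comment above. The grid size, kernel width, and the 1% tolerance are placeholders to be tuned against the real experiment:

```python
import numpy as np
from scipy.signal import fftconvolve

# Placeholder impact positions and particle energies.
rng = np.random.default_rng(0)
x = rng.normal(12.0, 3.0, 100_000)
y = rng.normal(0.0, 0.2, 100_000)
energy = rng.exponential(1.0, 100_000)

# (1) Rescale so both axes are comparable; this is equivalent to using
# different bandwidths in x and y.
xs, ys = x / x.std(), y / y.std()

# (2) Bin into a square grid, weighting each particle by its energy.
n = 512
H, xedges, yedges = np.histogram2d(xs, ys, bins=n, weights=energy)

# (3) Convolve the weighted histogram with a Gaussian kernel via the FFT.
bw = 5.0                               # kernel width in grid cells (placeholder)
r = np.arange(-4 * bw, 4 * bw + 1)
g = np.exp(-0.5 * (r / bw) ** 2)
kernel = np.outer(g, g)
kernel /= kernel.sum()
density = fftconvolve(H, kernel, mode='same')

# (4) Keep the maximum and everything within a small tolerance of it,
# so a single noisy cell cannot decide the answer on its own.
ix, iy = np.where(density >= 0.99 * density.max())
xc = 0.5 * (xedges[:-1] + xedges[1:])  # bin centres (rescaled units)
yc = 0.5 * (yedges[:-1] + yedges[1:])
peaks = np.column_stack([xc[ix] * x.std(), yc[iy] * y.std()])
print("near-maximal grid points (original units):\n", peaks)
```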

0 Answers