2

I reasoned a bit on where to post this question if stats, quant or stack. I will post it here because it is merely a statistical problem.

I have a vector x of numbers in Python (you might also imagine for simplicity that is normal) I simulate its Kernel density starting from this vector:

import scikits.statsmodels.api as sm
kde = sm.nonparametric.KDEUnivariate(x)
kde.fit()

Now if I want to extract a random matrix (10000x100) from this distribution how could I do?

Should I use kde.score_sample? If yes, I cannot manage to do this extraction.

Tim
  • 108,699
  • 20
  • 212
  • 390
Thegamer23
  • 51
  • 1
  • 5

2 Answers2

2

Kernel density may be thought as a equally-weighted (with weights $\tfrac{1}{nh}$) mixture of kernels $K$ centered at datapoints $y_i$

$$ f(x) = \sum_{i=1}^n \frac{1}{nh} K\left(\frac{x - y_i}{h}\right) $$

To draw samples from univariate kernel density, the following procedure can be applied (Silverman, 1986):

Step 1. Sample $i$ uniformly with replacement from $1,\dots,n$.

Step 2. Generate $\varepsilon$ to have probability density $K$.

Step 3. Set $x = y_i + h\varepsilon$.

If samples are required to have the same variance as your data, the procedure is modified as following:

Step 3'.

$$ x = \bar y + (y_i - \bar y + h\varepsilon)/(1 + h^2 \sigma^2_K/\sigma^2_Y)^{1/2} $$

where $\sigma_K^2$ is variance of the kernel, while $\bar y$ and $\sigma_Y^2$ are variance and mean of $y$. If you are using Gaussian kernel, then to generate random draws from it you basically need only to be able to draw samples from standard normal distribution (for $\varepsilon$) and from discrete uniform distribution (for $i$'s).

The procedure can be easily extended for multivariate kernel densities, where instead of drawing single $y_i$ values, you draw whole rows of the $Y$ matrix.

To learn more you can check the documentation of kernelboot package for R (disclaimer: I'm the author) that goes into more details and provides multiple references on this subject.


Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman and Hall/CRC.

Tim
  • 108,699
  • 20
  • 212
  • 390
0

You can simply benefit of the .icdf() method that take random numbers from the interval $[0,1)$ as input. This way, you can sample as many samples as you want.

Karel Macek
  • 2,463
  • 11
  • 23