
I am sampling from a model 10 times in series, and doing this in 50 parallel processes. I am using LHS to generate each set of ten samples, with each of the 50 parallel runs' samples generated independently. Will using LHS sampling in this situation create any bias? Is there any reason I should use random sampling instead of LHS?
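Roughly, each run does something like this (using scipy's `qmc.LatinHypercube` as a stand-in for my actual sampling routine):

```python
import numpy as np
from scipy.stats import qmc

runs = []
for j in range(50):                                  # 50 parallel processes
    X = qmc.LatinHypercube(d=2, seed=j).random(10)   # each run draws its own LHS of 10 points
    runs.append(X)                                   # in practice each run writes its points to a csv
all_points = np.vstack(runs)                         # aggregated 500 x 2 sample
```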

kilojoules
  • Could you explain more clearly what you are doing? e.g. How large is the model parameter space? What do you mean by sampling in series/parallel? What quantity are you trying to estimate? This should increase the chances of you getting an answer. – S. Catterall Aug 29 '16 at 09:53
  • Why would the parameter space matter? I'm sampling with two unknown variables. I'm estimating E(f(X)) in order to approximate E(f(X))-f(E(X)). What I mean by sampling in parallel is that I have a sampling program which saves the sampled points to a csv. I run this program several times in parallel (I conceive of each of these runs as an "in series" run), aggregating the results. I am concerned that using the LHS algorithm with the parallel sampling may skew my results. – kilojoules Aug 29 '16 at 14:38
  • So the set of 500 sampling points for $X$ will consist of 50 independent LHS samples, each LHS sample being of size 10? – S. Catterall Aug 29 '16 at 14:59
  • More like: for a sample size of 500 points, I could set my program to sample 100 points using its LHS routine, save the points to a file, and run this program 5 times in parallel. – kilojoules Aug 29 '16 at 15:20
  • I wonder if for your problem the use of so-called Latin supercube sampling would help? – user32038 Aug 25 '17 at 08:07

1 Answer


LHS sampling run 'in parallel' in this way should still lead to unbiased estimates.

In standard LHS sampling, we generate vectors $X_1$, $X_2$,...,$X_n$ (with dimension $d$ equal to the dimension of the sampled parameter space for the model), where $n$ is the desired sample size. We then form the LHS estimate for the mean of the function of interest $f$ as $\hat{\mu}_{LHS}=\frac{1}{n}\sum_{i=1}^n f(X_i)$. Each $X_i$ is distributed uniformly on the unit hypercube $[0,1)^d$ (see Theorem 10.1 in this book chapter), so it follows that $\hat{\mu}_{LHS}$ is an unbiased estimate of $\mu=\int_{[0,1)^d} f(x)\,dx$, i.e. $E(\hat{\mu}_{LHS})=\mu$.
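For concreteness, here is a minimal sketch of this estimator (the toy integrand and the use of scipy's `qmc` module are illustrative choices, not part of the question):

```python
import numpy as np
from scipy.stats import qmc

def f(x):
    # Toy integrand on [0,1)^2; its true mean is (e - 1) + 1/3.
    return np.exp(x[:, 0]) + x[:, 1] ** 2

n, d = 500, 2
X = qmc.LatinHypercube(d=d, seed=0).random(n)  # X_1, ..., X_n, each uniform on [0,1)^d
mu_hat_lhs = f(X).mean()                       # (1/n) * sum_i f(X_i)
print(mu_hat_lhs, np.e - 1 + 1/3)              # estimate vs true mean
```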

For 'parallel' LHS sampling, we would generate vectors $X_{jk}$ where $1\leq j\leq N$ and $1\leq k\leq n$, giving an aggregate sample of size $Nn$. Here, for each fixed $j$, the block $X_{j1}$,$X_{j2}$,...,$X_{jn}$ is an independent LHS sample of size $n$ (one per parallel run). If we define $\hat{\mu}=\frac{1}{Nn}\sum_{j,k}f(X_{jk})$ then, by Theorem 10.1 again, we have $E(f(X_{jk}))=\mu$ for every $j,k$, so $E(\hat{\mu})=\mu$, i.e. $\hat{\mu}$ is unbiased.
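A sketch of the pooled estimator, under the same illustrative assumptions as above:

```python
import numpy as np
from scipy.stats import qmc

def f(x):
    return np.exp(x[:, 0]) + x[:, 1] ** 2          # same toy integrand as above

N, n = 50, 10                                      # N independent LHS samples of size n
blocks = [qmc.LatinHypercube(d=2, seed=j).random(n) for j in range(N)]
mu_hat = f(np.vstack(blocks)).mean()               # (1/(N*n)) * sum_{j,k} f(X_jk)
print(mu_hat)
```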

In summary, estimates obtained via 'parallel' LHS sampling are unbiased (same as for standard LHS), so this is not a reason to use simple random sampling rather than LHS. Of course, you could just use standard LHS, i.e. generate a single LHS sample of size $Nn$. This should minimise the variance of the estimator. Parallel LHS would be expected to have a higher variance (but lower than for simple random sampling). However, an advantage of splitting into smaller subsamples is that you can increase the sample size simply by appending further subsamples of size $n$ (rather than starting again from scratch, as you would with standard LHS sampling).
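To see the variance comparison empirically, here is a small simulation sketch (the integrand, sample sizes and repetition count are illustrative choices, not part of the answer); it will typically reproduce the ordering single LHS < parallel LHS < simple random sampling:

```python
import numpy as np
from scipy.stats import qmc

def f(x):
    # Toy integrand on [0,1)^2; true mean is (e - 1) + 1/3.
    return np.exp(x[:, 0]) + x[:, 1] ** 2

N, n, reps = 50, 10, 2000                          # 50 runs of 10 points; 2000 repetitions

def single_lhs(seed):
    # One LHS design of size N*n.
    return f(qmc.LatinHypercube(d=2, seed=seed).random(N * n)).mean()

def parallel_lhs(seed):
    # N independent LHS designs of size n, pooled.
    blocks = [qmc.LatinHypercube(d=2, seed=seed * N + j).random(n) for j in range(N)]
    return f(np.vstack(blocks)).mean()

def simple_random(seed):
    # Plain Monte Carlo with N*n points.
    return f(np.random.default_rng(seed).random((N * n, 2))).mean()

for name, est in [("single LHS", single_lhs),
                  ("parallel LHS", parallel_lhs),
                  ("simple random", simple_random)]:
    vals = np.array([est(r) for r in range(reps)])
    print(f"{name:14s} mean {vals.mean():.4f}  variance {vals.var():.2e}")
```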

S. Catterall