3

I saw this question (link) but when I read it, I see that it has a fixed "N" so I thought it was asking about for a finite sample size. When I read the answer that it was suggested to be a duplicate of (link) the answer is analytic, and in terms of the limit of infinite samples.

It was recommended that I ask a general question, for which I think the answer that I supplied may be useful.

Question: How do I find and do some basic comparisons on the "sampling distribution of the minimum" as a function of sample size and distribution? Is there a good, entry-level accessible approach?

EngrStudent
  • 8,232
  • 2
  • 29
  • 82
  • 1
    Theory provides a simple and complete answer: when sampling from a distribution $F$, the distribution of the minimum of $n$ independent variates is $1-(1-F)^n$. Because your question is so broad and unspecific, it seems impossible to refine that theoretical answer any further. – whuber Oct 03 '16 at 15:50

1 Answers1

1

I find there is a difference between "in the limit" and "in the real world". Part of the "in the real world" is that there is variation around every estimate.

"In general" does not necessarily mean not "in the real world". My point is that each type of distribution and each sampling approach is going to give results that are different from the "in the limit" approach. For more entry-level folks looking for the "how do I do it myself" and "what does it look like applied to my case" this might give a way forward.

The analytic expression is useful in the presence of many samples. It can be less than informative in the presence of fewer samples.

The following code does this:

  1. It draws samples from the same distribution, but it draws two different sample sizes.
  2. It computes the summary metric, in this case the minimum.
  3. It iterates (repeats) this many times
  4. We then look at how the summary varies by sample size.

Here is the code:

library(timeSeries) #for col stats
library(vioplot)    #for violin plot

#parameters
set.seed(220)  #for reproducibility

N <- 2000 #number of runs

n1 <- 5   #n samples in first set
n2 <- 500 #n samples in 2nd set

##initialize data
mystore <- as.data.frame(matrix(0,nrow = N, ncol=2))
names(mystore) <- c("fewSamples","manySamples")

##iterate
for (i in 1:N){

     ## different ways to work distributions
     # (note) what you do to y1 you must also do to y2 to be consistent
     #
     # rnorm(n1) #standard
     # rnorm(n1,mu=2,sd=2) #change mean and standard dev
     # rexp(n1) #exponential distribution with defaults
     # rexp(n1,rate=2) #exp dist with particular rate

     #draw first set of samples (few)
     y1 <- rnorm(n1)

     #draw 
     y2 <- rnorm(n2)

     #find minimum values
     mystore[i,1] <- min(y1)
     mystore[i,2] <- min(y2)
}



##plot the results

# scatterplot with violins
plot(mystore,xlim=c(-5,1),ylim=c(-6,-2))
vioplot(mystore[,1], col="tomato", horizontal=TRUE, at=-5.5, add=TRUE,lty=2, rectCol="gray")
vioplot(mystore[,2], col="cyan", horizontal=FALSE, at=-4.5, add=TRUE,lty=2, rectCol="gray")

#violin plots
vioplot(mystore[,1],mystore[,2],
        names = names(mystore),
        col=c("Green","Blue"))
grid()

#print the results
colMeans(mystore)   #central tendency
colSds(mystore)     #tendency of variation

#if they were equal this would be zero centered, and unit variances would add
summary(mystore[,2]-mystore[,1])

My text results were:

> colMeans(mystore)   #central tendency
 fewSamples manySamples 
  -1.166429   -3.039060 

> colSds(mystore)     #tendency of variation
 fewSamples manySamples 
  0.6445135   0.3698300 

> #if they were equal this would be zero centered, and unit variances would add
> summary(mystore[,2]-mystore[,1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.4190 -2.3730 -1.8700 -1.8730 -1.3910  0.7731 

The mean, what you would expect if you took the minimum of 5 samples, is much smaller than the minimum you would expect if you took 500 samples. It is less than half of the value. The variation is much larger for the few samples, than for the many. Someone might say "this should be obvious" or "this should be intuitive" but if you haven't cultivated the intuition yet, then it isn't. If you haven't cultivated it yet, then experiments like this are ways to train it.

This first plot shows the raw points in scatterplot, and has overlaid violin plots indicating that central tendency and tendency of variation differ substantially given sample size. (n1,n2)

enter image description here

The second puts the violins side by side. The interior boxplots and violins also give a slight sense that skew is different for the distributions.

enter image description here

To apply this approach other distributions, or distributions with other parameters substitute the lines where the "draw" occurs.

UPDATE:
(Im working slowly on moving this to a new question, per request by a super-user.)

I think @Andy_W had an excellent point about the black swans. Here are links to Naseem Taleb, the book on Amazon, and the Wikipedia. The weaker version is a "grey swan" and the crazy fast-zombie big brother is the "dragon king" and are discussed in "silent risk".

EngrStudent
  • 8,232
  • 2
  • 29
  • 82