I have a number of samples that I would like to test for normality. One of the samples exceeds 5,000 data points, the maximum sample size that shapiro.test() accepts. This is the data:
# simulate three clusters of log-normally distributed values of different sizes
c1 <- exp(rnorm(505))
c2 <- exp(rnorm(550))
c3 <- exp(rnorm(5500))

# set up data for the test: combine the values with a matching cluster label
cluster.data    <- c(c1, c2, c3)
cluster.factors <- c(rep("Cluster_1", length(c1)),
                     rep("Cluster_2", length(c2)),
                     rep("Cluster_3", length(c3)))
cluster.df <- data.frame(cluster.data, cluster.factors)
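Calling shapiro.test() directly on the largest cluster is what triggers the restriction (shown here for reference; the exact error wording may depend on the R version):

# c3 has 5,500 values, which is above shapiro.test()'s upper bound of 5,000:
shapiro.test(c3)
# Error in shapiro.test(c3) : sample size must be between 3 and 5000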
To circumvent the 5,000 restriction, would it be statistically acceptable to run the test on smallish subsamples of the data only? Here, for example, I draw a subsample of size 500 from each of the three clusters:
tapply(cluster.df[,1], cluster.df[,2], function(x) shapiro.test(sample(x, 500)))
And the test returns significant results for all three:
$Cluster_1

Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.59561, p-value < 2.2e-16

$Cluster_2

Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.57891, p-value < 2.2e-16

$Cluster_3

Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.67686, p-value < 2.2e-16
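Would it change anything if I repeated the subsampling instead of relying on a single draw? Something like the sketch below is what I have in mind (just an illustration of the idea, not a procedure I have seen recommended anywhere):

# draw 20 independent subsamples of size 500 per cluster and collect the
# Shapiro-Wilk p-values, to see how stable the result is across draws
set.seed(1)
p.mat <- replicate(20, tapply(cluster.df[,1], cluster.df[,2],
                              function(x) shapiro.test(sample(x, 500))$p.value))
apply(p.mat, 1, summary)  # spread of p-values for each cluster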