I am designing a test plan to find factors that correlate with pavement failure. Failure is a rare occurrence. How would I go about designing the test?
I have N factors, each with several levels, whose effects I’d like to estimate. This is an observational study. I can’t apply factor treatments in a laboratory, but I can select roadway locations based on their factor levels.
I see these as my options…
- Option 1: Use a balanced matrix of the N factors and draw random samples from each of those subpopulations. Since failure is rare, I will probably end up with very few failed roadways in the sample, say 5% or less of the total.
- Option 2: First identify roadways that have failed, then take 50% of my samples from the failed population and the other 50% at random from non-failed roadways. I think this approach is better, but then I have little or no control over the factor levels, and I won’t have a “balanced” data set when I run the statistics. Would I run a typical MANOVA on the data?
- Option 3: A mixture of both, somehow. For example, dedicate a minimal sample size to certain factor levels/combinations (ones presumed to be less influential), then dedicate the rest of the samples to the 50%-50% split of failed and non-failed roadways.
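To make Options 1 and 2 concrete, here is a rough Python sketch of how the two sampling schemes might be drawn from a roadway inventory. Everything in it is an assumption for illustration only: the two factors (`traffic`, `soil`), their levels, the 5% failure rate, the sample sizes, and the function names are all made up, and failure is generated independently of the factors, so it only shows what each sample would look like, not any real effect.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Hypothetical factors and levels (placeholders, not real inventory fields).
TRAFFIC = ["low", "high"]
SOIL = ["clay", "sand", "gravel"]

def make_population(n=20000, failure_rate=0.05):
    """Simulated inventory; failure is rare and unrelated to the factors."""
    return [
        {
            "traffic": random.choice(TRAFFIC),
            "soil": random.choice(SOIL),
            "failed": random.random() < failure_rate,
        }
        for _ in range(n)
    ]

def balanced_sample(population, per_cell):
    """Option 1: equal-size random sample from every factor-level combination."""
    cells = defaultdict(list)
    for seg in population:
        cells[(seg["traffic"], seg["soil"])].append(seg)
    sample = []
    for members in cells.values():
        sample.extend(random.sample(members, per_cell))
    return sample

def failure_split_sample(population, total):
    """Option 2: half failed, half non-failed, with no control over factor levels."""
    failed = [s for s in population if s["failed"]]
    not_failed = [s for s in population if not s["failed"]]
    half = total // 2
    return random.sample(failed, half) + random.sample(not_failed, half)

def cell_counts(sample):
    """Number of sampled roadways in each factor-level combination."""
    return Counter((s["traffic"], s["soil"]) for s in sample)

population = make_population()
opt1 = balanced_sample(population, per_cell=50)      # 6 cells x 50 = 300
opt2 = failure_split_sample(population, total=300)   # 150 failed + 150 not

for name, sample in (("Option 1", opt1), ("Option 2", opt2)):
    n_failed = sum(s["failed"] for s in sample)
    print(name, "failures:", n_failed, "of", len(sample))
    print(name, "cell counts:", dict(cell_counts(sample)))
```

In this sketch, Option 1 fixes the cell counts and leaves the number of failures to chance (around the population rate), while Option 2 fixes the number of failures and leaves the cell counts to chance. That is the trade-off I am trying to resolve.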
What am I missing?