How to find a population of healthy products among many samples

Question

A measurement is performed on 100 equal products. Some of these products contain defects. I expect defects to be present on a small number of products, say a total of 10. Now I want to create a population of 30 products (or so) that do contain the statistical product-to-product variation but that do not contain defects.

My approach is as follows: I randomly select 30 products. Then I calculate the mean and standard deviation of this population thirty times, each time leaving one product out. The first value is computed from all products except the first, the second from all products except the second, and so on.

In this manner the products that deviate a lot can easily be found, as removing them from the selected population of 30 causes a large change in the mean and standard deviation, see the figure below. Such a product can be replaced by another, after which the process is rerun.
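
A minimal Python sketch of the leave-one-out calculation described above, using only the standard library. The measurements are made up for illustration, and picking the product whose removal leaves the smallest standard deviation is just one assumed way of reading the result.

```python
import statistics

def leave_one_out_stats(values):
    """Return one (mean, stdev) pair per omitted item."""
    results = []
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]  # all items except item i
        results.append((statistics.mean(rest), statistics.stdev(rest)))
    return results

# Hypothetical measurements: nine ordinary products and one far-off product.
sample = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 14.0]
loo = leave_one_out_stats(sample)

# Index of the product whose removal leaves the smallest standard deviation.
suspect = min(range(len(sample)), key=lambda i: loo[i][1])
print(suspect, loo[suspect])
```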

The problems are:

1. A defective product causes a shift in the mean, covering up other defective products (product 13 makes product 17 seem less bad in the figure below).
2. I can add bounds (a certain factor) above and below the population mean and remove the outlying products, but this is primitive.

My questions: is there any more statistically sound way of determining the probability that a certain product lies outside the population? And is there any literature you could recommend on finding populations from a large group of samples?

have a look [here](https://stats.stackexchange.com/questions/121071/can-we-use-leave-one-out-mean-and-standard-deviation-to-reveal-the-outliers/121075#121075) — user603, Sep 05 '18 at 13:50
I believe what you're asking falls under [statistical process control](https://en.wikipedia.org/wiki/Statistical_process_control). — Digio, Sep 05 '18 at 13:58
@user603, thank you for the extensive answer. I do not have enough reputation to comment there, so I will do it here. I have some questions: 1. In your first figure the outliers show a lower score. Is this not expected? The numerator in out_1 is greater than zero, but the denominator increases faster (due to the squared term in the standard deviation), so the value will be small. Or is this covered up if many values show large deviations from the leave-one-out mean? 2. What is the name of the outlier detection function you propose? Can you recommend any literature on it? — H. Vabri, Sep 07 '18 at 08:40
1. yes, your intuition is correct. 2. The outlier detection function based on the median and mad was first proposed by Gauss. As you can imagine, you can find many papers on it. [Here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=2ahUKEwit3fntwqjdAhUMLlAKHX-FCvEQFjACegQICBAC&url=https%3A%2F%2Fdipot.ulb.ac.be%2Fdspace%2Fbitstream%2F2013%2F139499%2F1%2FLeys_MAD_final-libre.pdf&usg=AOvVaw09u_fqEZY2-z1uRwKLJgNR) is one recent non technical one. [Here](https://www.jstor.org/stable/1268758?seq=1#page_scan_tab_contents) is an older more technical one. — user603, Sep 07 '18 at 09:04

BruceET · Accepted Answer · 2019-02-19T08:23:14.027

What you mean by defectives is not clear. And what you mean by typical product-to-product variation is not clear. A reasonable solution to your problem requires clarification on at least one of these points.

Absolute standards to identify defectives. If defectives are typically above or below certain boundaries, then the 'primitive' method in your item (2) should be just fine. In pharmaceutical manufacturing there is often a baseline potency (call it 100%), and any lot falling (let's say) below 80% or above 120% of the standard potency level is automatically discarded. The discarded batches may actually contain perfectly effective drug, but the irregularity in production "speaks for itself" and it is deemed too expensive or risky to try to rescue the non-conforming lots.

If you suspect that 10% of the product may be defective, having extreme values, a variant of this method might be to sort the data and 'trim' off the top and bottom 5% of the data, leaving (one hopes) a representative 90% of the sample intact. [This method is the basis of 'trimmed means'.]
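
As a rough illustration of the trimming idea above, here is a small Python sketch. The function name `trim`, the 5% default, and the toy data are assumptions made for the example, not part of any particular library.

```python
def trim(values, proportion=0.05):
    """Drop the lowest and highest `proportion` of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of items cut from each end
    return ordered[k:len(ordered) - k]

# Hypothetical data: 100 measurements, here simply the integers 1..100.
data = list(range(1, 101))
kept = trim(data)
print(len(kept), kept[0], kept[-1])
```

Note that trimming always removes the same number of items from each end, whether or not those items are actually defective.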

Regard outliers as defectives. If you have no reasonable standard and just want to get rid of 'outliers', that is a different problem, which has been "solved" in various ways. One of them is the procedure in your item (1). In your graph, getting rid of #13 does make #17 look bad by comparison; to a lesser extent, getting rid of both #13 and #17 may make #4 look 'iffy', and there may be no end to what gets discarded. If high values are 'bad', then it would be unclear where to stop trimming high values from the sample below, which is from an exponential distribution $(\mu=5).$

Another solution to identifying outliers is to use the rule behind the outliers often shown in boxplots. This rule uses the interquartile range (IQR) to measure dispersion. The IQR is the distance between the lower quartile $(Q_1)$ and upper quartile $(Q_3),$ so it spans the 'middle half' of the data. This measure has the advantage of not being sensitive to outliers. Roughly speaking, any observation below $Q_1 - 1.5\,\text{IQR}$ or above $Q_3 + 1.5\,\text{IQR}$ is called an outlier.
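
The boxplot rule above can be sketched in Python with the standard library. The quartile convention (`method="inclusive"`) and the sample values are assumptions for the example; different software computes quartiles slightly differently, so borderline points may be classified differently elsewhere.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical measurements: nine ordinary products and one far-off product.
sample = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 14.0]
print(boxplot_outliers(sample))
```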

A difficulty with this method is that such outliers are regular features of some kinds of data: exponential samples almost always have them, and even normal $(\mu=100,\, \sigma=15)$ samples characteristically have a few, as shown in the 10 boxplots of samples of size 50 below. [There happen to be more outliers here than usual. About 36% of normal samples of size 50 have outliers; the average number of outliers per sample, among all samples, is about 0.6.]
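
One way to check how often boxplot outliers turn up in purely normal data is a small simulation, sketched here in Python. The seed, the number of simulated samples, and the quartile convention are all assumptions, so the printed figures will vary somewhat with those choices.

```python
import random
import statistics

def count_boxplot_outliers(values, k=1.5):
    """Count values outside the boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return sum(1 for x in values if x < q1 - k * iqr or x > q3 + k * iqr)

rng = random.Random(2019)  # arbitrary seed, only for reproducibility
n_samples = 2000           # number of simulated samples (assumed)

counts = [
    count_boxplot_outliers([rng.gauss(100, 15) for _ in range(50)])
    for _ in range(n_samples)
]

frac_with_outliers = sum(c > 0 for c in counts) / n_samples
mean_outliers = sum(counts) / n_samples
print(frac_with_outliers, mean_outliers)
```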

A third method is described in @User603's link. It uses a different measure of dispersion that is not sensitive to outliers. Still other methods of identifying items made by a process 'out-of-control' mentioned in @Digio's link may be applicable.
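
A rule based on the median and the median absolute deviation (MAD), as mentioned in the comments, might look like the following Python sketch. The cutoff of 3 and the scaling constant 1.4826 (which makes the MAD comparable to a standard deviation for normal data) are common conventions assumed here, not values taken from the linked answer, and the data are made up.

```python
import statistics

def mad_outliers(values, cutoff=3.0):
    """Return values whose distance from the median exceeds cutoff * scaled MAD."""
    med = statistics.median(values)
    mad = 1.4826 * statistics.median([abs(x - med) for x in values])
    if mad == 0:
        return []  # more than half the values are identical; rule not usable
    return [x for x in values if abs(x - med) / mad > cutoff]

# Hypothetical measurements with two far-off products (12.0 and 14.0).
sample = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 12.0, 14.0]
print(mad_outliers(sample))
```

Because the median and MAD barely move when a few extreme values are present, one extreme product is less likely to hide another than with a rule based on the mean and standard deviation.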

Minitab statistical software uses a method similar to your item (1) in regression output to call attention to observations that have 'unusually large residuals'.

The problem with any outlier rule is that individuals tagged as 'outliers' may not be 'defective'. After some outliers are removed, a second iteration may tag even more. Even stopping at the first iteration may leave you with a sample that does not really express the usual range of product-to-product variation.
