Check if sample is representative of a larger sample

Question

From a larger sample of tabular data, I have picked certain rows that meet a certain condition (this condition is unrelated to the actual data in the rows).

Now, I want to know if the distribution of this subset that I have created is similar to the distribution of the original, larger sample.

What test(s) can I use for this purpose?

Thanks! I appreciate the help.

What does your data look like? Is it discrete or continuous? — soakley, May 13 '14 at 18:27
It's continuous. The data is basically words with sentiment scores (in the range of -5 to 5). — Debi, May 13 '14 at 18:31
What if the data is discrete instead? Can the Kolmogorov-Smirnov test be used in that case? — rezita, Jun 10 '17 at 21:39

score 4 · Answer 1 · edited Apr 13 '17 at 12:44

You could test whether several statistics that describe a distribution are the same in the subsample and in the remaining sample. For example, you could conduct tests for:

mean difference
median difference
stochastic dominance
variance difference
shape
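
As a rough sketch of how the first items in this list could be checked, here is a permutation test for the difference in a summary statistic between the subsample and the remaining rows. This is one possible approach rather than the specific test the answer has in mind; the function name `permutation_test`, the simulated scores, and the 50/200 split are assumptions made for illustration only.

```python
import random
import statistics


def permutation_test(sub, rest, stat=statistics.fmean, n_perm=1000, seed=0):
    """Two-sided permutation p-value for stat(sub) - stat(rest)."""
    rng = random.Random(seed)
    observed = stat(sub) - stat(rest)
    pooled = list(sub) + list(rest)
    n_sub = len(sub)
    extreme = 0
    for _ in range(n_perm):
        # Under the null, the subsample labels are exchangeable.
        rng.shuffle(pooled)
        diff = stat(pooled[:n_sub]) - stat(pooled[n_sub:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical data: sentiment scores in [-5, 5], split by an unrelated rule.
data_rng = random.Random(42)
scores = [data_rng.uniform(-5, 5) for _ in range(250)]
sub, rest = scores[:50], scores[50:]

for name, stat in [("mean", statistics.fmean), ("median", statistics.median)]:
    diff, p = permutation_test(sub, rest, stat)
    print(f"{name}: difference = {diff:.3f}, p = {p:.3f}")
```

A large p-value here only means no difference was detected; it is not evidence that the two groups are the same.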

While you are at it, since you are interested in similarity, I would also explore tests for equivalence of all such measures (for example, using TOST, the two one-sided tests procedure), probably combining inferences from difference and equivalence tests.

Something else you may want to consider: why are you interested in this similarity? The answer to this question may help you decide which, if any, such tests you may like to explore. For example, if your sample size is smallish, you may not have enough power for the Kolmogorov–Smirnov test mentioned by soakley, although you might still have enough power to make inferences about, say, the sample mean. If you are only interested in comparing sample means, that may be OK for your purposes.

Hi, as a follow up: The tost test (in R) asks for a magnitude of region of similarity. Any idea what this is? — Debi, May 13 '14 at 19:00
Yup: you need to make a decision about *how large a difference is meaningful*. In other words, the minimum size of difference that you are willing to accept as important. For more info, see http://stats.stackexchange.com/tags/tost/info (and pay attention to what $\Delta$ is). — Alexis, May 13 '14 at 19:36
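
To make the role of $\Delta$ concrete, here is a minimal sketch of a TOST for the difference in means. It uses a large-sample normal approximation with an unpooled standard error, which is a simplification and not necessarily what the R `tost` function computes. The name `tost_mean_z`, the simulated scores, and the margin `delta=0.5` are hypothetical choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def tost_mean_z(sub, rest, delta):
    """Large-sample TOST: are the two means within +/- delta of each other?"""
    diff = fmean(sub) - fmean(rest)
    se = sqrt(variance(sub) / len(sub) + variance(rest) / len(rest))
    norm = NormalDist()
    # First null: diff <= -delta. Rejected when diff sits well above -delta.
    p_lower = 1 - norm.cdf((diff + delta) / se)
    # Second null: diff >= +delta. Rejected when diff sits well below +delta.
    p_upper = norm.cdf((diff - delta) / se)
    # Equivalence is claimed only if both one-sided tests reject,
    # so the overall p-value is the larger of the two.
    return diff, max(p_lower, p_upper)


# Hypothetical data: sentiment scores clipped to [-5, 5].
rng = random.Random(1)
scores = [max(-5.0, min(5.0, rng.gauss(0, 2))) for _ in range(400)]
sub, rest = scores[:80], scores[80:]

diff, p = tost_mean_z(sub, rest, delta=0.5)  # delta is an assumed margin
print(f"mean difference = {diff:.3f}, TOST p = {p:.4f}")
```

A small p-value supports the claim that the means differ by less than `delta`. The conclusion depends entirely on the margin chosen, which is why it has to come from subject-matter judgment rather than from the data.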

score 4 · Answer 2 · answered May 13 '14 at 18:22


Since you want to compare the entire distributions, I'd recommend the two-sample Kolmogorov-Smirnov test.

More information can be found here:

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
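
For illustration, the two-sample statistic can be sketched in a few lines of plain Python. The p-value below uses the asymptotic Kolmogorov series, which assumes continuous data without ties and reasonably large samples. The function names and the simulated data are assumptions for this sketch, and the subset is compared with the remaining rows so that the two samples do not overlap.

```python
import random
from bisect import bisect_right
from math import exp, sqrt


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # Both ECDFs are step functions, so the gap only changes at data points.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Approximate two-sided p-value from the Kolmogorov distribution."""
    lam = d * sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return 1.0
    total = sum((-1) ** (k - 1) * exp(-2 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * total))


# Hypothetical data: sentiment scores in [-5, 5].
rng = random.Random(7)
scores = [rng.uniform(-5, 5) for _ in range(300)]
sub, rest = scores[:60], scores[60:]

d = ks_statistic(sub, rest)
p = ks_pvalue_asymptotic(d, len(sub), len(rest))
print(f"D = {d:.3f}, approximate p = {p:.3f}")
```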

Answered by soakley.


One of the assumptions of the two-sample test is that the two samples are mutually independent. Because your subset is drawn from the larger sample, the two overlap, so it may be more appropriate to perform a one-sample KS test of the sample versus the parent population. – soakley May 13 '14 at 18:37
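
One way to act on this comment, sketched under assumptions, is to treat the larger sample as the parent population and ask how unusual the subset's KS distance from it is compared with random subsets of the same size. This is a Monte Carlo variant, not the textbook one-sample KS test, but because the null distribution is simulated it does not rely on continuity and also tolerates ties or discrete values. The names `ecdf_gap` and `subset_vs_parent_test` and the example data are hypothetical.

```python
import random
from bisect import bisect_right


def ecdf_gap(sample, parent_sorted):
    """KS distance between a sample's ECDF and the parent's ECDF.

    Assumes every sample value also occurs in the parent, so checking
    the parent's data points covers every jump of both step functions.
    """
    s = sorted(sample)
    n, m = len(s), len(parent_sorted)
    return max(abs(bisect_right(s, x) / n - bisect_right(parent_sorted, x) / m)
               for x in parent_sorted)


def subset_vs_parent_test(subset, parent, n_sim=1000, seed=0):
    """Monte Carlo p-value: is the subset's gap large for a random subset?"""
    rng = random.Random(seed)
    parent = list(parent)
    parent_sorted = sorted(parent)
    observed = ecdf_gap(subset, parent_sorted)
    hits = 0
    for _ in range(n_sim):
        # Draw a random subset of the same size, without replacement.
        draw = rng.sample(parent, len(subset))
        if ecdf_gap(draw, parent_sorted) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)


# Hypothetical data: integer sentiment scores from -5 to 5 (many ties).
data_rng = random.Random(3)
parent = [data_rng.randint(-5, 5) for _ in range(300)]
subset = parent[:60]

d, p = subset_vs_parent_test(subset, parent)
print(f"D = {d:.3f}, Monte Carlo p = {p:.3f}")
```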
