If I have T total records, how big should my sample size be for a valid analysis?

Question

I have a bunch of records, T in total. I want to know how many of these I can get away with analyzing in order to extrapolate the analysis to the entire population T.

I know this is a basic question and largely depends on how much error I can accept, but can anyone tell me the math?

It also depends on what the records represent (i.e., are they independent observations, a sequence of events on independent units in time, or a sequence of events for a single unit) and what kind of analysis you plan to do (summarize the data, compare two groups, regress an outcome on some covariates). Can you provide more details? It would also be helpful to know why you want to subsample instead of analyzing the full population. — Jeremy Coyle, Jan 30 '14 at 16:04
Independent observations. Cannot get to them all due to limited resources. — Remy F, Jan 30 '14 at 16:09
Basically I want to know, if I have T records, how many should I sample to get this-and-that margin of error, this-and-that confidence, etc. Or, alternatively, if I sample X out of T records, what is the resulting error/confidence/etc. Believe it or not I have Googled the heck out of this and can't find a straight, easy to understand answer/equation out there. This is all I want to know. — Remy F, Jan 30 '14 at 16:16
Try googling for [power analysis](http://en.wikipedia.org/wiki/Statistical_power). It allows you to answer questions like "*how many should I sample to get this-and-that margin of error, this-and-that confidence*". — Marc Claesen, Jan 30 '14 at 16:45
Can you say what you want a margin of error for, @RemyF? Eg, do you want to know the arithmetic mean of your population +/- some MoE, the proportion of your population w/ some attribute +/-, the SD of your population, etc? — gung - Reinstate Monica, Jan 30 '14 at 18:38
@gung I am not as versed in the jargon as everyone else here and I am not after something super rigorous. I just know that I have 5000 population and can't analyze them all. I want to know that if I choose S of them at random and analyze them, how confident I can be that they represent the population to some degree (if I were to perform the same analysis for all 5000) — Remy F, Jan 30 '14 at 18:42
There is no single answer to that, @RemyF. It depends on the nature of the analysis you are going to do & what you want to know. If you tell us that, we can try to help you further. If you can't, this question cannot be answered. — gung - Reinstate Monica, Jan 30 '14 at 18:44
Okay, then how many different types of common analyses are there and how does that change the response? I am assuming that the most common is taking the mean of various metrics. — Remy F, Jan 30 '14 at 18:45
(Copied & pasted from above) "Eg, do you want to know the arithmetic mean of your population +/- some MoE, the proportion of your population w/ some attribute +/-, the SD of your population, etc?" There are potentially innumerable types of analyses. Moreover, there is no way to know how much data you'll need to do what you want to do if you don't know what you want to do. — gung - Reinstate Monica, Jan 30 '14 at 18:48
For the most part, I believe just arithmetic means of various metrics of the population. — Remy F, Jan 30 '14 at 18:51
@RemyF People keep telling you that they need more information but you are not providing it! What kind of parameter are we trying to estimate? Is it an average? Is it a proportion? There is not enough information here. — bdeonovic, Jan 30 '14 at 20:18

score 3 · Answer 1 · edited Jun 11 '20 at 14:32

Assuming you want to estimate the mean of a variable with a certain margin of error, you can use $E=z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, where $E$ is the margin of error, $z_{\alpha/2}$ is the normal distribution quantile for the confidence level you want (1.96 for a 95% confidence interval), $\sigma$ is the standard deviation of the variable you're forming the confidence interval for, and $n$ is the sample size. Solving for the sample size gives $n=\left(z_{\alpha/2}\,\sigma/E\right)^2$.

Obviously you don't know $\sigma$, but you can take a small pilot sample (say $n=100$) and estimate it with the sample standard deviation to get a first approximation.
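As a rough illustration, the formula can be inverted to get the sample size for a target margin of error. The sketch below is not part of the original answer: the function name `required_sample_size` and its parameters are made up for this example, and the optional finite population correction (raised in the comments, since the population here is only 5000) is an added assumption using the standard adjustment $n = n_0 / (1 + (n_0 - 1)/N)$.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence=0.95, population=None):
    """Smallest n such that z * sigma / sqrt(n) <= margin.

    sigma      : (estimated) standard deviation, e.g. from a pilot sample
    margin     : desired margin of error E, in the same units as sigma
    confidence : confidence level, e.g. 0.95
    population : optional population size N; if given, apply the
                 finite population correction (an assumption added here,
                 not part of the formula in the answer)
    """
    # Two-sided normal quantile z_{alpha/2}, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Invert E = z * sigma / sqrt(n)  ->  n0 = (z * sigma / E)^2
    n0 = (z * sigma / margin) ** 2
    if population is not None:
        # Sampling without replacement from N records needs fewer of them
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Example: a 0/1 variable has sigma of at most 0.5; target margin 0.05
print(required_sample_size(0.5, 0.05))                   # no correction
print(required_sample_size(0.5, 0.05, population=5000))  # with correction
```

With the worst-case $\sigma = 0.5$ of a proportion, a 5% margin, and $N = 5000$, the corrected result is 357, which appears to be where the "s=357" figure quoted in the comments comes from. For a mean of some other metric, plug in the pilot estimate of $\sigma$ and a margin in that metric's units.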

In general, you can see from this formula that the margin of error shrinks in proportion to $1/\sqrt{n}$, which should give you some intuition that there are diminishing returns for large $n$: quadrupling the sample size only halves the margin of error. This "root n" rate is common to anything that behaves like a mean (many parameters, including regression coefficients, are essentially means). This is all a result of the central limit theorem.
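To make the diminishing returns concrete, here is a small sketch that evaluates $E=z_{\alpha/2}\,\sigma/\sqrt{n}$ for a few sample sizes. The value $\sigma = 10$ and the helper name `margin_of_error` are arbitrary choices for illustration, not values from the question.

```python
import math


def margin_of_error(sigma, n, z=1.96):
    """Margin of error E = z * sigma / sqrt(n) for a sample mean."""
    return z * sigma / math.sqrt(n)


sigma = 10.0  # hypothetical standard deviation, for illustration only
for n in (100, 400, 1600, 6400):
    # Each row uses 4x the previous sample size; E is cut in half each time
    print(f"n = {n:5d}  ->  E = {margin_of_error(sigma, n):.3f}")
```

Each fourfold increase in $n$ buys only a halving of the margin of error, so the cost of extra precision grows quickly.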

See http://stattrek.com/estimation/margin-of-error.aspx for more exposition.

answered Jan 30 '14 at 16:46 by Jeremy Coyle · edited Jun 11 '20 at 14:32 by Community

I'm not sure what you're asking, but confidence intervals and margins of error relate to estimates of some parameter, most commonly estimates of the population mean. This is why I asked you to clarify what kind of analysis you wanted to do. – Jeremy Coyle Jan 30 '14 at 17:52
This is why I honestly find statistics frustrating. Not attacking you directly, but it seems virtually impossible to get a straight answer for anything. I understand that confidence/errors relate to estimates, but I want to know what that means. Is it saying "Given a size of 1176 results, we can be 95% sure that the real value lies between .95x and 1.05x given that our margin of error is 5%"? – Remy F Jan 30 '14 at 17:54
Sorry you're frustrated. It is most certainly not saying that. A 95% confidence interval is an interval that has the property that "95% of intervals constructed in this way contain the true population parameter". Which is, admittedly, not a straight answer. To see why that's the case, you would need to read up on the concept of a sampling distribution. I'd suggest doing some reading or taking an intro stats course. Maybe look into MIT OpenCourseWare or Khan Academy? – Jeremy Coyle Jan 30 '14 at 18:02
How is that different from what I said? If I have a confidence interval of 95%, and a margin of error of 5%, then aren't I saying that the interval ".95x to 1.05x" for some recorded value x contains the true population parameter 95% of the time? – Remy F Jan 30 '14 at 18:04
This is now pretty far off topic from the original question, but your statement "95% sure that the real value lies between ..." reads like a probability statement about the true parameter value, which goes against a basic assumption of the frequentist paradigm: that the true parameter value is fixed. See http://en.wikipedia.org/wiki/Confidence_interval#Meaning_and_interpretation – Jeremy Coyle Jan 30 '14 at 18:16
I don't understand why a simple mathematical expression can't be given for what a confidence interval represents here. I have a population of p=5000 for example. I have a sample size of s=357, which for some reason corresponds to 95% confidence 5% error. I analyze the data for s records: I get a mean value of x for, say, income level. Now, what can I technically say about the population value (i.e. the value for all 5000) of income level, given all this? – Remy F Jan 30 '14 at 18:18
I think the key issue is that the OP has a *finite sample*, thus, the standard formulas don't apply, you need to use a finite sample correction. – gung - Reinstate Monica Jan 30 '14 at 18:35
What do you mean? This is getting to be horribly complicated for (IMO) a simple question. – Remy F Jan 30 '14 at 18:43
It's not that we're trying to make things complicated; they are *potentially* complicated because there's not enough to go on and we don't want to assume things that mightn't be true. This *is* a simple question and it *does* have a simple answer: 5000 of them, in fact, ranging from a [sample size of 1](http://stats.stackexchange.com/questions/83186) through the entire population. Nobody is happy with such a broad range of possibilities, though, because they don't narrow your options. If you want to make progress, you need to supply more information, as many commenters have been requesting. – whuber Jan 30 '14 at 21:22
@Gung all samples are finite: I presume you mean finite *population* and you are referring to a finite population correction. – whuber Jan 30 '14 at 21:23
@whuber, someone w/ enough grit & determination could keep sampling indefinitely. (Yes, I meant finite *population* ;-). – gung - Reinstate Monica Jan 30 '14 at 21:25
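
The interpretation debated in the comments above ("95% of intervals constructed in this way contain the true population parameter") can be checked by simulation. The following is only a sketch under assumed conditions: normally distributed data with a made-up true mean of 50 and standard deviation of 10, samples of size 50, and the helper name `coverage` invented for this example.

```python
import math
import random
from statistics import mean, stdev


def coverage(true_mean=50.0, sigma=10.0, n=50, trials=2000, z=1.96, seed=0):
    """Fraction of intervals (sample mean +/- z * s / sqrt(n)) that
    contain the fixed true mean, over many repeated samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample; the true mean never changes, the interval does
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        half_width = z * stdev(sample) / math.sqrt(n)
        if abs(mean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / trials


print(coverage())
```

The printed fraction should land near 0.95. The randomness is in the interval, which moves from sample to sample, while the parameter stays fixed; that is the distinction the comments are drawing.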
