I understand that the p-value is the probability that the correlation we got happened by random chance, and that we typically want to keep it under 5%. I am okay with this logic when it applies to sampling from a dataset. For example, if I sample 300 out of 1000 individuals, the p-value tells me the probability that the correlation just so happens to appear in our sample; if we picked any other sample of 300, we might not find the same correlation, and that is why we need the p-value here. Is my understanding correct?
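To make the sampling scenario concrete, here is a minimal sketch of how the correlation can differ from one sample of 300 to the next. Everything in it is an assumption for illustration: the population is synthetic, the 0.2 slope is arbitrary, and `pearson_r` is a hand-written helper rather than a library function.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 1000 individuals with a weak linear relationship.
population = []
for _ in range(1000):
    x = rng.gauss(0, 1)
    y = 0.2 * x + rng.gauss(0, 1)
    population.append((x, y))

# Correlation computed on all 1000 individuals.
full_r = pearson_r(*zip(*population))

# Correlation in 200 different random samples of 300 individuals each.
sample_rs = []
for _ in range(200):
    sample = rng.sample(population, 300)
    sample_rs.append(pearson_r(*zip(*sample)))

print(f"correlation in all 1000: {full_r:.3f}")
print(f"range across samples of 300: {min(sample_rs):.3f} to {max(sample_rs):.3f}")
```

The spread of `sample_rs` around `full_r` is the sample-to-sample variation described in the paragraph above.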

Further, I am having trouble understanding the intuition behind the p-value for a correlation computed from all the data points. Let's say I do not sample: I take all 1000 individuals I have and get a correlation value. How do I interpret the p-value for this? Because the possibility of selection bias no longer exists, whatever correlation I get should technically be the correct correlation, and it cannot have happened by chance. So should I still care about the p-value if I use the entire dataset instead of sampling?

Helene
  • Your understanding of p-values could use some refinement: please visit http://stats.stackexchange.com/questions/31/what-is-the-meaning-of-p-values-and-t-values-in-statistical-tests. Your question seems otherwise to be the same as http://stats.stackexchange.com/questions/2628, but it's impossible to tell because you haven't explained where those individuals came from or what you are trying to accomplish. Are you trying to draw conclusions about *only* those 1000 individuals or do you hope to make more universal inferences? – whuber Nov 30 '16 at 06:32
  • "To keep that under 5%": please elaborate. "p-value for the correlation calculation"? –  Nov 30 '16 at 10:55
  • The first paragraph may be OK as a question. The second paragraph should become a separate question, with modification of course. –  Nov 30 '16 at 11:04
  • @whuber, apologies, I want to draw conclusions about only those 1000 individuals. For example, I have 1000 college seniors at a specific college, and I want to measure the correlation between the variable income and the variable GPA. There are libraries out there that can do this, but when I ran my data through those libraries, I got a p-value along with the correlation. So now I'm confused about this p-value, as I am running the correlation on the entire set and I want to draw conclusions only about those 1000 individuals. – Helene Nov 30 '16 at 18:06
  • @whuber, so I just read the link you posted. That's an interesting way of thinking about it: assuming an underlying hypothetical population! I will for sure keep that in mind. But I don't think my scenario is covered by that case, as I only want to measure the 1000 individuals I have on hand and I do not care about anyone else. Am I thinking about this the wrong way? – Helene Nov 30 '16 at 18:09
  • I hear you, but I have to question you anyway: just what good would it do to evaluate that correlation for these 1000 people? I can't help thinking you *really* intend to think that next year's seniors will have a comparable correlation, or perhaps seniors at comparable colleges will have a comparable correlation. If all you want to do is *describe* the correlation that actually pertains, *without implying anything about any other group of people at any other time or place,* then go ahead and compute the correlation and report it. – whuber Nov 30 '16 at 18:09
  • @whuber, you're absolutely right! Thanks for keeping my thoughts on the right track. – Helene Nov 30 '16 at 20:40
  • It may help users give a suitable answer if you describe your data and/or your project. –  Dec 04 '16 at 15:23
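
The comments above mention that correlation libraries report a p-value next to the coefficient even when the whole group of 1000 is used. A minimal sketch of one way such a number can be read is a permutation test: shuffle one variable to break any link with the other, and see how often the shuffled correlation is at least as large as the observed one. This is only an illustration under stated assumptions. The `gpa` and `income` values are synthetic stand-ins, `pearson_r` is a hand-written helper, and library p-values are usually derived from a distributional assumption rather than from shuffling, so the two need not agree exactly.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Synthetic stand-ins for the 1000 seniors (assumed values, not real data).
gpa = [rng.gauss(3.0, 0.4) for _ in range(1000)]
income = [20000 + 5000 * g + rng.gauss(0, 8000) for g in gpa]

# Descriptive correlation for exactly these 1000 individuals.
observed_r = pearson_r(gpa, income)

# Permutation test: shuffle income so it is unrelated to GPA by construction,
# and count how often |r| is at least as large as the observed |r|.
n_perm = 500
shuffled = income[:]
extreme = 0
for _ in range(n_perm):
    rng.shuffle(shuffled)
    if abs(pearson_r(gpa, shuffled)) >= abs(observed_r):
        extreme += 1

# Add-one correction keeps the estimate strictly above zero.
p_value = (extreme + 1) / (n_perm + 1)

print(f"observed r = {observed_r:.3f}, permutation p-value = {p_value:.4f}")
```

Here `observed_r` is the purely descriptive quantity: it is what it is for these 1000 people. The p-value only matters if the question is whether a pairing this strong could plausibly arise from unrelated variables, which is an inferential question rather than a descriptive one.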

0 Answers