
I have a data set containing the entire population I'm interested in, not just a sample. However, the data contain many outliers, which we have determined are due to widespread incorrect reporting. I want to run a linear regression on these data (even though they are not normally distributed), and I want to know which type of residual would be more useful.

From russellpierce's answer here, "Studentizated scores uses Student's/Gosset's calculation for estimating the population variance/standard deviation from the sample variance/standard deviation (s). In contrast, Standardized scores (a noun, a particular type of statistic, the Z score) are said to use the population standard deviation (σ)."

I'm working in Python on this, and already have both types of residuals automatically generated. I know studentized residuals are typically used for outlier detection, but I assume that's usually because data scientists only have a sample to work from? Which one would be better to use for outlier detection when I know I have the data for the entire population?
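To make the comparison concrete, here is a minimal sketch of the two residual types for a one-predictor regression. It assumes that "standardized" means internally studentized residuals (scaled by the full-fit residual standard error) and "studentized" means externally studentized residuals (scaled by the standard error with that observation left out), which is how regression software commonly uses the terms. The function name `residual_diagnostics` and the data, including the deliberately misreported seventh value, are made up for illustration.

```python
import math

def residual_diagnostics(x, y):
    """Return (internal, external) studentized residuals for y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Raw residuals and leverages (diagonal of the hat matrix).
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Residual variance estimated from the full fit.
    s2 = sum(e ** 2 for e in resid) / (n - p)

    # Internally studentized: every residual is scaled by the same s2.
    internal = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, leverage)]

    # Externally studentized: equivalent to re-estimating the variance
    # with observation i deleted, via the standard closed-form identity.
    external = [r * math.sqrt((n - p - 1) / (n - p - r ** 2)) for r in internal]
    return internal, external

# Hypothetical data: roughly y = 2x, with the 7th value misreported.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 30.0, 16.1, 17.9, 20.2]

internal, external = residual_diagnostics(x, y)
for i, (r, t) in enumerate(zip(internal, external)):
    print(f"obs {i}: internal={r:.2f}, external={t:.2f}")
```

In this sketch both versions estimate the residual variance from the fitted model, so neither one relies on a known population σ. The difference is only whether the suspect observation is allowed to inflate its own scale estimate, which is why the external version reacts more strongly to a single large residual.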

Kelsey
  • You might find it more constructive to conceive of your data as a sample, too, if only to create the opportunity to model and account for the errors of "incorrect reporting." – whuber Jun 26 '19 at 15:19

0 Answers