
I read the explanation by Ocram here about how to calculate the stddev of coefficients in linear regression.

I also ran an experiment with my own sample data. I have test1, which contains 1000 samples; I then created

test2 = pandas.concat([test1, test1])

and ran the same regression again. Sure enough, the standard deviation of each coefficient ($\beta$) decreased.

However, I cannot picture why this happens. Can anyone provide an intuitive, visual explanation of why the stddev goes down when I duplicate my samples?
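
Roughly, the experiment looks like the sketch below. The synthetic data, the column names x and y, and the use of statsmodels are stand-ins for illustration; my actual test1 and fitting code are not shown here.

    import numpy
    import pandas
    import statsmodels.formula.api as smf

    # Synthetic stand-in for test1: one predictor x and a response y.
    rng = numpy.random.default_rng(0)
    test1 = pandas.DataFrame({"x": rng.normal(size=1000)})
    test1["y"] = 2.0 + 3.0 * test1["x"] + rng.normal(size=1000)

    test2 = pandas.concat([test1, test1])    # every row duplicated

    fit1 = smf.ols("y ~ x", data=test1).fit()
    fit2 = smf.ols("y ~ x", data=test2).fit()

    print(fit1.bse)    # reported standard errors of the coefficients on test1
    print(fit2.bse)    # noticeably smaller on test2 (by roughly a factor of 1/sqrt(2))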

eight3
  • Not a visual explanation, but: you're dividing by sample size, which has doubled. There's nothing specific to regressions here. Take any set of numbers and calculate the standard deviation. Now duplicate them and calculate it again. You'll find the second is smaller because you are dividing by sample size. – mkt Oct 08 '19 at 12:22
  • More data leads to greater precision in the estimates. – whuber Oct 08 '19 at 12:54
  • @whuber, but those 'more data' are actually **duplicated** data, right? – eight3 Oct 08 '19 at 12:55
  • The formulas don't know that. – whuber Oct 08 '19 at 12:58
  • The formula knows everything :). It is rather dangerous to accept the viewpoint that something this simple cannot be explained mathematically. @mkt: if we were dividing by sample size, the variance would not decrease; the cause is that we are dividing by sample size minus 1. – Hooman Oct 08 '19 at 13:33
  • @Hooman please don't confuse a formula with knowledge of its application. Statistics is not mathematics. In this case, applying the standard formulas involves an implicit assumption of iid errors, but duplicating the data destroys that assumption: it's no longer even remotely plausible. Thus, it's the wrong formula altogether, however valid it might be as a mathematical expression. – whuber Oct 08 '19 at 13:59
  • I'm voting to close this question as off-topic because the OP uses the false premise that repeating previously sampled data points improves the accuracy of the estimates. – Michael R. Chernick Oct 08 '19 at 18:22
  • @whuber, what I mean by 'should be explainable by the formula' is that although the formula for the variance is not correct, we should be able to track where and why it goes wrong when we duplicate the data. In this case, as you already wrote, the derivation assumes that the noise samples are i.i.d. – Hooman Oct 08 '19 at 18:48
  • @MichaelChernick I can't see where I said "repeating previously sampled data point improves the accuracy of the estimates". I only said that the std dev of beta (the fitted coefficients) becomes smaller, which is true: you can easily run a test using your own data. Please revert what you have done to this post! – eight3 Oct 09 '19 at 07:03
  • It is not a matter of the exact words you said. The exact words you used were "Can anyone provide a visual explanation why the stddev goes down when I duplicate samples?" Duplicating data does nothing to change the actual standard deviation, but applying a formula as if the duplicates were independent observations creates an estimate that is smaller than it was without the duplicates. – Michael R. Chernick Oct 09 '19 at 14:40
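
To make the point in the comments concrete: the usual formula $\operatorname{Var}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}$ assumes independent rows. Duplicating every row doubles $X^\top X$ while leaving $\hat\sigma^2$ essentially unchanged, so each reported standard error shrinks by about $1/\sqrt{2}$ even though no new information was added. The sketch below, on synthetic single-predictor data (an assumption for illustration), computes that formula by hand.

    import numpy

    rng = numpy.random.default_rng(1)
    n = 1000
    x = rng.normal(size=n)
    y = 2.0 + 3.0 * x + rng.normal(size=n)

    def ols_se(x, y):
        """Standard errors from var(beta_hat) = sigma2 * (X'X)^-1, which assumes i.i.d. rows."""
        X = numpy.column_stack([numpy.ones_like(x), x])    # intercept + slope
        beta = numpy.linalg.solve(X.T @ X, X.T @ y)        # OLS estimate
        resid = y - X @ beta
        sigma2 = resid @ resid / (len(y) - X.shape[1])     # RSS / (n - p)
        cov = sigma2 * numpy.linalg.inv(X.T @ X)
        return numpy.sqrt(numpy.diag(cov))

    se1 = ols_se(x, y)
    se2 = ols_se(numpy.tile(x, 2), numpy.tile(y, 2))       # duplicated sample

    # sigma2 barely changes and X'X doubles, so each standard error
    # shrinks by about 1/sqrt(2), even though duplication adds no information.
    print(se1 / se2)    # roughly [1.41, 1.41]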

0 Answers