
I have $(x, y)$ data pairs and I am fitting $y = ax + b$ with a standard OLS model. My questions are:

  1. If I take a particular pair, say $(x_5, y_5)$, and add it to the data set many times, potentially a million times, what effect does that have on the slope and intercept? What is a good way to argue about the effect theoretically? I understand that adding $(x_5, y_5)$ multiple times is the same as putting that many times more weight on $(x_5, y_5)$. Does this mean that OLS will try harder to pass through that point than before, leading to different estimates of $a$ and $b$? What if the $x_5$ of the added points is really close to $\bar x$; does that change things? Please note that I can test all of this easily in R, but I am looking for a theoretical explanation.

  2. The second question comes from this post and @gung's answer (from Clustering in data), which is really informative. How can one prove that resampling a particular value of $x$, say $x_5$, leads to a more accurate approximation of the vertical position $f(x_5)$? What are the implications of this for the design of experiments? In other words, should we artificially include more observations around the values of $x$ in which we are most interested?

  3. If I bin my data into, say, 10 bins, draw a fixed number of samples from each bin, and then fit y ~ x, how should the slope and intercept be expected to change?
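
The weighting claim in question 1 can at least be checked numerically before arguing it theoretically. Below is a minimal sketch in plain Python, not a proof: the data values, the helper name `ols_fit`, and the repeat count `k` are all made up for illustration. It fits the line once with $(x_5, y_5)$ physically repeated `k` times and once with the original data and weight `1 + k` on that pair, then compares the residual at $x_5$ before and after.

```python
def ols_fit(xs, ys, ws=None):
    """Fit y = a*x + b by (weighted) least squares; ws=None means unit weights."""
    if ws is None:
        ws = [1.0] * len(xs)
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx          # slope
    b = ybar - a * xbar    # intercept
    return a, b

# Hypothetical data, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.4, 3.8, 6.5, 5.9]
i = 4      # zero-based index of the pair (x_5, y_5)
k = 1000   # number of extra copies of that pair (assumed value)

# Fit on the original data.
a0, b0 = ols_fit(xs, ys)

# Fit with the pair physically repeated k more times.
a_dup, b_dup = ols_fit(xs + [xs[i]] * k, ys + [ys[i]] * k)

# Fit on the original data with weight 1 + k on that pair.
ws = [1.0] * len(xs)
ws[i] = 1.0 + k
a_w, b_w = ols_fit(xs, ys, ws)

# Residual at x_5 before and after adding the copies.
resid_before = ys[i] - (a0 * xs[i] + b0)
resid_after = ys[i] - (a_dup * xs[i] + b_dup)

print("original fit:   ", a0, b0)
print("duplicated fit: ", a_dup, b_dup)
print("weighted fit:   ", a_w, b_w)
print("residual at x_5:", resid_before, "->", resid_after)
```

On this toy data the duplicated and weighted fits should agree up to floating-point error, and the residual at $x_5$ should shrink as `k` grows. Moving `i` to a point nearer $\bar x$ is one way to probe the second half of question 1.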

gbh.
  • You need to be clearer on what you mean by "resample". Are you talking about bootstrapping a finite dataset? OTOH, if you want to know the heart attack rate for people with systolic BP = 140, finding ever more patients @ 140 isn't "resampling", it's just a sample gathered at a theoretically determined value on SBP. – gung - Reinstate Monica Nov 30 '15 at 21:35
  • Thanks for the edits, really meant shoving in more pairs x5,y5. So not resampling but just putting in more data. – gbh. Nov 30 '15 at 21:42
  • Also, what do you mean by "$x5$"? Do you mean $X = 5$, or $x_5$ (the x-value for the 5th observation)? In the prior thread, I was referring to a particular value for X, but we don't sample for a particular value for Y, that is understood to be stochastic. – gung - Reinstate Monica Nov 30 '15 at 21:43
  • Putting in more data from where? Why would you bin your data? I'm beginning to wonder if there is a lot more going on here than was stated or that I had in mind when I answered before. – gung - Reinstate Monica Nov 30 '15 at 21:44
  • Bin the data to create an even sampling to avoid the clustering problem. It is just an idea. But you mention that it's not important. – gbh. Nov 30 '15 at 21:46
