I am working on a word analysis project in which I track the relative frequency of a certain word over time in a corpus of film reviews. The corpus changes in size over time, so someone suggested that I run a weighted regression to take this into account, since the variance of the word's occurrence will be higher in years when the corpus is smaller. I had thought that a weighted regression was just a normal regression with a weight attached to each observation (so the relative frequency of word y in year x is weighted by the size of the corpus in that year). I looked it up online, and a weighted regression turns out to be a different beast entirely: I need a standard deviation of Y for each year. Yet in this project I have only one observation per year: the total occurrences of the word divided by the number of words in that year. What is there that can vary within a year? Am I misunderstanding how a weighted regression works? Or is a weighted regression simply not suitable for my project?

Hope this is clear. This is really driving me nuts.

Pete C

1 Answer

If I am not missing anything, you are fitting a linear model for some variable $Y$, and one of your predictor variables is the frequency of a certain word in a corpus of film reviews. Note that in a linear model $Y=\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon,$ the choice of a good estimator for the parameters depends on the form of the variance-covariance matrix $V(\epsilon \vert X).$ Usually one can assume homoscedasticity, that is, $V(\epsilon \vert X)=\sigma ^2 I_n.$ In other cases this assumption fails and $V(\epsilon \vert X) \neq \sigma ^2 I_n.$ That is when you use weighted regression: it accounts for the unequal variances so that you get a good (unbiased and efficient) estimator. So whether to perform a weighted regression depends on whether you can assume homoscedasticity in your model, not on the form of your variables. There are many ways to test for homoscedasticity (see for instance the Breusch-Pagan test, https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test).
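To make the test concrete, here is a minimal sketch of the studentized (Koenker) variant of the Breusch-Pagan test for a single predictor, written in plain Python. The helper names (`ols_fit`, `r_squared`, `breusch_pagan`) and the data are made up for illustration and are not taken from your project. With one predictor the statistic is $n R^2$ from regressing the squared residuals on $X$, compared against a chi-square distribution with 1 degree of freedom.

```python
import math


def ols_fit(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def r_squared(x, y):
    """R^2 of the least squares fit of y on x (0.0 if y is constant)."""
    b0, b1 = ols_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if ss_tot == 0:
        return 0.0
    ss_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test for one predictor.

    Returns (LM statistic, p-value); a small p-value suggests
    heteroskedasticity.
    """
    b0, b1 = ols_fit(x, y)
    squared_resid = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of the squared residuals on x: LM = n * R^2.
    lm = max(len(x) * r_squared(x, squared_resid), 0.0)
    # Under homoscedasticity LM is approximately chi-square with 1 degree
    # of freedom, whose upper tail probability is erfc(sqrt(LM / 2)).
    return lm, math.erfc(math.sqrt(lm / 2.0))


# Made-up data: the noise alternates in sign and grows with x.
x = list(range(1, 21))
y = [2.0 + 0.5 * xi + (-1) ** xi * 0.1 * xi for xi in x]
lm, p_value = breusch_pagan(x, y)
print(f"LM = {lm:.2f}, p-value = {p_value:.4f}")
```

In practice you would probably call a library routine instead (statsmodels ships one as `het_breuschpagan`), but the sketch shows what the test actually computes.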

So, to summarize: take your initial regression and test for homoscedasticity to find out whether you should perform a weighted regression. Also try the absolute frequency (which takes the size of the corpus into account); you may find homoscedasticity there, in which case a standard linear regression will do.

Finally, one reassuring fact: if you get this wrong and use a standard regression (the least squares estimator) when there is heteroskedasticity, the estimate is not the best one (weighted regression is), but it is not that bad either: the estimator is still consistent, that is, it converges in probability to the true parameters.
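If you do end up needing weights with only one observation per year, here is a rough sketch of how that could look. It rests on an assumption of mine, not on anything your data confirms: if each year's count is treated as roughly binomial, the variance of the relative frequency in year $t$ is about $p_t(1-p_t)/n_t$ for a corpus of $n_t$ words, so weighting each year by $n_t$ is a natural choice. The function name `weighted_fit` and all the numbers below are hypothetical.

```python
def weighted_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x.

    Minimizes sum(w_i * (y_i - b0 - b1 * x_i) ** 2); returns (b0, b1).
    """
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical yearly data: every number here is made up for illustration.
years = [2000, 2001, 2002, 2003, 2004, 2005]
corpus_size = [20_000, 50_000, 150_000, 400_000, 900_000, 1_500_000]
word_count = [9, 14, 52, 131, 310, 540]

rel_freq = [c / n for c, n in zip(word_count, corpus_size)]
t = [yr - years[0] for yr in years]  # shift time so that it starts at 0

# Equal weights reproduce the ordinary least squares fit.
ols_b0, ols_b1 = weighted_fit(t, rel_freq, [1.0] * len(t))
# Weights proportional to corpus size (the binomial assumption above).
wls_b0, wls_b1 = weighted_fit(t, rel_freq, corpus_size)

print(f"OLS slope per year: {ols_b1:.3e}")
print(f"WLS slope per year: {wls_b1:.3e}")
```

Both fits estimate the same trend; the weighted one simply lets the years with large corpora, whose relative frequencies are measured more precisely, count for more.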

  • Thank you. I do seem to have homoscedasticity according to the B-P test, which surprised me considering the unevenness of the corpus. – Pete C Apr 12 '21 at 19:44