
I'm currently doing an imaginary project where I'm investigating a possible correlation between the number of tests done for a specific disease and the number of deaths caused by said disease. I'm looking at several different areas of different population sizes separately, and the time period is divided into weeks, so that one observation = one week.

I know I should use Weighted Least Squares (WLS), but I have a really hard time grasping WLS itself, as well as which weights I should use for the observations.

What little advice I have gotten so far says that I could use the inverse of the dependent variable (no. of deaths) as the weight, the reasoning being that because the dependent variable is a count, a reasonable approximation is to assume it follows a Poisson distribution. In such a distribution, both the mean and the variance equal the expected number of counts (estimated by the observed counts). Therefore, by weighting with the inverse of the dependent variable, I would be weighting by the inverse of the estimated variance.
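If I have understood the advice correctly, it would look something like the sketch below in R, where dat is a data frame holding my example data further down (the names are just mine):

    # My reading of the advice: weight each observation by 1/Deaths,
    # i.e. the inverse of the estimated Poisson variance (= mean = observed count)
    w <- 1 / dat$Deaths
    fit_wls <- lm(Deaths ~ Tests, data = dat, weights = w)
    summary(fit_wls)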

I can't say I grasp the above advice completely, but even if I accept it and go with it, the number of deaths is 0 for many observations, leaving the weights undefined (NA) for those observations, since I can't compute 1/0.

I'm beginning to think I have completely misunderstood the advice I have gotten. Does anyone have an idea about how I should actually calculate the weights for my observations? And possibly also explain the reasoning behind it. I really wish to grasp the logic behind the weight calculation.

Below is an example of my data:

Week  Tests  Deaths
  30    268       1
  31    251       0
  32    278       1
  33    248       0
  34    374       1
  35    348       2
Carlsson

1 Answer


You are correct that in a standard weighted linear regression: "The weights should, ideally, be equal to the reciprocal of the variance of the measurement," provided that the observations are independent.

With your outcome variable being small counts, however, you are better off using a Poisson generalized linear model, which directly takes into account the fact that "both the mean and the variance equal the expected number of counts." If your interest is in a rate rather than in the number of counts, you include the log of the extensive variable (time, area, population, etc.) as an offset. That overcomes the problem you note in standard linear regression of how to weight 0-count observations, as well as the limitation that a standard linear model, even weighted, assumes normally distributed errors around its predictions, an assumption that count data like yours cannot meet. Such models are implemented, for example, by the glm() function in R.
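As a rough sketch of what that looks like in R (assuming a data frame dat with columns Tests and Deaths as in your example, plus an exposure column such as a population count if you want a rate; the names are placeholders):

    # Poisson GLM for the weekly death counts, with tests as a predictor
    fit <- glm(Deaths ~ Tests, family = poisson(link = "log"), data = dat)
    summary(fit)

    # If the quantity of interest is a rate per unit of exposure,
    # include the log of the extensive variable as an offset:
    # fit_rate <- glm(Deaths ~ Tests + offset(log(Population)),
    #                 family = poisson(link = "log"), data = dat)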

One further warning: these seem to be time-series data, which can violate the standard assumption about independence among observations. Even with a Poisson model you should take into account the correlations of observations across time within each area.

EdM
  • Thank you so much for your reply! It's starting to get clearer for me now. And thanks for the advice regarding the standard assumption violation! Really greatly appreciated! :) – Carlsson Jul 19 '21 at 16:23