Including a weighting variable in a linear regression

Question

I'm looking at how temperature affects length. My length variable is the mean length calculated for each year, derived from ~10,000 data points in total. Sampling effort was not the same in every year (e.g. 1998 n=300 vs 2001 n=2078).

A colleague suggested that since my 20 length data points are in fact derived from ~10,000 data points and each year had a different sample size I should consider using sample size as a weighting variable. I am not sure how to best implement this. I came across this post, which made sense and listed exactly my case "analyzing data in an aggregated form, such as the weight variable encodes how many original observations each row in the aggregated data represents". However, I am a bit confused, as this goes on to use frequency as a weighting variable.

I am uncertain how to calculate a weighting variable for my data. And once it's calculated, can I use it in the weights argument of lm()?

Or does it make more sense to calculate a weighted arithmetic mean for each year as in this post and use that in place of the mean length I used before?

score 1 · Accepted Answer · answered Jan 12 '21 at 19:23

From the documentation for lm():

Non-NULL weights can be used to indicate that different observations have different 
variances (with the values in weights being inversely proportional to the variances); 
or equivalently, when the elements of weights are positive integers w_i, that each 
response y_i is the mean of w_i unit-weight observations (including the case that 
there are w_i observations equal to y_i and the data have been summarized).
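To see why the sample size itself can serve as the weight, here is a short derivation. It rests on assumptions not stated in the question: the individual lengths within year $i$ are independent and share a common variance $\sigma^2$ across years, and $n_i$ denotes the number of observations behind the yearly mean $\bar{y}_i$.

$$
\operatorname{Var}(\bar{y}_i) = \frac{\sigma^2}{n_i}
\quad\Longrightarrow\quad
w_i \propto \frac{1}{\operatorname{Var}(\bar{y}_i)} = \frac{n_i}{\sigma^2} \propto n_i
$$

Under those assumptions, weighted least squares on the yearly means minimizes $\sum_i n_i \,(\bar{y}_i - \beta_0 - \beta_1 x_i)^2$, so years with more observations pull the fitted line harder. Only the relative sizes of the weights matter for the coefficients, so the unknown $\sigma^2$ never has to be supplied.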

Suppose your $y_1$ is the mean of 1000 observations, your $y_2$ is the mean of 600 observations, and your $y_3$ is the mean of 400 observations. You would include the sample sizes like this:

lm(y ~ x, weights = c(1000, 600, 400))
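For the aggregated-data case in the question, a minimal sketch might look like the following. Everything in it is an assumption for illustration: the data are simulated, only three years are used instead of 20, and the names `raw`, `yearly`, `mean_length`, `temperature`, and `n` are hypothetical. It checks that fitting the yearly means with `weights = n` gives the same coefficients as fitting the individual observations, provided temperature is constant within a year.

```r
# Simulated individual-level data (hypothetical values, for illustration only)
set.seed(1)
n_per_year   <- c(300, 2078, 950)     # unequal sampling effort per year
temp_by_year <- c(11.2, 12.5, 13.1)   # one temperature value per year

raw <- data.frame(
  year        = rep(seq_along(n_per_year), times = n_per_year),
  temperature = rep(temp_by_year, times = n_per_year)
)
raw$length <- 20 + 1.5 * raw$temperature + rnorm(nrow(raw), sd = 3)

# Aggregate to one row per year: mean length plus the sample size behind it
yearly <- data.frame(
  temperature = temp_by_year,
  mean_length = as.vector(tapply(raw$length, raw$year, mean)),
  n           = n_per_year
)

# Weighted regression on the yearly means, weighting by sample size
fit_weighted <- lm(mean_length ~ temperature, data = yearly, weights = n)

# Unweighted regression on the individual observations
fit_raw <- lm(length ~ temperature, data = raw)

# The coefficients are expected to agree up to floating-point tolerance
all.equal(coef(fit_weighted), coef(fit_raw))
```

The standard errors of the two fits will generally differ, because the aggregated fit has far fewer residual degrees of freedom and no longer sees the within-year scatter.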
