Interpreting arithmetic mean in regression

Question

Problem

Suppose I have two variables: (1) the heat index for each county in a state, $h_{it}$, and (2) the acreage of each county, $acres_{it}$. The data span 10 years and also include the amount of ice cream melted, $y_{it}$, for each county and year in the sample.

I'm told that a strong predictor of ice cream melt can be found by weighting the heat index by the size of the county and then aggregating to state-level data, such that:

$$\frac{\sum_{i} h_{it} \cdot acres_{it}}{\sum_{i} acres_{it}} = hw_{st}$$

A simple linear regression can then predict the state-level ice cream melt by:

Regression 1:

$$\log(y_{st}) = \beta_{1}h_{st} + \epsilon_{it}$$

Call:
lm(formula = log(y) ~ h, data = datt)

Coefficients:
(Intercept)            h  
    2.64010     -0.01072

Regression 2 (using weighted variable):

$$\log(y_{st}) = \beta_{1}hw_{st} + \epsilon_{it}$$

Call:
lm(formula = log(y) ~ hw, data = datt)

Coefficients:
(Intercept)           hw  
    2.39100      0.04908

Question

Is the interpretation of these two regressions different? My interpretation of regression 1 is that a one-unit increase in $h$ changes $y$ by roughly $100 \cdot \beta_{1}$ percent.

But what about the second regression? Is there a different way to interpret the regression coefficient because it is weighted?
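
For reference, the usual reading of a log-linear fit can be written out explicitly. This is a sketch under assumptions: $x$ stands for whichever regressor is used ($h$ or $hw$), $\Delta x$ is a hypothetical change in it, and $y'$ is the value of $y$ after that change; none of this notation is in the original post.

$$\log(y') - \log(y) = \beta_{1}\,\Delta x \quad\Longrightarrow\quad \frac{y'}{y} = e^{\beta_{1}\Delta x} \approx 1 + \beta_{1}\Delta x \ \text{ when } \beta_{1}\Delta x \text{ is small}$$

With the reported estimates and $\Delta x = 1$, regression 1 gives $e^{-0.01072} - 1 \approx -0.0107$ (about a 1.07% decrease in $y$ per unit of $h$) and regression 2 gives $e^{0.04908} - 1 \approx 0.0503$ (about a 5.03% increase in $y$ per unit of $hw$). The algebra is the same in both cases; what differs is what a one-unit change in the regressor represents.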

Sample R Code:

library(dplyr)

# Sample Data
datt <- structure(list(year = c(2000L, 2001L, 2002L, 2000L, 2001L, 2002L, 
2000L, 2001L, 2002L, 2000L, 2001L, 2002L), county = c(1L, 1L, 
1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), state = c("CA", "CA", 
"CA", "CA", "CA", "CA", "CO", "CO", "CO", "CO", "CO", "CO"), 
    y = c(5L, 10L, 7L, 4L, 2L, 8L, 9L, 11L, 2L, 5L, 6L, 8L), 
    h = c(5L, 7L, 1L, 9L, 6L, 4L, 8L, 2L, 5L, 8L, 7L, 1L), acres = c(10L, 
    25L, 40L, 8L, 13L, 42L, 50L, 24L, 57L, 24L, 35L, 15L)), .Names = c("year", 
"county", "state", "y", "h", "acres"), class = "data.frame", row.names = c(NA, 
-12L))


# Build Weighted Variable: each county's share of total acres within a year
datt <- datt %>%
  group_by(year) %>%
  mutate(w = acres / sum(acres, na.rm = TRUE))

# Apply Weight
datt$hw <- datt$h * datt$w

# Aggregate to State-level (one row per year and state)
datt <- datt %>%
  group_by(year, state) %>%
  summarise(hw = sum(hw, na.rm = TRUE),
            h = sum(h),
            y = sum(y))

# Regression 1
lm(log(y) ~ h, data = datt)

# Regression 2
lm(log(y) ~ hw, data = datt)
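
If the intent is the formula given earlier, with the sums running over the counties of a single state, the weights need to be normalised within each state-year pair. The snippet above normalises within `year` only, so its weights sum to one across all counties in a year. A minimal base-R sketch of the within-state version, on the same toy data, might look like the following. The names `groups`, `state_dat` and `fit_w` are illustrative and not part of the original code.

```r
# Toy data from the question (4 counties, 2 states, 3 years)
datt <- data.frame(
  year   = rep(2000:2002, times = 4),
  county = rep(1:4, each = 3),
  state  = rep(c("CA", "CO"), each = 6),
  y      = c(5, 10, 7, 4, 2, 8, 9, 11, 2, 5, 6, 8),
  h      = c(5, 7, 1, 9, 6, 4, 8, 2, 5, 8, 7, 1),
  acres  = c(10, 25, 40, 8, 13, 42, 50, 24, 57, 24, 35, 15)
)

# One data frame per (year, state) pair
groups <- split(datt, list(datt$year, datt$state), drop = TRUE)

# Acre-weighted mean heat index within each state-year:
# sum(h * acres) / sum(acres), which is what weighted.mean() computes
state_dat <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    year  = g$year[1],
    state = g$state[1],
    hw    = weighted.mean(g$h, g$acres),
    y     = sum(g$y)
  )
}))

# Regression 2 on the re-aggregated data
# (only 6 rows here, so the fit is purely illustrative)
fit_w <- lm(log(y) ~ hw, data = state_dat)
coef(fit_w)
```

With this aggregation `hw` stays on the same scale as `h`, so a one-unit change in `hw` is a one-unit change in the acre-weighted average heat index of the state.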

Related Question: Weighting variable based on another variable

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

I'm told that a strong predictor of ice cream melt...

In neither of your two questions do you specify what exactly "ice cream melt" is: is it the rate at which ice cream melts (say, if you hold a cone), or is it the total volume of ice cream that melts across the county?

The answer which you accepted to your other question, however, states

because the more acres there are in a county, the more ice cream will melt in that county.

so I assume the interpretation is the latter - it is the total volume of melted ice cream.

In this case, it seems to me that the best option is:

1. Find a linear fit $\beta$ between $\log(y_{it})$ and $h_{it} acres_{it}$ (note that the fit is for all county instances, without aggregation).
2. Your estimate for the melt of some group of counties $I$ (which, in particular, might be the counties composing a state) should be $\beta \sum_{it \in I}[h_{it} acres_{it}]$.
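
The two steps above can be sketched in base R on the toy data from the question. This is only an illustration under assumptions: the column `h_acres` and the objects `fit`, `beta` and `agg` are made up here, and the last step applies the expression $\beta \sum [h_{it} acres_{it}]$ literally, without taking a position on the intercept or on transforming back from the log scale.

```r
# Toy data from the question (4 counties, 2 states, 3 years)
datt <- data.frame(
  year   = rep(2000:2002, times = 4),
  county = rep(1:4, each = 3),
  state  = rep(c("CA", "CO"), each = 6),
  y      = c(5, 10, 7, 4, 2, 8, 9, 11, 2, 5, 6, 8),
  h      = c(5, 7, 1, 9, 6, 4, 8, 2, 5, 8, 7, 1),
  acres  = c(10, 25, 40, 8, 13, 42, 50, 24, 57, 24, 35, 15)
)

# Step 1: county-level regressor h * acres, fitted without any aggregation
datt$h_acres <- datt$h * datt$acres
fit  <- lm(log(y) ~ h_acres, data = datt)
beta <- unname(coef(fit)["h_acres"])

# Step 2: sum h * acres over the counties of each state-year ...
agg <- aggregate(h_acres ~ year + state, data = datt, FUN = sum)

# ... and scale by the fitted slope, as in beta * sum(h * acres)
agg$pred <- beta * agg$h_acres
agg
```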

You present two other alternatives:

1. Find a linear fit $\beta$ between $\log(y_{it})$ and $h_{it}$ (at least it seems so from your R code - the mathematical description mixes $st$ and $it$ indices, and so is unclear). This makes sense for melt rate, not melt volume, which doesn't seem to coincide with your other question.
2. Find a linear fit $\beta$ between $\log(y_{st})$ and $hw_{st}$. To me, this makes no sense under either interpretation. If the interpretation is volume, you shouldn't divide the per-state aggregate by the acres of the state. If the interpretation is rate, acres seem irrelevant. The particular aggregation you do here also gives equal weights to states irrespective of the number of measurements performed within the state. Again, I don't see the logic in this.

I tried to simplify my larger problem, but it appears I may not have been successful. To clarify, I'm speaking, in hypothetical terms, about production in volume. If it is warmer, then production will decrease because of the damage done by melting; weighting by acreage would imply that larger acreage has lower production, thus accounting for changes in production. Sorry if this is not clear, but this approach is used in the literature, for example in population dynamics. - Amstell, Oct 15 '16 at 16:01
@Amstell Good to clear that up. I'll update the answer later on based on that. - Ami Tavory, Oct 15 '16 at 16:03
