I have created a standard OLS regression model to estimate house price, and one group of predictors describes the age distribution of the population in a particular neighborhood, expressed as percentages (ranging from 0 to 100).

These variables are the percentage of the population in a particular neighborhood belonging to an age group. For example, a Neighborhood Age 0-14 value of 23 means that 23% of the people in that neighborhood are between 0 and 14 years old. The variables are presented below:

  • Neighborhood Age 0-14 %
  • Neighborhood Age 15-24 %
  • Neighborhood Age 25-44 %
  • Neighborhood Age 45-64 %
  • Neighborhood Age >64 %

Now I know that, since these are percentage values summing to 100, I have to remove at least one of them due to perfect linear dependence, for example: Neighborhood Age 0-14 % = 100 − SUM(all of the other age %).

I have removed the Neighborhood Age >64 % variable and estimated the coefficients. The estimated coefficients are below (house price has been log-transformed, so the approximate interpretation is $\%\Delta P \approx 100 \cdot \beta_i \cdot \Delta X_i$, where $\Delta X_i$ is a change in percentage points):

  • Intercept: 11.1917
  • Neighborhood Age 0-14 %: 0.0229
  • Neighborhood Age 15-24 %: 0.0121
  • Neighborhood Age 25-44 %: 0.0002
  • Neighborhood Age 45-64 %: 0.008

As I removed one of the variables, how would I now interpret the effect of Neighborhood Age >64 % on house price? Note that these are continuous variables ranging from 0 to 100.
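The mechanics of dropping one share can be seen in a small simulation. This is hypothetical data (not the asker's): five neighborhood age shares that sum to 100 are drawn, a log price is generated from all five with made-up coefficients, and OLS is fit with the last share omitted. Each kept coefficient then estimates the *difference* from the omitted >64 group, and the intercept absorbs the omitted group's contribution:

```python
# Hypothetical simulation: with shares summing to 100 and one share dropped,
# each kept slope estimates (beta_i - beta_omitted), not beta_i itself.
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Five age-group shares per neighborhood that sum to 100 (Dirichlet * 100)
shares = rng.dirichlet(np.ones(5), size=n) * 100

# Made-up "true" coefficients on all five shares (only differences are identifiable)
true_beta = np.array([0.03, 0.02, 0.01, 0.015, 0.007])
log_price = 11.0 + shares @ true_beta + rng.normal(0.0, 0.1, n)

def fit_dropping(j):
    """OLS of log price on an intercept plus all shares except column j."""
    X = np.column_stack([np.ones(n), np.delete(shares, j, axis=1)])
    coef, *_ = np.linalg.lstsq(X, log_price, rcond=None)
    return coef

coef_drop_last = fit_dropping(4)  # omit the >64 share, as in the question
# The first slope recovers roughly 0.03 - 0.007 = 0.023, and the intercept
# roughly 11.0 + 100 * 0.007 = 11.7, because the omitted group is folded in.
print(coef_drop_last)
```

This mirrors the pattern in the question's estimates: the reported intercept and slopes bundle the omitted group's effect, so none of them is the "pure" effect of its own age group.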

  • It looks like you did not report the full results: what happened to the intercept? – whuber May 20 '19 at 15:38
  • My bad, edited the original question. There were a lot of variables and I tried to keep the confusion to a minimum. – Jurgis Samaitis May 20 '19 at 16:18
  • Thank you--that clarifies your question. You might be able to find the answer yourself by comparing these estimates to the original set of estimates. In particular (depending on how your software codes these categorical variables), the intercept should equal the original estimate for the >64% category. – whuber May 20 '19 at 16:23
  • Thank you for your answer - I understand the interpretation regarding the categorical variables, but these are continuous variables, ranging 0 - 100. Is the interpretation similar to one with categorical variables? – Jurgis Samaitis May 20 '19 at 16:26
  • Could you please be more specific about how you have created these variables? Your question currently reads as if they are *categories* of percentages rather than the percentages themselves. – whuber May 20 '19 at 16:31
  • Sorry for the confusion. These variables are the percentage of the population in a particular neighborhood, belonging to an age group. For example Neighborhood Age 0-14 value of 23 would mean that there is 23% of people in a neighborhood, who are between 0-14 years old. – Jurgis Samaitis May 20 '19 at 16:34
  • I see. I have been confused by an answer that is completely off the mark (which is no fault of yours). What you have is *compositional* data: proportions of a whole. But since the last variable, as you note, is completely determined by the first four variables, it makes no sense to ask what its effect is, because you cannot attribute separate effects to *any* of those variables. That's the basic problem of collinearity: it is impossible to keep all the other variables constant while changing just one of them. – whuber May 20 '19 at 16:40
  • I see, thank you very much for your answer - it makes much sense. I also haven't heard the term compositional data before, I will take a deeper look into it! Could you post your comment as an answer so I could accept it? Thanks again! – Jurgis Samaitis May 20 '19 at 17:10
  • I suspect you might be able to find answers to your question in existing threads by searching on "composition." Here's the first I found: https://stats.stackexchange.com/questions/68944/. – whuber May 20 '19 at 17:19
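whuber's collinearity point in the comments above can be verified directly: a design matrix containing an intercept plus all five shares is rank-deficient, because the shares sum to 100 times the intercept column. A minimal check, using simulated shares (hypothetical data):

```python
# Perfect linear dependence in compositional data: intercept + all 5 shares
# gives a rank-deficient design matrix, so OLS cannot separate the effects.
import numpy as np

rng = np.random.default_rng(1)
n = 200
shares = rng.dirichlet(np.ones(5), size=n) * 100   # rows sum to 100
X_full = np.column_stack([np.ones(n), shares])     # 6 columns

# shares.sum(axis=1) == 100 == 100 * intercept column, one exact dependence:
print(np.linalg.matrix_rank(X_full))  # 5, not 6
```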

1 Answer


The house price effect of, let's say, Neighborhood Age 45-64 % is 0.008 more than that of Neighborhood Age >64 %. Take note that when adding dummy variables in a linear model:

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 D$, where $D$ is a dummy variable, can be rewritten as:

$y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3 $ when D = 1

$y = \beta_0 + \beta_1X_1 + \beta_2X_2 $ when D = 0

So the difference in $y$ between the cases $D = 1$ and $D = 0$, holding the other predictors fixed, is simply the estimated $\beta_3$.
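The algebra above can be checked numerically with made-up coefficient values (all numbers here are hypothetical, purely for illustration):

```python
# Numeric illustration: the difference between D=1 and D=0, holding the
# other predictors fixed, is exactly beta_3.
b0, b1, b2, b3 = 2.0, 0.5, -0.3, 1.2   # hypothetical coefficients
x1, x2 = 4.0, 7.0                      # arbitrary fixed values of X1, X2

y_d1 = b0 + b1 * x1 + b2 * x2 + b3 * 1
y_d0 = b0 + b1 * x1 + b2 * x2 + b3 * 0
print(y_d1 - y_d0)  # equals b3 (up to float rounding)
```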

Kane Chua
    Thank you for the answer! The problem I'm having is that these are continuous variables, therefore D in your example would range from 0 to 100. Is the interpretation similar in that case? – Jurgis Samaitis May 20 '19 at 15:12