
I ran a regression analysis predicting salary from gender. In the data, female was coded as 2 and male as 1. Then I was asked to recode females as -1 and males as 5. In the new analysis a, b, t, and SEb changed.

Why? What is the reasoning behind the coding system here?

kjetil b halvorsen
SLeca
  • We need more information. Why were you asked to change the coding? Is this part of an assignment? It does not make sense to code a dummy variable like this. – T.E.G. Feb 12 '17 at 03:12
  • Thank you for responding. It is part of an assignment and I think the whole point is explaining why some values changed when the coding changed. I am assuming it has to do with the numbers used for coding. I have read that it is common to use Male= 0 and Female= 1 (where male is the reference group). – SLeca Feb 12 '17 at 03:25
  • Yes, that is the common practice. If the sole aim of the assignment is to show the reason why we use 0 and 1, then I think the question is now redundant. – T.E.G. Feb 12 '17 at 03:29
  • This question might be of interest: http://stats.stackexchange.com/questions/16689/why-is-gender-typically-coded-0-1-rather-than-1-2-for-example – T.E.G. Feb 12 '17 at 03:30

1 Answer


It is hard to see without further information why one would like to code a binary variable as $(-1,5)$, but it is fairly easy to see how the coefficients change with a simple experiment:

Let's create a random data.frame in R with 100 observations, where salary has a mean of 60K and a standard deviation of 15K:

    set.seed(10)
    df <- data.frame(salary = rnorm(100, mean = 60000, sd = 15000), gender = rbinom(100, 1, 0.42))
    df$gender5 <- ifelse(df$gender == 0, -1, 5)

Now gender is coded $(0,1)$ and gender5 is coded $(-1,5)$. Let's regress salary on gender with the original encoding and with the new one:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    61291       1994  30.736   <2e-16 ***
gender         -6421       2765  -2.322   0.0223 *  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  60220.7     1692.1  35.589   <2e-16 ***
gender5      -1070.2      460.9  -2.322   0.0223 * 
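
The calls that produce the two tables are not shown. A minimal sketch of what they presumably looked like, assuming plain `lm()` fits on the simulated data (the object names `fit01` and `fit15` are introduced here for illustration; exact numbers may differ across R versions):

```r
# Recreate the simulated data from the answer.
set.seed(10)
df <- data.frame(
  salary = rnorm(100, mean = 60000, sd = 15000),
  gender = rbinom(100, 1, 0.42)
)
df$gender5 <- ifelse(df$gender == 0, -1, 5)

# Fit the same simple regression under each coding.
# fit01 and fit15 are illustrative names, not from the original answer.
fit01 <- lm(salary ~ gender, data = df)   # (0, 1) coding
fit15 <- lm(salary ~ gender5, data = df)  # (-1, 5) coding

# Coefficient tables: estimate, standard error, t value, p value.
summary(fit01)$coefficients
summary(fit15)$coefficients
```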

So:

coefficient: With a $(0,1)$ coding the coefficient has a simple meaning: the average difference in salary between the two categories. This coding is intuitive, which is why it is used so often, and it is easy to read off the regression equation: $\hat{salary}=61,291-6,421\times gender$. If males are coded $1$ and females $0$, then the predicted average salary for males is $61,291-6,421\times 1=54,870$, or simply $6,421$ less than for females.

When the coding changes, so does the meaning of the coefficient. Instead of a gap of $1$ between the two codes, we now have a gap of $6$, so the coefficient is the difference in salary per unit of the code, which is one sixth of the difference between the groups. To predict the salary for men, we compute $\hat{salary}=60,220.7-1,070.2\times 5 = 54,870$, exactly the same as before (up to rounding error). Multiplying the slope coefficient by $6$, i.e., $-1,070.2\times 6=-6,421$, brings us back to the slope coefficient under the first coding scheme $(0,1)$. The fit is the same; it is just much less intuitive to interpret.
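
The same conversion can be written out for any linear recoding. As a sketch, let $x$ be the $(0,1)$ code and $x' = c + d\,x$ the recoded variable, with $c=-1$ and $d=6$ here (the symbols $c$, $d$, $a'$ and $b'$ are notation introduced for this illustration). Because $x'$ is an invertible linear function of $x$, least squares gives the same fitted values under both codings, so

$$a + b\,x = a' + b'\,(c + d\,x) \quad\Longrightarrow\quad b' = \frac{b}{d}, \qquad a' = a - \frac{b\,c}{d}.$$

Plugging in the printed estimates gives $b' = -6,421/6 \approx -1,070.2$ and $a' = 61,291 - \frac{(-6,421)(-1)}{6} \approx 60,220.8$, which matches the second table up to rounding. This also shows why the intercept changes: $a'$ is the predicted salary at $x'=0$, which is no longer one of the two groups.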

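Standard Error: Same story. The standard error of the slope depends on the scale of the predictor: stretching the gap between the two codes from $1$ to $6$ divides it by $6$, just like the coefficient. So $2765/6\approx 460.9$ (up to rounding in the printed output).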

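T and significance value: These should not change in magnitude. The t value is the coefficient divided by its standard error, and both are divided by the same factor, so re-coding the variable changes the coefficient but not its significance. The only thing that can change is the sign of t, which flips when the recoding reverses the order of the two categories, as it does in the question (females go from the higher code, $2$, to the lower one, $-1$). If the magnitude of t or the p-value changed, there is probably a problem somewhere.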
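
These relationships can be checked directly on the simulated data. A rough sketch, assuming the same `lm()` fits as above; the names `tab01`, `tab15`, `tab_rev` and the reversed coding `gender_rev` are hypothetical and only serve the illustration:

```r
# Recreate the simulated data from the answer.
set.seed(10)
df <- data.frame(
  salary = rnorm(100, mean = 60000, sd = 15000),
  gender = rbinom(100, 1, 0.42)
)
df$gender5 <- ifelse(df$gender == 0, -1, 5)

# Coefficient tables under each coding (illustrative object names).
tab01 <- summary(lm(salary ~ gender, data = df))$coefficients
tab15 <- summary(lm(salary ~ gender5, data = df))$coefficients

# The slope and its standard error are divided by the same factor of 6 ...
tab01["gender", "Estimate"] / tab15["gender5", "Estimate"]
tab01["gender", "Std. Error"] / tab15["gender5", "Std. Error"]

# ... so their ratio (the t value) and the p value are unchanged.
all.equal(tab01["gender", "t value"], tab15["gender5", "t value"])
all.equal(tab01["gender", "Pr(>|t|)"], tab15["gender5", "Pr(>|t|)"])

# Hypothetical reversed coding (0 -> 5, 1 -> -1): the order of the
# categories flips, so the slope and t value change sign only.
df$gender_rev <- ifelse(df$gender == 0, 5, -1)
tab_rev <- summary(lm(salary ~ gender_rev, data = df))$coefficients
tab_rev["gender_rev", "t value"]
```

The reversed coding mirrors the situation in the question, where females move from the higher code to the lower one.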

Thomas Bilach
Yuval Spiegler
  • @SLeca you are very welcome. If you feel this is satisfactory, feel free to accept this answer which will mark the question as resolved :) – Yuval Spiegler Feb 17 '17 at 13:38