I tried to find an answer to my question but I didn't find the right one; if you think there is already an answer please write the link.

I am using a national survey to study investment in complementary pensions in R. The original df is composed of several variables and around 50k observations from 2002 to 2014 (I kept only the families that were interviewed at least twice, i.e. the panel).

    year family comp sex study type_degree
1   2002 104002    1   2     2          NA
2   2002 107090    2   1     3           3
3   2002 111052    1   2     1          NA
4   2002 111052    3   2     2          NA
5   2002  11940    2   2     3           1
6   2002  11972    2   2     3           2
7   2002 121040    1   1     1          NA
8   2002 121040    2   2     2          NA
9   2002 136061    1   1     3           1

where comp is the member of the family (mother, father, son...), study is the education level (1 for low education, 2 for medium, 3 for a degree), and type_degree is the field of the degree (1 economics, 2 maths, 3 medicine...). type_degree is present only if study is 3 (that is, if the individual has a degree); in all other cases it is NA.

I converted the study variable to a factor with the factor command, in this way:

df$study <- factor(df$study)

In this case I had no problems, since I have a value for each observation (no NA). For the type_degree variable I did it this way:

df$type_degree <- ifelse(df$type_degree=="1",1,0)

where 1 is the value for a degree in economics (I want to study whether economics graduates behave differently from other graduates). In this case I also have NA values, because not all observations (individuals) have a degree, so I tried to manage the NAs using na.action in the regression, like this:

eq <- lm(pip ~ sex + study + area + type_degree, data=df, na.action=na.exclude)

where pip is the complementary pension type and area is the area of the country where the family lives (north, centre, south).

I converted the variables to factors, but R signals the error "contrasts can be applied only to factors with 2 or more levels".

I suppose it is caused by the fact that type_degree also has NA values, but I don't know another way to manage the NAs.
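For reference, here is a minimal sketch that reproduces this kind of error on a small made-up data frame (the name `toy` and all of its values are invented for illustration; only the structure, with type_degree missing unless study is 3, follows the description above). It suggests one way the error can arise: rows with NA are dropped before fitting, so only graduates remain and study is left with a single observed level.

```r
# Made-up toy data with the same structure as the survey:
# type_degree is NA unless study == 3.
toy <- data.frame(
  pip         = c(0, 1, 0, 1, 1, 0, 1, 0),
  study       = factor(c(1, 2, 3, 3, 1, 2, 3, 3)),
  type_degree = c(NA, NA, 1, 2, NA, NA, 1, 3)
)

# ifelse() propagates NA: non-graduates stay NA, they do not become 0.
toy$type_degree <- ifelse(toy$type_degree == 1, 1, 0)

# Rows containing any NA are dropped before the model is fitted, so only
# graduates remain and study has a single observed level.
kept <- toy[complete.cases(toy), ]
n_levels_left <- nlevels(droplevels(kept$study))

# The fit fails; try() captures the error instead of stopping the script.
fit <- try(
  lm(pip ~ study + type_degree, data = toy, na.action = na.exclude),
  silent = TRUE
)
```

Here `n_levels_left` counts the levels of study that survive the NA removal, and `fit` holds the captured error rather than a fitted model.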

Thank you in advance.

Laura R.

1 Answer

type_degree is a categorical variable that you have recoded into numbers, plus an out-of-place "NA", which is just another level of type_degree. R has its own way of marking an observation as missing: I never enter "NA" directly but let R detect missing values wherever data are actually missing. You should not use the string "NA" to stand for a value that is not available. Since it is a categorical variable, there is no advantage in converting it into numbers; you can just leave it as "economics", "maths", "medicine", etc. and "not_available".
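A small sketch of the distinction, using made-up vectors (`x_string` and `x_missing` are illustrative names, not part of the original data): the text "NA" is an ordinary value and becomes a factor level, while a real missing value is recognised by is.na() and is left out of the levels.

```r
x_string  <- c("1", "NA", "3")  # "NA" typed as text
x_missing <- c("1", NA, "3")    # a real missing value

# Only the real missing value is detected by is.na().
string_is_missing  <- is.na(x_string)
missing_is_missing <- is.na(x_missing)

# The text "NA" becomes a level of the factor; the real NA does not.
levels_string  <- levels(factor(x_string))
levels_missing <- levels(factor(x_missing))

# If the text "NA" is already in the data, it can be turned into a real NA.
x_fixed <- x_string
x_fixed[x_fixed == "NA"] <- NA
```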
Reading your question, you are interested in comparing type_degree == 'economics' to type_degree == 'something_else'. I would add another variable to the data frame, such as c_type_degree, and set it to 'economics' if that is the degree or to 'other' for all the other types. That leaves only two values in c_type_degree, which is much easier to deal with. You would then regress on c_type_degree. You can convert type_degree to a factor once you remove the offending "NA"s, in which case you will have several levels, or convert c_type_degree, in which case you get only 2 levels. Interpreting regression coefficients for a factor is straightforward, but I prefer to use dummy variables. In your case you only need 1 dummy variable to discriminate between c_type_degree == 'economics' and c_type_degree == 'other'. Using the dummy variable lets you determine whether there are differences in slopes and intercepts between the 2 groups, along with the p-values for those differences.
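A minimal sketch of this recoding on made-up data. The data frame `toy`, the numeric predictor `x`, the outcome `y`, and the dummy name `econ` are all hypothetical, introduced only to have something to fit. The sketch also assumes that individuals without a degree are counted as 'other', which is a modelling choice; they could instead be kept as their own 'not_available' level.

```r
set.seed(1)

# Made-up data: type_degree kept as text, real NA for non-graduates.
toy <- data.frame(
  type_degree = c("economics", "maths", "medicine", "economics", NA,
                  "maths", "economics", NA, "medicine", "economics"),
  x = 1:10
)
toy$y <- rnorm(10)  # hypothetical numeric outcome

# Two-valued variable: 'economics' versus everything else.
# Assumption: missing type_degree is counted as 'other'.
is_econ <- !is.na(toy$type_degree) & toy$type_degree == "economics"
toy$c_type_degree <- factor(ifelse(is_econ, "economics", "other"),
                            levels = c("other", "economics"))

# Explicit 0/1 dummy variable.
toy$econ <- as.integer(is_econ)

# econ shifts the intercept, x:econ shifts the slope for economics graduates.
fit_dummy <- lm(y ~ x * econ, data = toy)

# Using the factor directly gives the same estimates, because R builds
# the same 0/1 column from it ('other' is the reference level).
fit_factor <- lm(y ~ x * c_type_degree, data = toy)
same_estimates <- isTRUE(all.equal(unname(coef(fit_dummy)),
                                   unname(coef(fit_factor))))
```

summary(fit_dummy) then reports a p-value for the econ coefficient (difference in intercepts) and for x:econ (difference in slopes).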
The question "Ways of comparing linear regression intercepts and slopes?" explains the use of dummy variables and how to interpret them. Keep in mind that R does the same thing internally when a factor is used in a regression. Hope this helps.

LDBerriz