Newbie here. I'm experimenting with the following dataset:
https://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation
Data Set Information: The data consist of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-sized categories ("low", "medium", and "high") to form the class variable.
Attribute Information:
- Whether or not the TA is a native English speaker (binary); 1=English speaker, 2=non-English speaker
- Course instructor (categorical, 25 categories)
- Course (categorical, 26 categories)
- Summer or regular semester (binary) 1=Summer, 2=Regular
- Class size (numerical)
- Class attribute (categorical) 1=Low, 2=Medium, 3=High
The data looks like this:
1,23,3,1,19,3
2,15,3,1,17,3
1,23,3,2,49,3
1,5,2,2,33,3
2,7,11,2,55,3
2,23,3,1,20,3
2,9,5,2,19,3
...
I want to cross 2 features (engNativ and course):
dataset_bin['courseHasNativeTA'] = dataset_con['courseHasNativeTA'] = dataset_con['engNativ'] + dataset_con['course']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,10))
sns.countplot(y="courseHasNativeTA", data=dataset_bin);
The problem is that the result seems to make no sense: the courses are supposed to be numbered from 1 to 26, yet the crossed feature goes from 2 to 28. I suspect the problem comes from the fact that engNativ and course are treated as numerical features instead of categorical ones, so the + adds the two values instead of combining them.
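To illustrate what I suspect is happening, here is a minimal sketch. It assumes both columns are loaded as integers; the small frame df is hypothetical, built from the sample rows above, and only the column names engNativ and course come from my snippet. With integers, the smallest possible sum is 1 + 1 = 2 and the largest is 2 + 26 = 28, which matches the range I see in the plot. Casting to string first is one possible way to get one label per pair, not necessarily the best one.

```python
import pandas as pd

# Hypothetical mini-frame built from the sample rows shown earlier.
# Only the column names engNativ and course come from the original snippet.
df = pd.DataFrame({
    "engNativ": [1, 2, 1, 1, 2, 2, 2],
    "course":   [3, 3, 3, 2, 11, 3, 5],
})

# With integer columns, + is arithmetic addition, so different pairs can
# collapse into the same value: (1, 3) and (2, 2) would both give 4.
summed = df["engNativ"] + df["course"]

# Casting to string first makes + concatenate, so each (engNativ, course)
# pair keeps its own label. The "_" separator is an arbitrary choice.
crossed = df["engNativ"].astype(str) + "_" + df["course"].astype(str)

print(summed.tolist())
print(crossed.tolist())
```

The crossed column could then be passed to sns.countplot in place of the summed one.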
I read this related question, but I'm not quite sure how to apply it to my problem.
Any insight on this? Thanks.