
My data has 50 features, for example: 1. student_num 2. sub1 3. sub2 4. sub3 ... 50. hasFailed

There's one row for each student. I have 12000 students, so 12000 records. My goal is to reduce dimensionality. I need to predict whether a student will fail.

Should I remove student_num before PCA?

Emma
  • 1
    Can you say more about your situation, your data, & your goals? Why are you doing PCA? Is `student_num` just a unique identifier for each individual student (1st student, 2nd student, etc)? Why would that be in your dataset, do you have multiple rows for some (every) student? – gung - Reinstate Monica Jan 31 '19 at 19:15
  • @gung: Added more details – Emma Jan 31 '19 at 19:22
  • 6
    If the student ID is just a convenient identifier, then it's irrelevant. For example, students' phone numbers have no relationship to academic achievement, so likewise you wouldn't include the phone number as data. – Sycorax Jan 31 '19 at 19:48
  • Yes. It isn't part of the phenomena which you are studying. – Demetri Pananos Jan 31 '19 at 19:43
  • 1
    A more interesting question is why you think you need to do a PCA before doing your logistic regression. – mdewey Feb 01 '19 at 13:32
  • 2
    @Sycorax The student ID is not necessarily irrelevant: see Lord's paper on football numbers. I have encountered such situations. In one, soil samples were identified with sequential integers--and those identifiers were extremely useful for modeling the results, because for practical reasons the samplers did not move around randomly and thus the IDs were a good proxy for spatial proximity. In the student case, ID could be related to when the student enrolled and that could be related to all kinds of meaningful characteristics. – whuber Feb 01 '19 at 15:46
  • 2
    @whuber Fair point. I leapt to the conclusion that student IDs are independent from the data, and therefore irrelevant, which may not hold true in some circumstances. – Sycorax Feb 01 '19 at 16:02
  • 1
    @Sycorax The moral is that it's often a good idea to throw in every variable you have when first exploring the data, no matter what those variables represent. Sometimes amusing and useful relationships show up! – whuber Feb 01 '19 at 16:03
  • But only if you carefully study the results, and not blindly use them. If you find such patterns, it's better to put them into appropriate features, such as enrollment year, instead... For a beginner, removing identifiers that just happen to be numeric is the better approach. – Has QUIT--Anony-Mousse Feb 03 '19 at 13:17
  • Apparent relationships could be spurious and the example that @whuber presents is interesting but unusual. – Michael R. Chernick Feb 11 '19 at 03:51
  • @Michael I won't disagree, but only want to wonder aloud whether such relationships are unusual because few people think to look for them :-). – whuber Feb 11 '19 at 13:02

1 Answer


PCA should only be used on interval type variables.

These are variables for which differences and division make sense, because PCA performs such operations.

So, given the student ID example: does it make sense to compute the differences of student IDs? Does it make sense to argue that the student IDs of Bob and Sally differ by twice as much as those of Adam and Eve? If not, then you probably shouldn't use PCA, nor standardization.
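A rough way to see the problem, as a sketch with made-up data: unscaled PCA looks for directions of largest variance, so a sequential ID running from 1 to 12000 would swamp any plausible score columns. The column names `student_num` and `sub1` to `sub3` come from the question; the sequential IDs, the 0 to 100 score range, and the random values are assumptions for illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

n_students = 12000

# Identifier column from the question, assumed here to be sequential integers.
student_num = list(range(1, n_students + 1))

# Hypothetical subject scores in 0..100 (the real data may look different).
subjects = {
    name: [random.uniform(0, 100) for _ in range(n_students)]
    for name in ("sub1", "sub2", "sub3")
}

# Per-column variance: unscaled PCA is driven by these magnitudes.
variances = {"student_num": statistics.pvariance(student_num)}
for name, values in subjects.items():
    variances[name] = statistics.pvariance(values)

total = sum(variances.values())
id_share = variances["student_num"] / total

for name, var in variances.items():
    print(f"{name:12s} variance = {var:14.1f}")
print(f"share of total variance from student_num: {id_share:.4f}")
```

Under these assumptions nearly all of the variance sits in the ID column, so the first principal component of the raw data would mostly reproduce the identifier rather than anything about the subjects.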

It doesn't mean such variables are useless. They may just need to be handled differently. For example, phone numbers (used to) have a meaningful area code. Treating phone numbers with PCA is stupid, but treating the area code as a category is okay. It's even better when you can map them to actual cities...
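As a minimal sketch of the area-code idea: pull the area code out of each phone number and encode it as a category (one indicator column per code) instead of feeding the raw number to PCA. The phone numbers and the `area_code` helper are hypothetical, and the "first three digits" rule assumes a `AAA-EEE-NNNN` style format.

```python
# Hypothetical phone numbers, invented for this sketch.
phone_numbers = ["212-555-0101", "415-555-0102", "212-555-0103", "617-555-0104"]


def area_code(phone: str) -> str:
    """Return the leading three digits, assumed here to be the area code."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[:3]


codes = [area_code(p) for p in phone_numbers]

# The distinct area codes become the category levels.
categories = sorted(set(codes))

# One-hot encoding: one 0/1 indicator per category, exactly one set per row.
one_hot = [[1 if code == cat else 0 for cat in categories] for code in codes]

for phone, row in zip(phone_numbers, one_hot):
    print(phone, dict(zip(categories, row)))
```

Mapping each code to a city, as suggested above, would just be a further lookup on `codes` before encoding.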

Has QUIT--Anony-Mousse
  • It seems difficult to justify this answer ("PCA should only be used on interval type variables") in light of the effectiveness of PCA even on binary variables. – whuber Feb 03 '19 at 14:26
  • Effectiveness in what sense? My experiences with binary variables and PCA are not all positive... – Has QUIT--Anony-Mousse Feb 04 '19 at 06:22
  • See https://stats.stackexchange.com/search?q=pca+binary+score%3A1, such as https://stats.stackexchange.com/questions/159705 and https://stats.stackexchange.com/questions/16331. – whuber Feb 04 '19 at 14:17
  • Well, neither seems to claim that PCA is particularly "effective" on binary variables, but they suggest looking into alternative approaches... – Has QUIT--Anony-Mousse Feb 04 '19 at 18:57
  • @Anony-Mousse: Thanks! – Emma Feb 04 '19 at 20:53