
I have a dataset with 10 variables (continuous, log-transformed) that I am using to predict a 3-class outcome.

I selected those 10 variables with a random forest (RF) model, after first splitting the data 60% (train) / 40% (validate). Now I want to plot ROC curves to further validate that my 10 variables are good (a sort of significance check based on the AUC).

I see many papers using multi-variable ROC curves for classification, but I am unable to find an exact description of how this is done. I am using R to run the code, as I don't have SAS.

One way I found is to fit a generalized linear model (glm) on those 10 variables with the 'binomial' family. Based on posts I read on this forum, I converted my 3-class problem into three separate 2-class problems (A vs ALL, B vs ALL, C vs ALL).

Then I used the fitted values from the glm as 'predictions' and the class as 'labels', and plotted the ROC curve using the basic prediction and performance commands.
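
To make the one-vs-all step concrete, here is a minimal sketch of how an ROC curve and its AUC are built from a single column of fitted values. It is written in Python rather than R purely for illustration. The labels and scores are made-up toy values (not my data), and `roc_points` and `auc` are hypothetical helper names, not functions from any package.

```python
# Toy data, invented purely for illustration.
labels = ["A", "A", "B", "C", "B", "A", "C", "C", "B", "A"]
# Hypothetical fitted values from an "A vs ALL" binomial model.
score_a = [0.9, 0.8, 0.3, 0.2, 0.6, 0.7, 0.1, 0.4, 0.35, 0.55]


def roc_points(scores, is_positive):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    points = [(0.0, 0.0)]
    # One threshold per distinct score, from the strictest to the loosest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, is_positive) if s >= thr and p)
        fp = sum(1 for s, p in zip(scores, is_positive) if s >= thr and not p)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Recode the 3-class label as a binary "A vs ALL" outcome.
is_a = [lab == "A" for lab in labels]
points_a = roc_points(score_a, is_a)
auc_a = auc(points_a)
print(f"AUC (A vs ALL): {auc_a:.3f}")
```

The same two steps would be repeated with the B vs ALL and C vs ALL fitted values, giving three curves and three AUC values.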

I want to know the following:

1) Is there any other method to plot an ROC curve with multiple variables, or is using the glm fitted values correct?

2) Should I also run the glm + ROC curve on the 60%/40% train/test split, or is running it on the combined data OK? Can I get some help with how to approach this algorithmically?

3) I see some methods where the ROC curve is used to find a cut-off value for the combined variable values. I didn't understand that part. How do I obtain that cut-off?

4) Finally, I used the same data on which I performed the RF training and testing. Is it wrong to do that?
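
For reference on questions 2 and 3: one common way to derive a cut-off from an ROC curve is Youden's J (sensitivity + specificity - 1), choosing the threshold that maximizes it on the training scores and then applying that fixed threshold to the held-out scores. Below is a minimal sketch of that idea, in Python rather than R and for illustration only. All scores and labels are made-up toy values, and `youden_cutoff` is a hypothetical helper name.

```python
def youden_cutoff(scores, is_positive):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    best_thr, best_j = None, -1.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, p in zip(scores, is_positive) if s >= thr and p)
        fp = sum(1 for s, p in zip(scores, is_positive) if s >= thr and not p)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Toy "A vs ALL" fitted values on the training split (invented for illustration).
train_scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.7, 0.1, 0.4, 0.35, 0.55]
train_is_a = [True, True, False, False, False, True, False, False, False, True]

# Choose the cut-off on the training split only.
cutoff, j = youden_cutoff(train_scores, train_is_a)

# Toy held-out scores: the cut-off is applied as-is, never re-tuned here.
test_scores = [0.75, 0.6, 0.65, 0.15, 0.3]
test_is_a = [True, False, True, False, False]
test_pred = [s >= cutoff for s in test_scores]
accuracy = sum(p == t for p, t in zip(test_pred, test_is_a)) / len(test_is_a)

print(f"cut-off = {cutoff}, J = {j:.3f}, held-out accuracy = {accuracy:.2f}")
```

Keeping the threshold search on the training split and the accuracy check on the held-out split is what separates a fitted cut-off from an evaluated one.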

I have little to no background in advanced statistics, so I would appreciate any help.

Thanks!

Piyush Joshi
  • If you generalize your problem by removing the domain-specific information, you're more likely to get a solution. There are far more stats experts on this platform than genetics experts... And try to be very specific about what exactly you want. – tired and bored dev Feb 16 '19 at 01:13
  • @tiredandboreddev thanks for your suggestion. I hope the post is more general now. – Piyush Joshi Feb 16 '19 at 17:32
  • Is there a natural ordering of your 3 classes, such as normal/carcinoma-in-situ/invasive in cancer? Or are they simply 3 separate classes? – EdM Feb 16 '19 at 17:39
  • Three molecular groups of cancer. So I am looking for the smallest number of molecular markers that still gives an accurate diagnosis. – Piyush Joshi Feb 16 '19 at 18:20
  • These links will help you: https://stats.stackexchange.com/questions/2151/how-to-plot-roc-curves-in-multiclass-classification https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html https://stats.stackexchange.com/questions/21551/how-to-compute-precision-recall-for-multiclass-multilabel-classification You should first study the basic ideas of machine learning. These forums can't substitute for a good book or tutorial; they can only fill gaps in knowledge. – tired and bored dev Feb 17 '19 at 18:40
  • Do you really need to use GLM, or would any other method be okay? Why only ROC? What about other machine learning metrics? – tired and bored dev Feb 17 '19 at 18:48
  • Hi @tiredandboreddev. I have already used random forest. Now I just want to use ROC for some test of significance, like the AUC value. With regard to GLM, I found only one method, which uses GLM, so that's why I am asking whether there is any other method to combine 10 variables into a single prediction value that I can use for ROC-based validation. Thanks for the links. I am looking for algorithms and reading related papers, but as I am also trying to finish this project, I asked for some advice. – Piyush Joshi Feb 17 '19 at 19:24

0 Answers