Splitting a small data set

Question

I've been asked to build a GLM from 89 known specimens to predict the group membership of the remaining 199 specimens. I have 288 specimens in total, and the response variable has three levels. I need to use cross-validation to estimate the accuracy of my model.

Sample data

    Group           M1      M2      Fora    Phone   len     height  Rost
1   multiplex       2078    1649    1708    3868    5463    2355    805
2   subterraneus    1749    1482    1462    3797    4855    2218    765 
3   unknown         1841    1562    1585    3750    5024    2232    821

I'm building a logistic model to predict unknown from the other two groups.

My understanding is that I need to split the data into two sets: one of 89 rows to train the model, and another of 199 rows to test it and predict the unknowns. The flaw in this is that my training set is so small and my testing set is larger.

My question is: can I include the unknown rows in my model and then predict them, or should I use only the multiplex and subterraneus portion of the data in my logistic model?
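For illustration, the second option (fitting only on the labelled rows and holding the unknown rows back for prediction) amounts to a split like the one below. This is a minimal Python sketch that uses only the three sample rows shown above; the names `rows`, `train`, and `to_predict` are made up for the example.

```python
# Column names and values are copied from the sample data above.
columns = ["Group", "M1", "M2", "Fora", "Phone", "len", "height", "Rost"]
rows = [
    ("multiplex", 2078, 1649, 1708, 3868, 5463, 2355, 805),
    ("subterraneus", 1749, 1482, 1462, 3797, 4855, 2218, 765),
    ("unknown", 1841, 1562, 1585, 3750, 5024, 2232, 821),
]

# Rows with a known label are the only ones used to fit the model.
train = [r for r in rows if r[0] != "unknown"]
# Rows labelled "unknown" are held back and only receive predictions.
to_predict = [r for r in rows if r[0] == "unknown"]

# Separate the response (Group) from the seven measurements.
y_train = [r[0] for r in train]
X_train = [r[1:] for r in train]
X_new = [r[1:] for r in to_predict]

print(len(train), "labelled rows;", len(to_predict), "rows to predict")
```

With the full data this would give 89 labelled rows and 199 rows to predict.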

I don't think including "unknowns" in the data is a good idea. People do it, but if the outcome is unknown you are likely biasing results and diluting performance. If you're confident there are only two outcomes, leave the unknowns to be predicted later and train only on known instances. — Demetri Pananos, Dec 01 '20 at 14:02
I don't follow this. What is the response variable? What are the three levels? Are they `multiplex`, `subterraneus`, & `unknown`? It sounds like you either need ordinal logistic regression or multinomial LR. — gung - Reinstate Monica, Dec 01 '20 at 14:09
The response variable is "Group". Yes, the levels of the response variable are multiplex, subterraneus, & unknown. We want to predict the 199 unknown observations and see whether they are multiplex or subterraneus. I'm confused about how to split my data. If I exclude the 199 observations that will be predicted, I will end up with 89 observations to train the model. @gung-ReinstateMonica — Mo.Muse, Dec 01 '20 at 14:15
Is `unknown` a third group, or are those species where you don't know if they're `multiplex` or `subterraneus`? — gung - Reinstate Monica, Dec 01 '20 at 16:45
Yes, "unknown" is the third group and it is the group we want to predict. — Mo.Muse, Dec 01 '20 at 16:58
If you have the 3 `Groups` (multinomial model) defined for all 288 specimens, this is a very small number of cases to subject to a single train/test split. See [this page](https://stats.stackexchange.com/q/50609/28500) and its links, for example. Model performance would be better estimated with repeated cross-validation or bootstrapping. On the other hand, if you only have `Group` data on 89 specimens, then building the model on those 89 and applying it to the other 199 would make sense. In that scenario, however, you would already know which 89 to build the model from. — EdM, Dec 01 '20 at 18:16
@EdM The second scenario applies to my data. Your explanation makes sense. Thank you — Mo.Muse, Dec 01 '20 at 18:38
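
The workflow suggested in the comments (estimate accuracy by repeated cross-validation on the 89 labelled specimens, then refit on all 89 and apply the model to the 199 unknown ones) could be sketched roughly as follows. Everything here is an assumption made for illustration: the data are synthetic, only two made-up measurements are used, and the small gradient-descent logistic regression is a plain-Python stand-in for whatever GLM routine is actually used.

```python
import math
import random

def stratified_folds(y, k, rng):
    """Assign each index to one of k folds, keeping class proportions similar."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def fit_logistic(X, y, epochs=200, lr=0.1):
    """Gradient-descent logistic regression; returns a 0/1 predict function."""
    n, p = len(X), len(X[0])
    # Standardise with training-set statistics only, to avoid leakage.
    mean = [sum(r[j] for r in X) / n for j in range(p)]
    sd = [math.sqrt(sum((r[j] - mean[j]) ** 2 for r in X) / n) or 1.0
          for j in range(p)]
    Z = [[(r[j] - mean[j]) / sd[j] for j in range(p)] for r in X]
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for z, t in zip(Z, y):
            lin = b + sum(wi * zi for wi, zi in zip(w, z))
            err = 1.0 / (1.0 + math.exp(-lin)) - t  # predicted prob minus label
            gb += err
            for j in range(p):
                gw[j] += err * z[j]
        b -= lr * gb / n
        w = [wi - lr * g / n for wi, g in zip(w, gw)]

    def predict(row):
        z = [(row[j] - mean[j]) / sd[j] for j in range(p)]
        return 1 if b + sum(wi * zi for wi, zi in zip(w, z)) > 0 else 0

    return predict

def repeated_cv_accuracy(X, y, k=5, repeats=10, seed=1):
    """Mean held-out accuracy over `repeats` runs of stratified k-fold CV."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            predict = fit_logistic([X[i] for i in train_idx],
                                   [y[i] for i in train_idx])
            hits = sum(predict(X[i]) == y[i] for i in test_idx)
            scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Synthetic stand-in for the 89 labelled specimens
# (0 = multiplex, 1 = subterraneus); the values are NOT real measurements.
data_rng = random.Random(0)
y = [0] * 45 + [1] * 44
X = [[data_rng.gauss(2000 - 300 * t, 100), data_rng.gauss(1600 - 150 * t, 80)]
     for t in y]

accuracy = repeated_cv_accuracy(X, y)
print(f"mean accuracy over 10 x 5-fold CV: {accuracy:.3f}")

# After the accuracy estimate, refit on all labelled rows; this final model
# is the one that would be applied to the unknown specimens.
final_predict = fit_logistic(X, y)
```

The unknown rows never enter the cross-validation loop. They are only passed to `final_predict` once the accuracy estimate has been obtained from the labelled rows.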
