
I want to fit a regularized (lasso) logistic regression model, using nested cross-validation to choose the best lambda and to obtain an estimate of internal validity. Specifically, I have a binary outcome (event takes place: yes or no) and a large number of predictor variables, and I would like to predict the outcome. My problem is that my cases are twins: pairs of twins respond similarly and score similarly on the predictor variables, so the data are dependent. With an ordinary logistic regression I would use a random-effects model to control for the dependency, but this is not possible with a regularized logistic regression. My question is: what is the best way to perform model selection and performance assessment? How should I handle the dependency during (nested) cross-validation?

Ferdi
Daniel
  • Can you please give details of your model? – kjetil b halvorsen Jul 09 '18 at 13:52
  • It's a simple logistic regression with an event as outcome (yes or no) and a number of predictors. The problem is that I have 100 twin pairs and the data are dependent because twins will have similar predictor and outcome values. – Daniel Jul 09 '18 at 21:11
  • 1
    Can you please add this new information to the original post? Few people read comments ... and, if you can post (or give alink to) (part of) the data, we can experiment ... My first thought is a mixed logistic model with a random intercept for each twin pair. But there are other possibilities . – kjetil b halvorsen Jul 09 '18 at 21:46
  • See similar questions:
    https://stats.stackexchange.com/questions/224701/random-effects-vs-fixed-effects-in-twin-studies
    https://stats.stackexchange.com/questions/58806/mixed-effects-model-for-mz-twin-data-avoiding-overparametrization
    https://stats.stackexchange.com/questions/321748/does-the-mann-whitney-u-test-need-to-be-adjusted-in-twin-studies
    https://stats.stackexchange.com/questions/60490/choice-of-path-weights-in-sem-conceptual-models-for-identical-fraternal-twins
    and search this site for "twin" – kjetil b halvorsen Jul 10 '18 at 09:55
  • Thanks for your clarification of my problem. Your suggestion of splitting pairs into training and test samples makes sense. I hope some other experts will provide their input. – Daniel Jul 11 '18 at 12:25
  • 2
    I don't think that splitting the pairs so one member go to test set, other to trainset is a good idea. That makes impossible estimation of dependency between pair members, and also its testing. Whole pairs should go to either test or train. – kjetil b halvorsen Jul 11 '18 at 13:31
  • Kjetil, you are right. I am sorry, but I did not explain my answer very well; I meant exactly the same as you wrote. Thanks for spotting it. I will edit my answer to avoid any confusion! – Stats_Monkey Jul 12 '18 at 08:54
  • 1
    Correction: Random effects models are not yet fully developed for regularized methods. The only paper I am aware of is "Groll, A. and G. Tutz (2014). Variable selection for generalized linear mixed models by L1-penalized estimation. Statistics and Computing 24(2), 137–154". Regarding cross-validation: Perhaps: Keep pairs of twins together when you split your data into training and test sets for cross-validation. This avoids that information of a twin in the training data set is used to predict the outcome of the very similar sibling in the test data set. – Stats_Monkey Jul 12 '18 at 09:01
  • Comment: I removed my original answer because it was misleading and I could not edit it anymore – Stats_Monkey Jul 20 '18 at 15:40
  • 1
    Regarding the question about how to split the data into a training and test set, the following reference might be of use: Roberts, D. R., Bahn, V. , Ciuti, S. , Boyce, M. S., Elith, J. , Guillera‐Arroita, G. , Hauenstein, S. , Lahoz‐Monfort, J. J., Schröder, B. , Thuiller, W. , Warton, D. I., Wintle, B. A., Hartig, F. and Dormann, C. F. (2017), Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. *Ecography*, 40: 913-929. doi:10.1111/ecog.02881 – Phil Oct 29 '18 at 09:07
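The grouped-splitting idea from the comments (whole twin pairs go to either the training or the test fold, in both the inner and the outer loop) can be sketched in Python with scikit-learn's `GroupKFold`. This is a minimal illustration under assumptions not stated in the thread: the data below are simulated, and the lambda grid, fold counts, and AUC scoring are arbitrary choices, not the asker's actual setup.

```python
# Sketch of pair-respecting nested cross-validation for an
# L1-penalized logistic regression, assuming scikit-learn.
# Data are simulated for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)
n_pairs = 100
X = rng.normal(size=(2 * n_pairs, 50))     # 200 twins, 50 predictors
y = rng.integers(0, 2, size=2 * n_pairs)   # binary outcome (yes/no)
groups = np.repeat(np.arange(n_pairs), 2)  # pair ID shared by both twins

# Lasso-type logistic regression; in scikit-learn, C = 1 / lambda.
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear",
                                 max_iter=1000)
param_grid = {"C": np.logspace(-2, 2, 10)}

# GroupKFold keeps both members of a pair in the same fold, so no twin
# in a training set has a sibling in the corresponding test set.
inner_cv = GroupKFold(n_splits=5)  # inner loop: choose lambda
outer_cv = GroupKFold(n_splits=5)  # outer loop: assess performance

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups):
    search = GridSearchCV(lasso_logit, param_grid,
                          cv=inner_cv, scoring="roc_auc")
    # Pass the training-fold pair IDs so the inner split is also grouped.
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    prob = search.predict_proba(X[test_idx])[:, 1]
    outer_scores.append(roc_auc_score(y[test_idx], prob))

print(f"outer-fold AUC: {np.mean(outer_scores):.2f}"
      f" +/- {np.std(outer_scores):.2f}")
```

Note that this only prevents leakage between folds; it does not model the within-pair correlation itself, which is what the mixed-model approaches mentioned above (e.g. Groll and Tutz 2014) address.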

0 Answers